5.local community detection algorithm based on minimal cluster

Venkat Java Projects
Mobile:+91 9966499110 Visit:www.venkatjavaprojects.com
Email:venkatjavaprojects@gmail.com
1
A Project Thesis on
Local Cluster-Based Community Detection Algorithm
Project work submitted in partial fulfillment of the requirements for the award of
the degree of
MASTER OF COMPUTER APPLICATIONS
in
COMPUTER SCIENCE AND ENGINEERING
Submittedby
Regalla Sairam Reddy
[Roll No.17021F0005]
Under the supervision of
DR.M.H.M KRISHNA PRASAD
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIVERSITY COLLEGE OF ENGINEERING KAKINADA (A)
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY KAKINADA
KAKINADA-533003, A.P, INDIA
2017-2020

2
KAKINADA-533003, A.P, INDIA 2017-2020
DECLARATION BY THE STUDENT
I hereby declare that the work described in this thesis, entitled “A
LiteratureStudy and Local Cluster-Based Community Detection Algorithm”,
which is being submitted by me in partial fulfillment of the requirements for the
award of degree of Master Of Computer Applications in the department of
Computer Science & Engineering to the University College of Engineering
(Autonomous),Jawaharlal Nehru Technological University, Kakinada – 533003,
A.P., is the result of investigations carried out by me under the guidance of
Dr.M.H.M Krishna Prasad, Assistant professor of Computer Science and
Engineering, University College of Engineering, Jawaharlal Nehru Technological
University, Kakinada-533003.
The work is original and has not been submitted to any other University or
Institute for the award of any degree or diploma.
Place: Kakinada Signature:

3
Date: Name: Regalla Sairam Reddy
Roll No:17021F0005
KAKINADA-533003, A.P, INDIA 2017-2020
CERTIFICATE FROM THE SUPERVISOR
This is to certify that the thesis entitled “A Literature Study and Local Cluster-
Based Community Detection Algorithm”, that is being submitted by Regalla
SairamReddy, Roll No.17021F0005, in partial fulfillment of the requirements for
the award of degree of Master of Computer Applications in the Department of
Computer Science &Engineering to the University College of Engineering
(Autonomous), Jawaharlal Nehru Technological University Kakinada is a record of
bona fide project work carried out by him under my guidance and supervision.
Signature of the Supervisor
Dr.M.H.M Krishna Prasad

4
KAKINADA-533003, A.P, INDIA2017-2020
CERTIFICATE FROM THE HEAD OF THE DEPARTMENT
This is to certify that the thesis entitled “A Literature Study and Local Cluster-
Based Community Detection Algorithm”, that is being submitted by Regalla
Sairam Reddy, Roll No.17021F0005, in partial fulfillment of the requirements for
the award of degree of Master of Computer Applications in the Department of
Computer Science &Engineering to the University College of Engineering
(Autonomous), Jawaharlal Nehru Technological University Kakinada is a record of
bona fide project work carried out by him at our department.
Signature of Head of the Department
Dr. D. Haritha

5

6
ACKNOWLEDGEMENTS
The successful completion of any task is not possible without proper
suggestion, guidance and environment. Combination of these three factors acts
like backbone to my “A Literature Study and Local Cluster-Based Community
Detection Algorithm” project.
I wish to express my sincere gratitude to Dr. M.H.M. Krishna Prasad,
Professor, Department of CSE, University College of Engineering, Jawaharlal
Nehru Technological University Kakinada for giving me his guidance, immense
support and valuable suggestions during my project.
I would like to express my grateful thanks to Dr. D. Haritha, Professor and
Head of the Department of CSE, University College of Engineering, Jawaharlal
Nehru Technological University Kakinada whose immense support encouraged
completing the project successfully.
My Sincere thanks to all the teaching and non-teaching staff of Department
of Computer Science and Engineering for their support throughout my project
work.
Finally, I would like to thank my family and friends for their support and
motivation, which helped me to complete the project successfully.
Regalla Sairam Reddy
Roll No: 17021F0005,
Master of Computer Applications,
University College of Engineering Kakinada,
Jawaharlal Nehru Technological University,
Kakinada – 533003,
East Godavari District,A.P.

7
CONTENT
Acknowledgement 5
Abstract 6
Contents 7
List of Tables 8
List of Figures 9
CHAPTER -1 INTRODUCTION 1-23
1.1 What Is Software 1
1.2What Is Software Development Life Cycle
Bug Prediction 2
CHAPTER-2
LITERATURE REVIEW
2.1 Related Work5
CHAPTER 3
PROBLEM IDENTIFICATION AND OBJECTIVE
3.1 Problem Statement 8
3.2 Project Objective 8
CHAPTER 4
METHODOLOGY
4.1 Methodology 9
4.1.1using Python Tool On Standalone Machine Learning
Environment 9
4.2 Data Description 10

8
4.3 Evaluation Criteria Used For Classification 11
4.3.1 Confusion Matrix
12
4.3.2 Accuracy And Precision
12
4.3.3 Recall And F-Square
13
4.3.4 Sensitivity, Specificity And Roc 13
4.3.5 Significance And Analysis Of Ensemble Method In
MachineLearning 14
4.4 Uml Diagrams 15
4.4.1 Use Case Diagram 16
4.4.2 State Diagram 17
CHAPTER 5
OVERVIEW OF TECHNOLOGIES
5.1AlgorithmsUsed 17
5.1.1 Decision Tree Induction 17
5.1.2 Naïve Bayes 21
5.1.3 Artificial Neuralnetwork 22
5.1.4 Support Vector Machine Model 25
5.1.5 Kernal Functions 28
5.2 Tensorflow 28
CHAPTER 6
IMPLEMENTATION AND RESULTS
6.1 Framework Design 31
6.2 Coding And Testing 34

9
CHAPTER 7
CONCLUSION40
Reference 42
LIST OF TABLES:
Table No Table Name Page No
1 Cluster based community
detection
27
2 Generic algorithms based
community detection
30
3 Label propagation based
community detection
31
4 Clique based methods for
overlapping community detection
35

10
LIST OF FIGURES:
Figure No Figure Name Page No
5.1.1 Decision Tree induction 17
5.1.2 Naïve Bayes Algorithm 21
5.1.3 Artificial Neural Networks
Algorithm
23
5.1.4(i) Support Vector Machine
Hyperplane
25
5.1.4(ii) Support Vector Machine
Hyperplane
26
5.2 TensorFlow in UML 29
4.3.1 Use Case Diagram
4.3.2 State Diagram
4.1.1 Block Diagram Of Proposed
Work
4.2.5 Classification Model
6.1 Framework Design
6.2 Bar graph of
knn,svm,naviebayes

11

12

13

14

15
ABSTRACT
In order to discover the structure of local community more effectively,
this paper puts forward a new local community detection algorithm based on
minimal cluster. Most of the local community detection algorithms begin
from one node. The agglomeration ability of a single node must be less than
multiple nodes, so the beginning of the community extension of the
algorithm in this paper is no longer from the initial node only but from a
node cluster containing this initial node and nodes in the cluster are
relatively densely connected with each other. The algorithm mainly includes
two phases. First it detects the minimal cluster and then finds the local
community extended from the minimal cluster. Experimental results show
that the quality of the local community detected by our algorithm is much
better than other algorithms no matter in real networks or in simulated
networks.

16
CHAPTER 1
INTRODUCTION

17
Introduction to Community Detection:
Recently, many researchers notice that the complex network is a
proper tools to describe variety of complex system in the real world , and
thus the complex network has attracted the great attention in many fields
such as physics, biology and social network et al. In complex network field,
one of the important topology property is community structure which
comprise of densely connected nodes, and some researchers have found that
detecting community structure can reveal some valuable insights of the
functional feature in the complex system . For example, communities in
multimedia social network may imply people with the same hobby and trust
relationship. Zhiyong Zhang et al. proposed an approach to analyse and
detect credible potential path based on community in multimedia social
networks , the approach can effectively and accurately mine potential paths
of copyrighted digital content sharing. Zhiyong Zhang et al. also proposed a
trust model based on small world theory which shows the widely application

18
of community struction The community structure in biology field may
cluster proteins with the same function. So that many methods have been
proposed to reveal this topological property in complex networks.
Community detection on complex networks has been a hot research field.
Recently, a large number of algorithms for studying the global structure of
the network are proposed, such as the modularity optimization algorithms ,
the spectral clustering algorithms , the hierarchical clustering algorithms ,
and the label propagation algorithms . However, with the continuous
expansion of complex networks, it is easy to collect large network dataset
with millions of nodes. How to store such a large-scale dataset in computer
memory to analyze is a huge challenge for scholars. The calculation for
studying the overall structure of this kind of large-scale networks is
unimaginable. So local community detection becomes an appealing problem
and has drawn more and more attention The main task of local community
detection is to find a community using the local information of the network.
Local community detection has good extensibility. If the local community
detection algorithm is iteratively executed, more local communities can be
found and the whole community structure of the network can be obtained.
The time complexity of this kind of global community detection algorithm is
dependent on the efficiency and accuracy of local community detection
algorithms, so the research of local community detection algorithm still has
a long way to go. There are several problems that need to be solved in the
research of local community detection. First, we should determine the initial
state and find the initial node for local community detection, so as to
determine the needed local information; then, we need to select an objective
function, and through continuous iterative optimization of the objective
function we find the community structure with high quality; after that we
need to find a suitable node expansion method, so that the algorithm can
extract the local community from the initial state step by step; finally, in
order to terminate the algorithm, a suitable termination condition is needed
to determine the boundary of the community.

19
Most of local community detection algorithms are based on the above-
mentioned process. The definition of local community detection is to find the
local community structure from one or more nodes, but most of the existing
local community detection algorithms, including Clauset , LWP , and LS , are
starting from only one initial node. They greedily select the optimal nodes
from the candidate nodes and add them into the local community. LMD
algorithmextends not from the initial node but from its closest and next
closest local degree central nodes. It discovers a local community from each
of these nodes, respectively. It still starts from single node and discovers
many local communities for the initial node. In general, the aggregation
ability of a single node is lower than that of multiple nodes. So we do not
just rely on the initial node as the beginning of local community expansion.
Our primary goal is to find a minimal cluster closely connected to the initial
node and then detect local community based on the minimal cluster. This
can avoid instability because of the excessive dependence on the initial node.
In this paper, we introduce a local community detection algorithm based on
the minimal cluster—NewLCD. In this new algorithm, the beginning of
community expansion is no longer from the initial node only, but a cluster of
nodes relatively closely connected to the initial node. The algorithm mainly
consists of two parts: one is the detection of the minimal cluster, and the
other is the detection of the local community based on the minimal cluster.
At the same time, the algorithm can be applied to the global community
detection. After finding one local community using this algorithm, we can
repeat the process to obtain the global community structure of the whole
network.
Community Detection
The concept of community detection has emerged in network science
as a method for finding groups within complex systems through represented
on a graph. In contrast to more traditional decomposition methods which
seek a strict block diagonal or block triangular structure, community

20
detection methods find subnetworks with statistically significantly more
links between nodes in the same group than nodes in different groups
(Girvan and Newman, 2002). Central to community detection is the notion
of modularity, a metric that captures this difference:
(1)Q=12m∑i,jAij−kikj2mδgigj
Here, Q is the modularity, Aij is the edge weight between nodes i and j, ki is
the total weight of all edges connecting node i with all other nodes, and m is
the total weight of all edges in the graph. The Kronecker
delta function δ(gi,gj) will evaluate to one if nodes i and j belong to the same
group, and zero otherwise. Modularity is a property of how one decides to
partition a network: networks that are not partitioned and those that place
every node in its own community will both have modularity equal to zero.
The goal of community detection, then, is to find communities that maximize
modularity. Although modularity maximization is an NP-hard integer
program, many efficient algorithms exist to solve it approximately,
including spectral clustering (Newman, 2006) and fast unfolding (Blondel et
al., 2008). Figure 1 shows a few networks with increasing maximum
modularity; note how the community structure becomes increasingly
apparent as this value increases. Previous efforts in our group have applied
community detection to chemical plant networks by creating an equation
graph of the corresponding dynamic model (Moharir et al., 2017). By doing
so, communities of state variables, inputs, and outputs can be obtained
which are tightly interacting amongst themselves but weakly interacting with
other communities. As such, these communities can form the basis of
distributed control architectures (Jogwar and Daoutidis, 2017) which
typically perform better than other distributed control architectures that one
may obtain from “intuition” (Pourkargar et al., 2017). For a more
comprehensive review of the use of community detection in distributed
control, we refer the reader to Daoutidis et al., 2017. An alternative method
for finding communities for distributed model predictive control is to apply a
decomposition on the optimization problem as a whole (Tang et al., 2017).

21
Figure 1. Networks of different maximum modularity.
However, community structure can exist in all optimization problems, thus it
makes sense to extend this method to any generic optimization problem, and
this work proposes to do so. The advantage of using community detection to
find decompositions is that subproblems generated will have statistically
minimal interactions, through complicating variables or constraints, and
thus require minimal coordination through the decomposition solution
method. The proposed method is generic, applicable to any optimization
problem or decomposition solution approach, and scalable, using
computationally efficient graph theory algorithms.
Automatically identifying groups based on network
clustering algorithms:
NodeXL can automatically identify groups within a network based
solely on network structure. In contrast to the approach of using existing
data about the attributes as used in Section 7.2.3, this approach is based
solely on who is connected to whom. A number of different network
“clustering” (also known as “community detection”) algorithms exist, which
help find subgroups of highly inter-connected vertices within a network.
NodeXL includes three such algorithms: Clauset-Newman-Moore, Wakita-
Tsurumi [3], and Girvan-Newman (which can take a long time to run on
large graphs). In all of these algorithms, the number of clusters is not
predetermined; instead the algorithm dynamically determines the number it

22
thinks is best. Each vertex is assigned to exactly one cluster, meaning that
clusters do not overlap. The number of vertices in each cluster can vary
significantly. In some cases, a single cluster can encompass all vertices,
whereas in other cases, a cluster can consist of a single vertex. See
Newman [4] for background on some of these and other community
identifying algorithms.
There is no “right” or “wrong” algorithm to use; instead, it is often useful to
try out different ones and see which ones you believe provide the best results
given your network. For example, in this network, the Clauset-Newman-
Moore algorithm results in fewer, larger groups than the other algorithms,
which provide more groups of a smaller size. Try applying the Wakita-
Tsurumi clustering algorithm by clicking on the Groups dropdown menu in
the NodeXL ribbon and choosing Group by Cluster and the checking the
appropriate selector as shown in Figure 7.14. Notice that the data on the
Groups worksheet is now updated to reflect the new groups.

23

24
A social network for an individual is created with his/her interactions and
personal relationships with other members in the society. Social networks
represent and model the social ties among individuals. With the rapid
expansion of the web, there is a tremendous growth in online interaction of
the users. Many social networking sites, e.g., Facebook, Twitter etc. have
also come up to facilitate user interaction. As the number of interactions
have increased manifold, it is becoming difficult to keep track of these
communications. Human beings tend to get associated with people of similar
likings and tastes. The easy-to-use social media allows people to extend their
social life in unprecedented ways since it is difficult to meet friends in the
physical world, but much easier to find friends online with similar interests.
These real-world social networks have interesting patterns and properties
which may be analysed for numerous useful purposes. Social networks have
a characteristic property to exhibit a community structure. If the vertices of
the network can be partitioned into either disjoint or overlapping sets of
vertices such that the number of edges within a set exceeds the number of
edges between any two sets by some reasonable amount, we say that the
network displays a community structure. Networks disp+laying a community
structure may often exhibit a hierarchical community structure as well1 .
The process of discovering the cohesive groups or clusters in the network is
known as community detection. It forms one of the key tasks of Social
network analysis2 . The detection of communities in social networks can be
useful in many applications where group decisions are taken, e.g.,
multicasting a message of interest to a community instead of sending it to
each one in the group or recommending a set of products to a community.
The applications of community detection have been highlighted towards the
end of the article. State of the art in community detection research for social
networks is presented in this work. The paper begins with the basic concepts
of social networks and communities. Various methods for community
detection are categorised and discussed in the next section followed by list of
standard datasets used for analysis in community detection research along
with the links for download if available online. Some potential applications of

25
community detection in social networks are briefly described in the next
section. Discussion section argues the advantages of using a method with
respect to another, the kind of community structure they obtain, etc. and
the conclusion section concludes the paper. BASIC CONCEPTS SOCIAL
NETWORK A social network is depicted by social network graph 𝐺 consisting
of 𝑛 number of nodes denoting 𝑛 individuals or the participants in the
network. The connection between node 𝑖 and node 𝑗 is represented by the
edge 𝑒𝑖𝑗 of the graph. A directed or an undirected graph may illustrate these
connections between the participants of the network. The graph can be
represented by an adjacency matrix 𝐴 in which 𝐴𝑖𝑗 = 1 in case there is an
edge between 𝑖 and 𝑗 else 𝐴𝑖𝑗 = 0. Social networks follow the properties of
complex networks3,4 . Some real life examples1 of social networks include
friends based, telephone, email and collaboration networks. These networks
can be represented as graphs and it is feasible to study and analyse them to
find interesting patterns amongst the entities. These appealing prototypes
can be utilized in various useful applications. Community A community can
be defined as a group of entities closer to each other in comparison to other
entities of the dataset. Community is formed by individuals such that those
within a group interact with each other more frequently than with those
outside the group. The closeness between entities of a group can be
measured via similarity or distance measures between entities. McPherson et
al5 stated that “similarity breeds connection”. They discussed various social
factors which lead to similar behaviour or homophily in networks. The
communities in social networks are analogous to clusters in networks. An
individual represented by a node in graphs may not be part of just a
community or a group, it may be an element of many closely associated or
different groups existing in the network. For example a person may
concurrently belong to college, school, friends and family groups. All such
communities which have common nodes are called overlapping
communities. Identification and analysis of the community structure has
been done by many researchers applying methodologies from numerous
form of sciences. The quality of clustering in networks is normally judged by

26
clustering coefficient which is a measure of how much the vertices of a
network tend to cluster together. The global clustering coefficient6 and the
local clustering coefficient7 are two types of clustering coefficients discussed
in literature. Methods for grouping similar items Communities are those
parts of the graph which have denser connections inside and few
connections with the rest of the graph8 . The aim of unsupervised learning is
to group together similar objects without any prior knowledge about them. In
case of networks, the clustering problem refers to grouping of nodes
according to their similarity computed based on topological features and/or
other characteristics of the graph. Network partitioning and clustering are
two commonly used methods in literature to find the groups in the social
network graph. These methods are briefly described in the next subsections.
Graph partitioning Graph partitioning is the process of partitioning a graph
into a predefined number of smaller components with specific properties. A
common property to be minimized is called cut size. A cut is a partition of
the vertex set of a graph into two disjoint subsets and the size of the cut is
the number of edges between the components. A multicut is a set of edges
whose removal divides the graph into two or more components. It is
necessary to specify the number of components one wishes to get in case of
graph partitioning. The size of the components must also be specified, as
otherwise a likely but not meaningful solution would be to put the minimum
degree vertex into one component and the rest of the vertices into another.
Since the number of communities is usually not known in advance, graph
partitioning methods are not suitable to detect communities in such cases.
Clustering Clustering is the process of grouping a set of similar items
together in structures known as clusters. Clustering the social network
graph may give a lot of information about the underlying hidden attributes,
relationships and properties of the participants as well as the interactions
among them. Hierarchical clustering and partitioning method of clustering
are the commonly used clustering techniques used in literature. In
hierarchical clustering, a hierarchy of clusters is formed. The process of
hierarchy creation or levelling can be agglomerative or divisive. In

27
agglomerative clustering methods, a bottom-up approach to clustering is
followed. A particular node is clubbed or agglomerated with similar nodes to
form a cluster or a community. This aggregation is based on similarity. In
divisive clustering approaches, a large cluster is repeatedly divided into
smaller clusters. Partitioning methods begin with an initial partition amidst
the number of clusters pre-set and relocation of instances by moving them
across clusters, e.g., K-means clustering. An exhaustive evaluation of all
possible partitions is required to achieve global optimality in partitioned-
based clustering. This is time consuming and sometimes infeasible, hence
researchers use greedy heuristics for iterative optimization in partitioning
methods of clustering. The next section categorizes and discusses major
algorithms for community detection. ALGORITHMS FOR COMMUNITY
DETECTION A number of community detection algorithms and methods
have been proposed and deployed for the identification of communities in
literature. There have also been modifications and revisions to many
methods and algorithms already proposed. A comprehensive survey of
community detection in graphs has been done by Fortunato8 in the year
2010. Other reviews available in literature are by Coscia et al9 in 2011,
Fortunato et al10 in 2012, Porter et al11 in 2009, Danon et al12 in 2005,
and Plantié et al13 in 2013. The presented work reviews the algorithms
available till 2015 to the best of our knowledge including the algorithms
given in the earlier surveys. Papers based on new approaches and
techniques like big data, not discussed by previous authors have been
incorporated in our article.The algorithms for community detection are
categorized into approaches based on graph partitioning, clustering, genetic
algorithms, label propagation along with methods for overlapping community
detection (clique based and non-clique based methods), and community
detection for dynamic networks. Algorithms under each of these categories
are described below. Graph partitioning based community detection Graph
partitioning based methods have been used in literature to divide the graph
into components such that there are few connections between components.
The Kernighan-Line14 algorithm for graph partitioning was amongst the

28
earliest techniques to divide a graph. It partitions the nodes of the graph
with cost on edges into subsets of given sizes so as to minimize the sum of
costs on all edges cut. A major disadvantage of this algorithm however is
that the number of groups have to be predefined. The algorithm however is
quite fast with a worst case running time of 𝑂(𝑛 2 ). Newman15 reduces the
widely-studied maximum likelihood method for community detection to a
search through a group of candidate solutions, each of which is itself a
solution to a minimum cut graph partitioning problem. The paper shows
that the two most essential community inference methods based on the
stochastic block model or its degree-corrected variant16 can be mapped onto
versions of the familiar minimum-cut graph partitioning problem. This has
been illustrated by adapting Laplacian spectral partitioning method17, 18 to
perform community inference. Clustering based community detection The
main concern of community detection is to detect clusters, groups or
cohesive subgroups. The basis of a large number of community detection
algorithms is clustering. Amongst the innovators of community detection
methods, Girvan and Newman19 had a main role. They proposed a divisive
algorithm based on edge-betweenness for a graph with undirected and
unweighted edges. The algorithm focused on edges that are most “between”
the communities and communities are constructed progressively by
removing these edges from the original graph. Three different measures for
calculation of edge-betweenness in vertices of a graph were proposed in
Newman and Girvan20 . The worst-case time complexity of the edge
betweenness algorithm is 𝑂(𝑚2𝑛) and is 𝑂(𝑛 3 ) for sparse graphs, where m
denotes the number of edges and n is the number of vertices. The Girvan
Newman (GN) algorithm has been enhanced by many authors and applied to
various networks21-28. Chen et al22 extended GN algorithm to partition
weighted graphs and used it to identify functional modules in the yeast
proteome network. Rattigan et al21 proposed the indexing methods to
reduce the computational complexity of the GN algorithm significantly.
Pinney et al24 also build an algorithm which uses GN algorithm for the
decomposition of networks based on graph theoretical concept of

29
betweenness centrality. Their paper inspected utility of betweenness
centrality to decompose such networks in diverse ways. Radicchi29 et al also
proposed an algorithm based on GN algorithm introducing a new definition
of community. They defined ‘strong’ and ‘weak’ communities. The algorithm
uses an edge clustering coefficient to perform the divisive edge removal step
of GN and has a running time of 𝑂( 𝑚4 𝑛2 ) and 𝑂(𝑛 2 ) for sparse graphs.
Moon et al30 have proposed and implemented the parallel version of the GN
algorithm to handle large scale data. They have used MapReduce model
(Apache Hadoop) and GraphChi. Newman and Girvan first defined a
measure known as ‘modularity’ to judge the quality of partitions or
communities formed20 . The modularity measure proposed by them has
been widely accepted and used by researchers to gauge the goodness of the
modules obtained from the community detection algorithms with high
modularity corresponding to a better community structure. Modularity was
defined as ∑𝒊𝒆𝒊𝒊 − 𝒂𝒊𝟐 , where 𝒆𝒊𝒊 denotes fraction of the edges that connect
vertices in community 𝒊, 𝒆𝒊𝒋 denotes fraction of the edges connecting vertices
in two different communities 𝒊 and 𝒋 while 𝒂𝒊 = ∑𝒋𝒆𝒊𝒋 is the fraction of edges
that connect to vertices in community 𝒊. The value 𝑄 = 1 indicates a network
with strong community structure. The optimization of modularity function
has received great attention in literature. The table 1 lists clustering based
community detection methods, including algorithms which use modularity
and modularity optimization. Newman31 has worked to maximize
modularity so that the process of aggregating nodes to form communities
leads to maximum modularity gain. This change in modularity upon joining
two communities defined as ∆𝑄 = 𝑒𝑖𝑗 + 𝑒𝑗𝑖 − 2𝑎𝑖𝑎𝑗 = 2(𝑒𝑖𝑗−𝑎𝑖𝑎𝑗) can be
calculated in constant time and hence is faster to execute in comparison to
the GN algorithm. The run time of the algorithm is 𝑂(𝑛 2 ) for sparse graphs
and 𝑂((𝑚 + 𝑛)𝑛) for others. In a recent work, a scalable version of this
algorithm has been implemented using MapReduce by Chen et al32.
Newman33 generalized the betweenness algorithm for weighted networks.
The modularity was now represented as 𝑄 = 1 2𝑚 ∑𝑖𝑗[𝐴𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚 ]δ(𝑐𝑖 , 𝑐𝑗)
where 𝑚 = 1 2 ∑𝑖𝑗𝐴𝑖𝑗 represents the number of edges between communities 𝑐𝑖

30
and 𝑐𝑗 in the graph, while 𝑘𝑖 , 𝑘𝑗 are degrees of vertices 𝑖 and 𝑗 while 𝛿(𝑢, 𝑣) is
1 if 𝑢 = 𝑣 and 0 otherwise. Newman34 in yet another approach characterised
the modularity matrix in terms of eigenvectors. The equation for modularity
was changed to 𝑄 = 1 4𝑚𝑠𝑇𝐵𝑠 , where the modularity matrix was given as 𝐵𝑖𝑗
= 𝐴𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚 and modularity was defined using eigenvectors of the
modularity matrix. The algorithm runs in 𝑂(𝑛 2 𝑙𝑜𝑔𝑛) time, where 𝑙𝑜𝑔𝑛
represents the average depth of the dendrogram .

31
Clauset et al35 used greedy optimization of modularity to detect
communities for large networks. For a network structure with m edges and n
vertices, the algorithm has a running time of (𝑚𝑑𝑙𝑜𝑔𝑛) , where ‘d’ denotes the
depth of the dendrogram. For sparse real world networks the running time
is(𝑛𝑙𝑜𝑔2n) . Blondel et al36 designed an iterative two phase algorithm known
as Louvain method. In first phase, all nodes are placed into different
communities and then the modularity gain of moving a node 𝑖 from one
community to another is found. In case this modularity gain is positive, the

32
node is shifted to a new community. In second phase all the communities
found in earlier phase are treated as nodes and the weight of links is found.
The algorithm improves the time complexity of the GN algorithm. It has a
linear run time of 𝑂(𝑚). Guimera et al37 used simulated annealing for
modularity optimization and showed that computing the modularity of a
network is similar to determining the ground-state energy of a spin system.
Additionally, the authors showed that the stochastic network models give
rise to modular networks due to fluctuations. Zhou et al38 attempted to
improve modularity using simulated annealing introducing the idea of ‘inter
edges’ and ‘intra edges’. The authors modified the modularity equation to
include inter and intra edges as 𝑄 = 1 2𝑚 ∑ [(𝐴𝑖𝑗𝑛𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚 )𝛿(𝐶𝑖 , 𝐶𝑗) − 𝛽
(𝐴𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚 ) 𝛼 (1 − 𝛿(𝐶𝑖 , 𝐶𝑗))] Intra factor Inter factor Here α and β are
undetermined parameters and affect the value of the inter-factor. The value
of β is increased and α is reduced when large communities are expected.
Duch et al39 proposed a heuristic search based approach for the
optimization of modularity function using extremal optimization technique,
which has a complexity of 𝑂(𝑛 2 𝑙𝑜𝑔2𝑛). AdClust method40 can extract
modules from complex networks with significant precision and strength.
Each node in the network is assumed to act as a self-directed agent
representing flocking behaviour. The vertices of the network travel towards
the desirable adjoining groups. Wahl and Sheppard41 proposed hierarchical
fuzzy spectral clustering based approach. They argued that determining the
sub-communities and their hierarchies are as important as determining
communities within a network. DENGRAPH42 algorithm uses the idea of
density-based incremental clustering of spatial data and is intended to work
for large dynamic datasets with noise. The Markov Clustering
Algorithm(MCL)43 is a graph flow simulation algorithm which can be used to
detect clusters in a graph and is analogous to detection of communities in
the networks. This algorithm consists of two alternate processes of
‘expansion’ and ‘inflation’. Markov chains are employed to perform random
walk through a graph. The method has a worst case run time of 𝑂(𝑛𝑘 2 )
where n represents the number of nodes and k is the number of resources.

33
Nikolaev et al44 used ‘entropy centrality measure’ based on Markovian
process to iteratively detect communities. A random walk through the nodes
is performed to find the communities existing in the network structure. For a
graph, the transition probability matrix for a Markov chain is created. A
locality t is selected and those edges for which the average entropy centrality
for the nodes over the graph is reduced are selected and removed. The
algorithm proposed by Steinhaeuser et al45 performs many short random
walks and interprets visited nodes during the same walk as similar nodes
which gives an indication that they belong to the same community. The
similar nodes are aggregated and community structure is created using
consensus clustering. It has a runtime of 𝑂(𝑛 2 𝑙𝑜𝑔𝑛).
Genetic algorithms (GA) based community detection :
Genetic algorithms (GA) are adaptive heuristic search algorithms whose aim
is to find the best solution under the given circumstances. A genetic
algorithm starts with a set of solutions known as chromosomes and fitness
function is calculated for these chromosomes. If a solution with a maximum
fitness is obtained, one stops else with some probability crossover and
mutation operators are applied to the current set of solutions to obtain the
new set of solutions. Community detection can be viewed as an optimization
problem in which an objective function that captures the intuition of a
community with better internal connectivity than external connectivity is
chosen to be optimized. GA have been applied to the process of community
discovery and analysis in a few recent research works. These are described
briefly in this section. Table 2 enlists the algorithms available in literature
for community detection based on GA.

34
Pizzuti46 proposed the GA-Net algorithm which uses a locus based graph
representation of the network. The nodes of the social network are depicted
by genes and alleles. The algorithm introduces and optimizes the community
score to measure the quality of partitioning. All the dense communities
present in the network structure are obtained at the end of the algorithm by
selectively exploring the search space, without the need to know in advance
the exact number of groups. Another GA based approach MOGA-Net47
proposed by the same author optimizes two objective functions i.e. the
community score and community fitness . The higher the community score,
the denser the clustering obtained. The community fitness is sum of fitness
of nodes belonging to a module. When this sum reaches its maximum, the
number of external links is minimized. MOGA-Net generates a set of
communities at different hierarchical levels in which solutions at deeper
levels, consisting of a higher number of modules, are contained in solutions
having a lower number of communities. Hafez et al48 have performed both
Single-Objective and Multi-Objective optimization for community detection
problem. The former optimization was done using roulette selection based
GA while NSGA-II algorithm was used for the latter process. Mazur et al49

35
have used modularity as the fitness function in addition to the community
score. The authors worked on undirected graphs and their algorithm can
also discover single node communities. Liu et al50 used GA in addition to
clustering to find the community structures in a network. The authors have
used a strategy of repeated divisions. The graph is initially divided into two
parts, then the subgraphs are further divided and a nested GA is applied to
them. Tasgin et al51 have also optimized the network modularity using GA.
A multi-cultural algorithm52 for community detection employs the fitness
function defined by Pizzuti46 in GA-Net. The belief space which is a state
space for the network and contains a set of individuals that have a better
fitness value has been used in this work to guide the search direction by
determining a range of possible states for individuals. A genetic algorithm for
the optimization of modularity, proposed by Nicosia et al53 and has been
explained in the overlapping communities section later.
Label propagation based community detection:
Label propagation in a network is the propagation of a label to various nodes
existing in the network. Each node attains the label possessed by a
maximum number of the neighbouring nodes. This section discusses some
label propagation based algorithms for discovering communities. Table 3
contains a listing of these algorithms, discussed in detail later in the section.

36
Label Propagation Algorithm(LPA) was proposed by Raghavan et al54 in
which initially each node tries to achieve a label from the maximum number
of labels possessed by its neighbours. The stopping criteria for the process
was also the same, i.e., when each node achieves a label, which a maximum
number of its neighbouring nodes have. Each iteration of the algorithm
takes 𝑂(𝑚) time where m is the number of edges. SLPA (speaker listener label
propagation algorithm)55 is an extension to LPA which could analyse
different kinds of communities such as disjoint communities, overlapping
communities and hierarchical communities in both unipartite and bipartite
networks. The algorithm has a linear run time of 𝑂(𝑇𝑚), where T is the user
defined maximum number of iterations and m is the number of edges. Based
on the SLPA algorithm, Hu56 proposed a Weighted Label Propagation
Algorithm (WLPA). It uses the similarity between any two of the vertices in a
network based on the labels of the vertices achieved in label propagation.
The similarity of these vertices is then used as a weight of the edge in label
propagation. LPA was further improved by Gregory57 in his algorithm
COPRA (Community Overlap Propagation Algorithm). It was the first label
propagation based procedure which could also detect overlapping
communities. The run time per iteration is 𝑂(𝑣𝑚𝑙𝑜𝑔(𝑣𝑚⁄𝑛), here n is the
number of nodes, m is the edges and v is the maximum number of

37
communities per vertex. LabelRank Algorithm58 uses the LPA and MCL
(Markov Clustering Algorithm). The node identifiers are used as labels. Each
node receives a number of labels from the neighbouring nodes. A community
is formed for nodes having the same highest probability label. Four operators
are applied namely propagation which propagates the label to neighbours,
inflation i.e. the inflation operator of the MCL algorithm, cut-off operator that
removes the labels below a threshold and an explicit conditional update
operator responsible for a conditional update. The algorithm runs in 𝑂(𝑚)
time where m is the number of edges. The LabelRank algorithm was modified
to LabelRankT algorithm by Xie et al60. This algorithm included both the
edge weights and the edge directions in the detection of communities. This
algorithm works for dynamic networks as well and is able to detect evolving
communities also. Wu et al59 proposed a Balanced Multi Label Propagation
Algorithm (BMPLA) for detection of overlapping communities. Using this
algorithm, vertices can belong to any number of communities without having
a global maximum limit on largest number of communities membership
required by COPRA57 . Each iteration of the algorithm takes 𝑂(𝑛𝑙𝑜𝑔𝑛) time to
execute, where n is the number of nodes. Semantics based community
detection Semantic content and edge relationships in a semantic network
may be additionally used to partition the nodes into communities. The
context, as well as the relationship of the nodes, both are taken into
consideration in the process of semantic community detection. LDA(Latent
Dirichlet Allocation)61 is used in several semantic community based
community detection approaches. A clustering algorithm based on the link-
field-topic (LFT) model is put forward by Xin et al62 to overcome the
limitation of defining the number of communities beforehand. The study
forms the semantic link weight (SLW) based on the investigation of LFT, to
evaluate the semantic weight of links for each sampling field. The proposed
clustering algorithm is based on the SLW which could separate the semantic
social network into clustering units. In another work63 the authors have
used ARTs model and divided the process into two phases namely LDA
sampling and community detection. In the former process multiple sampling

38
ARTs have been designed. A community clustering algorithm has also been
proposed. The procedure could detect the overlapping communities. Xia et
al64 constructed a semantic network using information from the comment
content extracted from the initial HTML source files. An average score is
obtained for two users for each link assuming comments to be implicit links
between people. An analytic method for taking out comment content is
proposed to build the semantic network for example, the terms and phrases
in data are counted in comments as supportive or opposing. Each phrase is
given an associated numerical trust value. On this semantic network, the
classical community detection algorithm is applied henceforth. Ding65 has
considered the impact of topological as well as topical elements in
community detection. Topology based approaches are based on the idea that
the real world networks can be modelled as graphs where the nodes depict
the entities whereas the interactions between them are shown by the edges
of the graph. On the other hand topic based community detection have a
basis that the more words two objects share, the more similar they are. The
author performs systematic analysis with topology-based and topic-based
community detection methodologies on the co-authorship networks. The
paper puts forward the argument that, to detect communities, one should
take into account together the topical and topological features of networks. A
community detection algorithm, SemTagP (Semantic Tag propagation) has
been proposed by Ereteo et al66 that takes yield of the semantic data
captured while organizing the RDF graphs of social networks. It basically is
an extension of the LPA54 algorithm to perform the semantic propagation of
tags. The algorithm detects and moreover labels communities using the tags
used by group during the social labelling process and the semantic
associations derived between tags. In a study by Zhao et al67, a topic
oriented approach consisting of an amalgam of social objects clustering and
link analysis has been used. Firstly a modified form of k means clustering
named as ‘Entropy Weighting K-Means (EWKM) algorithm’ has been used to
cluster the social objects. A subspace clustering algorithm is applied to
cluster all the social objects into topics. On the clusters obtained in this

39
process, topical community detection or link analysis is performed using a
modularity optimization algorithm. The members of the objects are
separated into topical clusters having unique topic. A link analysis is
performed on each topical cluster to discover the topical communities. The
end result of the entire method is topical communities. A community
extraction approach is given by Abdelbary et al68, which integrates the
content published within the social network with its semantic features.
Community discovery is performed using two layer generative Restricted
Boltzmann Machines model. The model presumes that members of a
community communicate over matters of common concern. The model
permits associate members to belong to multiple communities. Latent
semantic analysis (LSA)69 and Latent Dirichlet Allocation (LDA)61 are the
two techniques extensively employed in the process to detect topical
communities. Nyugen et al70 have used LDA to find hyper groups in the blog
content and then sentiment analysis is done to further find the meta-groups
in these units. A Link-Content model is proposed by Natarajan et al71 for
discovering topic based communities in social networks. Community has
been modelled as a distribution employing Gibbs sampling. This paper uses
links and content to extract communities in a content sharing network
Twitter. Methods to detect overlapping communities A recent survey by
Amelio et al gives a comprehensive review of major overlapping community
detection algorithms and includes the methods on dynamic networks. There
exists another review of methods for discover overlapping communities done
by Xie et al72. The following section discusses some of the methods to detect
overlapping communities. Tables 4 and 5 enlist the methods discussed in
this section. Clique based methods for overlapping community detection A
community can be interpreted as a union of smaller complete (fully
connected) subgraphs that share nodes. A k-clique is a fully connected
subgraph consisting of k nodes. A k-clique community can be defined as
union of all k-cliques that can be reached from each other through a series
of adjacent k-cliques. Many researchers have used cliques to detect

40
overlapping communities. Important contributions using cliques for
overlapping community detection are summarized in table 4.
The Clique Percolation Method (CPM) was proposed by Palla et al73 to detect
overlapping communities. The method first finds all cliques of the network
and uses the algorithm of Everett et al83 to identify communities by
component analysis of clique-clique overlap matrix. CPM has a runtime of
𝑂(exp(𝑛)). The Clique Percolation Method (CPM) method proposed by Palla et
al73 could not discover the hierarchical structure along with the overlapping
attribute. This limitation was overcome through method proposed by
Lancichinetti et al74 . It performs a local exploration in order to find the
community for each of the node. In this process the nodes may be revisited
any number of times. The main objective was to find local maxima based on
a fitness function. CFinder84 software was developed using CPM for
overlapping community detection. Du et al75 proposed ComTector
(Community DeTector) for detection of overlapping communities using
maximal cliques. Initially all maximal cliques in the network are found which
form the kernels of potential community. Then agglomerative technique is
iteratively used to add the vertices left to their closest kernels. The obtained
clusters are adjusted by merging pair of fractional communities in order to

41
optimize the modularity of the network. The running time of the algorithm is
(𝐶∗𝑇2 ), where the communities detected are denoted by C and T is the
number of triangles in the network. EAGLE, an agglomerative hierarchical
clustering based algorithm has been proposed by Shen et al 76. In the first
step maximal cliques are discovered and those smaller than a threshold are
discarded. Subordinate maximal cliques are neglected, remaining give the
initial communities (also the subordinate vertices). The similarity is found
between these communities, and communities are repeatedly merged
together on the basis of this similarity. This is repeated till one community
remains at the end. Evans et al77 proposed that by partitioning the links of
a network, the overlapping communities may be discovered. In an extension
to this work, Evans et al78 used weighted line graphs. In another work
Evans79 used clique graphs to detect the overlapping communities in real
world social networks. GCE (Greedy Clique Expansion)80 first identifies
cliques in a network. These cliques act as seeds for expansion along with the
greedy optimization of a fitness function. A community is created by
expanding the selected seed and performing its greedy optimization via the
fitness function proposed by Lancichinetti et al74 . CONGA (Cluster-Overlap
Newman Girvan Algorithm) was proposed by Gregory25. This method was
based on split- betweenness algorithm of Girvan-Newman. The runtime of
the method is 𝑂(𝑚3 ). In another work CONGO81 (CONGA Optimized)
algorithm was proposed which used local betweenness measure, leading to
an improved complexity 𝑂(𝑛𝑙𝑜𝑔𝑛). A two phase Peacock algorithm for
detection of overlapping communities is proposed in Gregory82 using
disjoint community detection approaches. In the first phase, the network
transformation was performed using the split betweenness concept proposed
earlier by the author. In the second phase, the transformed network is
processed by a disjoint community detection algorithm and the detected
communities were converted back to overlapping communities of the original
network. Non clique methods for overlapping community detection Some
other non-clique methods to discover overlapping communities are given in
the table 5. These methods have been briefly explained in this section. An

42
extension of Newman’s modularity for directed graphs and overlapping
communities was done by Nicosia et al53 and modularity was given by 𝑄𝑜𝑣 =
1 𝑚 ∑𝑐∊𝐶 ∑𝑖,𝑗∊𝑉[𝛽𝑙(𝑖,𝑗),𝑐𝐴𝑖𝑗 − 𝛽𝑙(𝑖,𝑗),𝑐𝑜𝑢𝑡𝑘𝑖𝑜𝑢𝑡𝛽𝑙(𝑖,𝑗),𝑐𝑖𝑛𝑘𝑗𝑖𝑛𝑚 . The authors
defined a belongingness coefficient 𝛽𝑙,𝑐 of an edge 𝑙 connecting nodes 𝑖 and 𝑗
for a particular community 𝑐 and is given by 𝛽𝑙,𝑐 = ℱ(𝛼𝑖,𝑐 , 𝛼𝑗,𝑐) where
definition for ℱ(𝛼𝑖,𝑐 , 𝛼𝑗,𝑐) is taken as arbitrary, e.g., it can be taken as a
product of the belonging coefficients of the nodes involved, or as max(𝛼𝑖,𝑐 ,
𝛼𝑗,𝑐). 𝛽𝑙(𝑖,𝑗),𝑐𝑜𝑢𝑡 = ∑𝑗∊𝑉ℱ(𝛼𝑖,𝑐,𝛼𝑗,𝑐) |𝑉| , 𝛽𝑙(𝑖,𝑗),𝑐𝑖𝑛 = ∑𝑖∊𝑉ℱ(𝛼𝑖,𝑐,𝛼𝑗,𝑐) |𝑉| . A
genetic approach has been used in this work for the optimization of
modularity function. Another work which uses genetic approach to
overlapping community detection is GA-Net+, by Pizzuti85 GA-NET+ could
detect overlapping communities using edge clustering. Order Statistics Local
Optimization Method(OSLOM)86 detects clusters in networks, and can
handle various kind of graph properties like edge direction, edge weights,
overlapping communities, hierarchy and network dynamics. It is based on
local optimization of a fitness function expressing the statistical significance
of clusters with respect to random fluctuations, which is estimated with
tools of Extreme and Order Statistics. Baumes et al87 considered a
community as a subset of nodes which induces a locally optimal subgraph
with respect to a density function. Two different subsets with significant
overlap can be locally optimal which forms the basis to find overlapping
communities. Chen et al 88 used game-theoretic approach to address the
issue of overlapping communities. Each node is assumed to be an agent
trying to improve the utility by joining or leaving the community. The
community of the nodes in Nash equilibrium are assumed to form the
output of the algorithm. Utility of an agent is formulated as combination of a
gain and a loss function. To capture the idea of overlapping communities,
each agent is permitted to select multiple communities. In another game-
theoretic approach, Alvari et al89 proposed an algorithm consisting of two
methods PSGAME based on Pearson correlation, and NGGAME centred on
neighbourhood similarity measure. Alvari et al90 proposed the Dynamic
Game Theory method (D-GT) which treated nodes as rational agents. These

43
agents perform actions in iterative and game theoretic manner so as to
maximize the total utility.
Community detection for Dynamic networks:
Dynamic networks are the networks in which the membership of the
nodes of communities evolve or change over time. The task of community
identification for dynamic networks has received relatively less attention
than the static networks The methods have been categorized into two classes
by Bansal et al94, one designed for data which is evolving in real time known
as incremental or online community detection; and the other for data where
all the changes of the network evolution are known a priori, known as offline
community detection. Wolf et al95 proposed mathematical and
computational formulations for the analysis of dynamic communities on the
basis of social interactions occurring in the network. Tantipathananandh et
al96 made assumptions about the individual behaviour and group
membership. Henceforth they framed the objective as an optimization
problem by formulating three cost functions, namely i-cost, g-cost and c-
cost. Graph colouring and heuristics based approach were deployed.
FacetNet, proposed by Lin et al97 is a unified framework to study the
dynamic evolutions of communities. The community structure at any time
includes the network data as well as the previous history of the evolution.
They have used a cost function and proposed an iterative algorithm which
converges to an optimal solution. Palla et al98 conducted experiments on
two diverse datasets of phone call network and collaboration network to find
time dependence. After building joint graphs for two time steps, the CPM
algorithm73 was applied. They have used an auto-correlation function to
find overlap among two states of a community, and a stationarity parameter
which denotes the average correlation of various states. Greene et al99
proposed a heuristic technique for identification of dynamic communities in
the network data. They represented the dynamic network graph as an
aggregation of time step graphs. Step communities represent the dynamic

44
communities at a particular time. The algorithm begins with the application
of a static community detection algorithm on the graph. In the subsequent
steps, dynamic communities are created for each step and Jaccard similarity
is calculated. They have also generated benchmark dataset for experimental
work. The algorithm by Bansal et al94 involves the addition or deletion of
edges in the network. The algorithm is built on the greedy agglomerative
technique of the modularity based method earlier proposed in the work of
Clauset et al35 . He et al100 improvised Louvain method36 to include
concept of dynamicity in the formation of communities. A key point in their
algorithm is to make use of previously detected communities at time 𝑡 − 1 to
identify the communities at time 𝑡. Dinh et al101 proposed A3CS, an
adaptive framework which uses the power-law distribution and achieves
approximation guarantees for the NP-hard modularity maximization
problem, particularly on dynamic networks. Nguyen et al102 have attempted
to identify disjoint community structure in dynamic social networks. An
adaptive modularity-based framework Quick Community Adaptation (QCA)
is proposed. The method finds and traces the progress of network
communities in dynamic online social networks. Takaffoli et al103 have
proposed a two-step approach to community detection. In the first step the
communities extracted at different time instances are compared using
weighted bipartite matching. Next, a ‘meta’ community is constructed which
is defined as a series of similar communities at various time instances. Five
events to capture the changes to community are split, survive, dissolve,
merge, and form. A similarity function is used to calculate the similarity
between two communities and a community matching algorithm has been
employed thereafter. The authors, Kim et al104 proposed a particle-and-
density based evolutionary clustering method for discovery of communities
in dynamic networks. Their approach is grounded on the assumption that a
network is built of a number of particles termed as nano-communities,
where each community further is made up of particles termed as quasi-
clique-by-clique (l-KK). The density based clustering method uses cost
embedding technique and optimal modularity method to ensure temporal

45
smoothness even when the number of cluster varies. They have used an
information theory based mapping technique to recognize the stages of the
community i.e. evolving, forming or dissolving. Their method improves
accuracy and is time efficient as compared to the FacetNet method proposed
earlier. In another approach proposed by Chi et al105, two frameworks for
evolutionary spectral clustering have been proposed namely PCQ (Preserving
cluster quality) and PCM (Preserving cluster membership). In this work the
temporal smoothness is ensured by some terms in the clustering cost
functions. These two frameworks combine the processes of community
extraction and the community evolution process. They use a cost function
which consists of the snapshot and temporal cost. The clustering quality of
any partition determines the snapshot cost while the temporal cost definition
varies for each of the frameworks. For PCQ framework, the temporal cost is
decided by the cluster quality when the current partition is applied to the
historic data. In PCM, the difference between the current and the historic
partition gives the temporal cost. Both the frameworks proposed, can tackle
the change in number of clusters. In their work DYNMOGA (Dynamic
MultiObjective Genetic Algorithm), the authors Folino et al106 have used a
genetic algorithm based approach to dynamic community detection. They
attempt to achieve temporal smoothness by multiobjectiveoptimisation,
i.e.maximisation of snapshot quality (community score is used) and
minimization of temporal cost (here NMI is used). Kim et al107 in their
method CHRONICLE have performed two stage clustering and the method
can detect clusters of path group type also in addition to the single path type
clusters. In first stage of the algorithm, called as CHRONICLE1st the cosine
similarity measure is used. In second stage of the algorithm the measure
proposed and used is general similarity (GS). It is a combination of the two
measures structural affinity and weight affinity.

46
SOME POTENTIAL APPLICATIONS OF COMMUNITY
DETECTION:
With the enormous growth of the social networking site users, the graphs
representing these sites are becoming very complex, hence difficult to
visualize and understand. Communities can be considered as a summary of
the whole network thus making the network easy to comprehend. The
discovery of these communities in social networks can be useful in various
applications. Some of the applications where community detection is useful
are briefly described below.
Improving recommender systems with community detection
Recommender Systems use data of similar users or similar items to
generate recommendations. This is analogous to the identification of groups,
or similar nodes in a graph. Hence community detection holds an immense
potential for recommendation algorithms. Cao et al114 have used a
community detection based approach to improve the traditional collaborative
filtering process of Recommender Systems. The process starts with the
mapping of user-item matrix to user similarity structure. On this matrix, a
discrete PSO (particle swarm optimization) algorithm is applied to detect
communities. The items are then recommended to the user based on the
discovered communities.
Evolution of communities in social media
With the increase in the number of social networking sites, the focus and
scope of sites are getting expanded. The sites are getting diversified in terms
of focus. In addition to common sites like Facebook, Twitter, MySpace and
Bebo, other sites like Flickr for photo-sharing have also come up. The
analysis of the tweetretweet and the follower-followee network in twitter
provides an insight into the community structure existing in the Twitter
network. Sentiment analysis of the tweets may be performed as an
intermediary step to find the general nature of the tweets and then
community detection algorithms may be applied to help deduce the
structure of communities. Zalmout et al115, applied the community

47
detection algorithm to UK political tweets dataset. CQA(Community question
answering) has been used by Zhang et al116 to discover overlapping
communities in dynamic networks based on user interactions.
Related Work:
Social Network:
A social network is depicted by social network graph consisting of number
of nodes denoting individuals or the participants in the network. The
connection between node and node is represented by the edge of the graph.
A directed or an undirected graph may illustrate these connections between
the participants of the network. The graph can be represented by an
adjacency matrix in which in case there is an edge between and else.
Social networks follow the properties of complex networks3,4. Some real life
examples1 of social networks include friends based, telephone, email and
collaboration networks. These networks can be represented as graphs and it
is feasible to study and analyse them to find interesting patterns amongst
the entities. These appealing prototypes can be utilized in various useful
applications.
Community:
A community can be defined as a group of entities closer to each other in
comparison to other entities of the dataset. Community is formed by
individuals such that those within a group interact with each other more
frequently than with those outside the group. The closeness between entities
of a group can be measured via similarity or distance measures between
entities. McPherson et al5 stated that “similarity breeds connection”. They
discussed various social factors which lead to similar behaviour or
homophily in networks. The communities in social networks are analogous
to clusters in networks. An individual represented by a node in graphs may
not be part of just a community or a group, it may be an element of many

48
closely associated or different groups existing in the network. For example a
person may concurrently belong to college, school, friends and family
groups. All such communities which have common nodes are called
overlapping communities. Identification and analysis of the community
structure has been done by many researchers applying methodologies from
numerous form of sciences. The quality of clustering in networks is
normally judged by clustering coefficient which is a measure of how much
the vertices of a network tend to cluster together. The global clustering
coefficient6 and the local clustering coefficient7 are two types of clustering
coefficients discussed in literature.
Methods for grouping similar items:
Communities are those parts of the graph which have denser connections
inside and few connections with the rest of the graph8. The aim of
unsupervised learning is to group together similar objects without any prior
knowledge about them. In case of networks, the clustering problem refers to
grouping of nodes according to their similarity computed based on
topological features and/or other characteristics of the graph. Network
partitioning and clustering are two commonly used methods in literature to
find the groups in the social network graph. These methods are briefly
described in the next subsections.
Graph partitioning Graph partitioning is the process of partitioning a graph
into a predefined number of smaller components with specific properties. A
common property to be minimized is called cut size. A cut is a partition of
the vertex set of a graph into two disjoint subsets and the size of the cut is
the number of edges between the
components. A multicut is a set of edges whose removal divides the graph
into two or more components. It is necessary to specify the number of
components one wishes to get in case of graph partitioning. The size of the
components must also be specified, as otherwise a likely but not meaningful
solution would be to put the minimum degree vertex into one component

49
and the rest of the vertices into another. Since the number of communities is
usually not known in advance, graph partitioning methods are not suitable
to detect communities in such cases.
Clustering is the process of grouping a set of similar items together in
structures known as clusters. Clustering the social network graph may give
a lot of information about the underlying hidden attributes, relationships
and properties of the participants as well as the interactions among them.
Hierarchical clustering and partitioning method of clustering are the
commonly used clustering techniques used in literature.
In hierarchical clustering, a hierarchy of clusters is formed. The process of
hierarchy creation or levelling can be agglomerative or divisive. In
agglomerative clustering methods, a bottom-up approach to clustering is
followed. A particular node is clubbed or agglomerated with similar nodes to
form a cluster or a community. This aggregation is based on similarity. In
divisive clustering approaches, a large cluster is repeatedly divided into
smaller clusters.
Partitioning methods begin with an initial partition amidst the number of
clusters pre-set and relocation of instances by moving them across clusters,
e.g., K-means clustering. An exhaustive evaluation of all possible partitions
is required to achieve global optimality in partitioned-based clustering. This
is time consuming and sometimes infeasible, hence researchers use greedy
heuristics for iterative optimization in partitioning methods of clustering.
The next section categorizes and discusses major algorithms for community
detection.
ALGORITHMS FOR COMMUNITY DETECTION:
A number of community detection algorithms and methods have been
proposed and deployed for the identification of communities in literature.
There have also been modifications and revisions to many methods and
algorithms already proposed. A comprehensive survey of community

50
detection in graphs has been done by Fortunato8 in the year 2010. Other
reviews available in literature are by Coscia et al9 in 2011, Fortunato et al10
in 2012, Porter et al11 in 2009, Danon et al12 in 2005, and Plantié et al13
in 2013. The presented work reviews the algorithms available till 2015 to the
best of our knowledge including the algorithms given in the earlier surveys.
Papers based on new approaches and techniques like big data, not
discussed by previous authors have been incorporated in our article.The
algorithms for community detection are categorized into approaches based
on graph partitioning, clustering, genetic algorithms, label propagation along
with methods for overlapping community detection (clique based and non-
clique based methods), and community detection for dynamic networks.
Algorithms under each of these categories are described below.
Graph partitioning based community detection:
Graph partitioning based methods have been used in literature to divide the
graph into components such that there are few connections between
components. The Kernighan-Line14 algorithm for graph partitioning was
amongst the earliest techniques to divide a graph. It partitions the nodes of
the graph with cost on edges into subsets of given sizes so as to minimize
the sum of costs on all edges cut. A major disadvantage of this algorithm
however is that the number of groups have to be predefined. The algorithm
however is quite fast with a worst case running time of 𝑂(𝑛2). Newman15
reduces the widely-studied maximum likelihood method for community
detection to a search through a group of candidate solutions, each of which
is itself a solution to a minimum cut graph partitioning problem. The paper
shows that the two most essential community inference methods based on
the stochastic block model or its degree-corrected variant16 can be mapped
onto versions of the familiar minimum-cut graph partitioning problem. This
has been illustrated by adapting Laplacian spectral partitioning method17,
18 to perform community inference.

51
Definition of Local Community:
The problem of local community detection is proposed by Clauset [15].
Usually we define the local community problem in the following way: there is
a nondirected graph , represents the set of nodes, and represents the edges
in the graph. The connecting information of partial nodes in the graph is
known or can be obtained. The local community is defined as . The set of
nodes connected with is defined as and the set of nodes in connected with
nodes in is defined as the boundary node set . That is to say, any node in is
connected to one node in , and the rest of is the core node set .
Clustering based community detection:
The main concern of community detection is to detect clusters, groups or
cohesive subgroups. The basis of a large number of community detection
algorithms is clustering. Amongst the innovators of community detection
methods, Girvan and Newman19 had a main role. They proposed a divisive
algorithm based on edge-betweenness for a graph with undirected and
unweighted edges. The algorithm focused on edges that are most “between”
the communities and communities are constructed progressively by
removing these edges from the original graph. Three different measures for
calculation of edge-betweenness in vertices of a graph were proposed in
Newman and Girvan20. The worst-case time complexity of the edge
betweenness algorithm is 𝑂(𝑚2𝑛) and is 𝑂(𝑛3) for sparse graphs, where m
denotes the number of edges and n is the number of vertices.
The Girvan Newman (GN) algorithm has been enhanced by many authors
and applied to various networks21-28. Chen et al22 extended GN algorithm
to partition weighted graphs and used it to identify functional modules in
the yeast proteome network. Rattigan et al21 proposed the indexing methods
to reduce the computational complexity of the GN algorithm significantly.
Pinney et al24 also build an algorithm which uses GN algorithm for the
decomposition of networks based on graph theoretical concept of

52
betweenness centrality. Their paper inspected utility of betweenness
centrality to decompose such networks in diverse ways. Radicchi29 et al also
proposed an algorithm based on GN algorithm introducing a new definition
of community. They defined ‘strong’ and ‘weak’ communities. The algorithm
uses an edge clustering coefficient to perform the divisive edge removal step
of GN and has a running time of 𝑂(𝑚4 𝑛2 ) and 𝑂(𝑛2) for sparse graphs.
Moon et al30 have proposed and implemented the parallel version of the GN
algorithm to handle large scale data. They have used MapReduce model
(Apache Hadoop) and GraphChi. Newman and Girvan first defined a
measure known as ‘modularity’ to judge the quality of partitions or
communities formed20. The modularity measure proposed by them has been
widely accepted and used by researchers to gauge the goodness of the
modules obtained from the community detection algorithms with high
modularity corresponding to a better community structure. Modularity was
defined as ∑𝒆𝒊𝒊𝒊 − 𝒂𝒊𝟐 , where 𝒆𝒊𝒊 denotes fraction of the edges that connect
vertices in community 𝒊, 𝒆𝒊𝒋 denotes fraction of the edges connecting vertices
in two different communities 𝒊 and 𝒋 while 𝒂𝒊 = ∑ 𝒆𝒊𝒋𝒋 is the fraction of edges
that connect to vertices in community 𝒊. The value 𝑄 = 1 indicates a network
with strong community structure. The optimization of modularity function
has received great attention in literature. The table 1 lists clustering based
community detection methods, including algorithms which use modularity
and modularity optimization. Newman31 has worked to maximize
modularity so that the process of aggregating nodes to form communities
leads to maximum modularity gain. This change in modularity upon joining
two communities defined as ∆𝑄 = 𝑒𝑖𝑗 + 𝑒𝑗𝑖 − 2𝑎𝑖𝑎𝑗 = 2(𝑒𝑖𝑗−𝑎𝑖𝑎𝑗) can be
calculated in constant time and hence is faster to execute in comparison to
the GN algorithm. The run time of the algorithm is 𝑂(𝑛2) for sparse graphs
and 𝑂((𝑚 + 𝑛)𝑛) for others. In a recent work, a scalable version of this
algorithm has been implemented using MapReduce by Chen et al32.
Newman33 generalized the betweenness algorithm for weighted networks.
The modularity was now represented as 𝑄 = 1 2𝑚 ∑ [𝐴𝑖𝑗𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚 ]δ(𝑐𝑖,𝑐𝑗)

53
where 𝑚 = 1 2 ∑ 𝐴𝑖𝑗𝑖𝑗 represents the number of edges between communities
𝑐𝑖 and 𝑐𝑗 in the graph, while 𝑘𝑖,𝑘𝑗 are degrees of vertices 𝑖 and 𝑗 while 𝛿(𝑢,𝑣)
is 1 if 𝑢 = 𝑣 and 0 otherwise. Newman34 in yet another approach
characterised the modularity matrix in terms of eigenvectors. The equation
for modularity was changed to
𝑄 =1 4𝑚𝑠𝑇𝐵𝑠 , where the modularity matrix was given as 𝐵𝑖𝑗 = 𝐴𝑖𝑗 − 𝑘𝑖𝑘𝑗 2𝑚
and modularity was defined using eigenvectors of the modularity
matrix.
Figure 1
Definition of local community:
Local community detection problem is to start from a preselected source
node. It adds the node meeting the conditions in into and removes the node
which does not meet the conditions from D gradually.
2.2. Related Algorithms:
At present, many local community detection algorithms have been proposed.
We introduce two representative local community detection algorithms.
(1)Clauset Algorithm: In order to solve the problem of local community
detection, Clauset [15] put forward the local community modularity R and
gave a fast convergence greedy algorithm to find the local community with
the greatest modularity.
Thedefinition of local community modularity is asfollows
where and represent two nodes in the graph. If nodes and are connected,
the value of is 1; otherwise, it is 0; if nodes and are both in , the value
of is 1; otherwise, it is 0.
The local community detection process of Clauset algorithm is similar to that
of web crawler algorithm. First, Clauset algorithm starts from an initial

54
node .Node is added to the subgraph , and all its neighbor nodes are added
to . Then the algorithm adds the node in which can bring the maximum
increment of into the local community iteratively, until the scale of the local
community reaches the preset size. That is to say, the algorithm needs to set
up a parameter to decide the size of the community, and the result is greatly
influenced by the initial node.
(2) LWP Algorithm: LWP [16] algorithm is an improved algorithm and it has
a clear end condition compared with Clauset algorithm. The algorithm
defines another local community modularity , which is expressed
aswhere and represent two nodes in the graph. If nodes and are
connected to each other, the value of is 1; otherwise, it is 0; if
nodes and are both in , the value of is 1; otherwise, it is 0; if only one of
the nodes and is in , the value of is 1; otherwise, it is 0.
Given an undirected and unweighted graph , LWP algorithm starts from an
initial node to find a subgraph with maximum value of . If the subgraph is a
community (i.e., ), then it returns the subgraph as a community. Otherwise,
it is considered that there is no community that can be found starting from
this initial node. For an initial node, LWP algorithm finds a subgraph with
the maximum value of local modularity by two steps. First, the algorithm is
initialized by constructing a subgraph with only an initial node and all the
neighbor nodes of node are added to the set . Then the algorithm performs
incremental step and pruning step.
In the incremental step, the node selected from which can make the local
modularity of increase with the highest value is added to iteratively. The
greedy algorithm will iteratively add nodes in to , until no node in can be
added. In the pruning step, if the local modularity of becomes larger when
removing a node from , then really remove it from . In the process of
pruning, the algorithm must ensure that the connectivity of is not destroyed
until no node can be removed. Then update the set and repeat the two steps
until there is no change in the process. The algorithm has a high Recall, but
its accuracy is low.

55
The complexity of these two algorithms is , where is the number of nodes to
be explored in the local community and is the average degree of the nodes to
be explored in the local community.
CHAPTER-2
LITERATURE

56
Literature Review:
A wide research study in the recent years focused on community detection in
complex systems [4], most of them focus on undirected networks to enhance
the efficiency of identifying communities in understanding complex
networks. For instances, Fortunato et al [3] based his proposed approach on
statistical inference perspectives, Schaeffer et al [5], proposed their approach
for clustering problem as an unsupervised learning task based on similarity
measure over the data of the network, Girvan and Newman based their
community detection proposal on betweenness calculation to find out
community boundaries where modularity measure is the overall quality of
the graph partitioning [6, 7]. The weight used by Newman and Girvan [7]
aims to be the betweenness measure of the edge, representing the number of
shortest paths connecting any pair of nodes passing through. However,
community detection problem has been studied mainly in case of undirected

57
networks, various solutions was proposed in this context, motivating many
disciplines to deal with the issue. Interestingly, Fortunato et al [3] mentioned
the few possibilities for extending techniques from undirected to directed
case, where the edge directedness is not the only complication that could
face the clustering problem. Nevertheless, diverse graph data in many real-
world applications are by nature directed, thus interesting to save available
information behinds the edge directionality. Malliaros et al [8], revealed in
their survey that the most common way for researcher community to deal
with the problem of clustering is to ignore the directionality of the graph,
then proceed to clustering with a wide range of proposed tools. Therefore,
most of community detection proposals can not be used directly on weighted
directed graphs, where the number of communities not always known in
advance and the communities present different granularity scale. Since the
problem of community detection in complex network analysis acquires more
attention, many researchers have been interested into structural information
and topological networks metrics [1, 3, 4, 6, 7, 8, 9]. In [10], S. Ahajjam
based the proposed community detection algorithms on a new scalable
approach using leader nodes characteristics through two steps: (i)
Identification of potential leaders in the network, and (ii) exploration of nodes
similarities around leaders to build communities. Therefore, recent works
start focusing on both topological and topical aspects [9, 11, 12] to overpass
limited performances of topology-based community detection approaches.
Topic-based community detection started gaining attention through different
works for community detection in complex network [9, 13, 14]. The essence
behind the approach is to similarly detect nodes with same properties, which
are not necessarily real connections between nodes of the network, in which
actors communicate on topics of mutual interest [14] to determine the
communities which are topically similar.
The main contributions of this thesis are to exploring and utilizing local
community detection related approaches to improve the services in Online

58
SocialNetworking Sites in terms of their recommendation efficiency and
accuracy.
Contributions are listed as follows:
M.E.J.NewmananDM.Girvan[1],”Finding and evaluating community structure
in network”. One of the most relevant features of graphs representing real
systems is community structure, or clustering. Detecting communities is of great
importance in sociology, biology and computer science, disciplines where systems
are often represented as graphs. Community detection is important for
other reasons, too. Identifying modules and their boundaries allows for a
classification of vertices, according to their structural position in the modules. So,
vertices lying at the boundaries between modules play an important role of
mediation and lead the relationships and exchanges between different
communities . Such a classification seems to be meaningful in social and
metabolic networks.
The procedure can be better illustrated by means of dendrograms. Sometimes,
stopping conditions are imposed to select a partition or a group of partitions that
satisfy a special criterion, like a given number of clusters or the optimization of a
quality function.
A simple way to identify communities in a graph is to detect the edges that connect
vertices of different communities and remove them, so that the clusters get
disconnected from each other. This is the philosophy of Divisivealgorithms.
H.-W.Shen and X.-Q.Cheng[3],”Spectral methods for the detection of neteork
community structure”.In this paper Spectral analysis has been successfully
applied at the detection of community structure of networks, respectively being
based on the adjacency matrix, the standard Laplacian matrix, the normalized
Laplacian matrix, the modularity matrix, the correlation matrix and several other
variants of these matrices. However, the comparison between these spectral
methods is less reported. More importantly, it is still unclear which matrix is more
appropriate for the detection of community structure. This paper answers the
question through evaluating the effectiveness of these five matrices against the

59
benchmark networks with heterogeneous distributions of node degree and
community size. In this paper, we conduct a comparative analysis of the
aforementioned five matrices on the benchmark networks which have
heterogeneous distributions of node degree and community size. The comparison is
carried out from two perspectives. The former one focuses on whether the number
of intrinsic communities can be exactly identified according to the spectrum of
these five matrices. The latter evaluates the effectiveness of these matrices at
identifying the intrinsic community structure using their eigenvectors
This paper carried out a comparative analysis on the spectral methods for the
detection of network community structure through evaluating the performance of
five widely used matrices on the benchmark networks with heterogeneous
distribution of node degree and community size. These five matrices are
respectively the adjacency matrix, the standard Laplacian matrix, the normalized
Laplacian matrix, the modularity matrix and the correlation matrix. Test results
demonstrate that the normalized Laplacian matrix and the correlation matrix
significantly outperform the other three matrices at identifying the community
structure of networks. This indicates that the heterogeneity of node degree is a
crucial ingredient for the detection of community structure using spectral
methods.
V.D.Blondel,J.Guillaume,R.Lambiotte et al[7],”Fast unfolding of communities in
large networks”. This Paper propose a simple method to extract the community
structure of large networks. Our method is a heuristic method that is based on
modularity optimization. It is shown to outperform all other known community
detection method in terms of computation time. Moreover, the quality of the
communities detected is very good, as measured by the so-called modularity. This
is shown first by identifying language communities in a Belgian mobile phone
network of 2.6 million customers and by analyzing a web graph of 118 million
nodes and more than one billion links. The accuracy of our algorithm is also
verified on ad-hoc modular networks. The problem of community detection
requires the partition of a network into communities of densely connected nodes,
with the nodes belonging to different communities being only sparsely connected.

60
This paper introduced an algorithm for optimizing modularity that allows to
study networks of unprecedented size. The limitation of the method for the
experiments that we performed was the storage of the network in main memory
rather than the computation time. The accuracy of our method has also been
tested on ad-hoc modular networks and is shown to be excellent in comparison
with other (much slower) community detection methods. By construction, our
algorithm unfolds a complete hierarchical community structure for the network,
each level of the hierarchy being given by the intermediate partitions found at each
pass.
K.M.Tan,D.Written and A.Shojaie[8],”The cluster graphical lasso for improved
estimation of Gaussian graphical models”. In this paper the task of estimating
a Gaussian graphical model in the high-dimensional setting is considered. The
graphical lasso, which involves maximizing the Gaussian log likelihood subject to a
lasso penalty, is a well-studied approach for this task. A surprising connection
between the graphical lasso and hierarchical clustering is introduced: the
graphical lasso in effect performs a two-step procedure, in which single linkage
hierarchical clustering is performed on the variables in order to identify connected
components, and then a penalized log likelihood is maximized on the subset of
variables within each connected component. Thus, the graphical lasso determines
the connected components of the estimated network via single linkage clustering.
The single linkage clustering is known to perform poorly in certain finite-sample
settings. Therefore, the cluster graphical lasso, which involves clustering the
features using an alternative to single linkage clustering, and then performing the
graphical lasso on the subset of variables within each cluster, is proposed.
We have shown that identifying the connected components of the graphical lasso
solution is equivalent to performing SLC based on S ̃ , the absolute value of the
empirical covariance matrix. Based on this connection, we have proposed the
cluster graphical lasso, an improved version of the graphical lasso for sparse
inverse covariance estimation. In this paper, we have considered the use of
hierarchical clustering in the CGL procedure. We have shown that performing
hierarchical clustering on S ̃ leads to consistent cluster recovery. As a byproduct,

61
we suggest a choice of λ1, …, λK in CGL that yields consistent identification of the
connected components. In addition, we establish the model selection consistency of
CGL.
Y.J.Wu,H.Huang,Z.F.Hao and F.Chen[17],”Local Community Detection using
link similarity”.In this paper Exploring local community structure is an
appealing problem that has drawn much recent attention in the area of social
network analysis. The existing approaches do well in measuring the community
quality, but they are largely dependent on source vertex and putting too strict
policy in agglomerating new vertices. Moreover, they have parameters which are
difficult to obtain. This paper proposes a method to¯find local community structure
by analyzing link similarity between the community and the vertex. Inspired by the
fact that elements in the same community are more likely to share common links,
we explore community structure heuristically by giving priority to vertices which
have a high link similarity with the community. A three-phase process is also used
for the sake of improving quality of community structure. Experimental results
prove that our method performs efectively not only in computer-generated graphs
but alsoin real-world graphs.
In this paper we present a method that mainly depends on link similarity
between the vertex and community to explore local community. This method
searches the potential vertices in a specific sequence so as to help improve the
accuracy. In this paper, we pro-posed an improved method to detect local
community structure. Our algorithm mainly takes advantage of link similarity
between the vertex and the community.A greedy agglomeration phase, an
optimization phase and a trimming phase are included in our algorithm.Compared
with other multi-phase algorithms, our al-gorithm implements comparatively
easier and stricter stopping criteria for the purpose of simplifying and op-timizing
our algorithm. Experimental results show that our algorithm can discover local
community better than other existing methods both in computer-generated
benchmark graphs and in real-world networks.

5.local community detection algorithm based on minimal cluster

5.local community detection algorithm based on minimal cluster

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to 5.local community detection algorithm based on minimal cluster

Similar to 5.local community detection algorithm based on minimal cluster (20)

More from Venkat Projects

More from Venkat Projects (20)

Recently uploaded

Recently uploaded (20)

5.local community detection algorithm based on minimal cluster