IMPROVING EFFICIENCY OF COMMUNITY
MEMBER IDENTIFICATION USING SEED SET
EXPANSION
A thesis submitted in partial fulfilment of the requirements for
the award of the degree of
B. Tech
In
Computer Science and Engineering
By
Abishek Prasanna (106111002)
R Sibi (106111068)
Rahul R (106111070)
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPPALLI - 620015
MAY 2015
BONAFIDE CERTIFICATE
This is to certify that the project titled IMPROVING EFFICIENCY OF
COMMUNITY MEMBER IDENTIFICATION USING SEED SET
EXPANSION is a bonafide record of the work done by
Abishek Prasanna (106111002)
R Sibi (106111068)
Rahul R (106111070)
in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering of the NATIONAL
INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI, during the year 2014-
2015.
Dr. E. Sivasankar Dr. (Mrs) R. Leela Velusamy
Guide Head of the Department
Project Viva-voce held on _____________________________
Internal Examiner External Examiner
ABSTRACT
In many applications a network of people is involved, and one would like to identify the
members of an interesting but unlabelled group or community. One starts with a small
number of exemplar group members – they may be followers of a political ideology
or fans of a music genre – and uses those examples to discover the additional members.
This problem gives rise to the seed expansion problem in community detection: given
example community members, how can the social graph be used to predict the
identities of remaining, hidden community members? In contrast with global
community detection (graph partitioning or covering), seed expansion is best suited
for identifying communities locally concentrated around nodes of interest. A growing
body of work has used seed expansion as a scalable means of detecting overlapping
communities. Yet despite growing interest in seed expansion, approaches in the
literature diverge, and there is still no systematic understanding of which
approaches work best in different domains.
Here several variants are evaluated, uncovering subtle trade-offs between the different
approaches. The ideas in these algorithms that leave room for performance gains are
explored, focusing on heuristics that one can control in practice. As a consequence of
this systematic understanding, several opportunities for performance gains were
discovered. We have thereby developed our own modification to the PageRank
algorithm and shown its higher performance in comparison with the existing ones.
This leads to interesting connections and contrasts
with active learning and the trade-offs of exploration and exploitation. Finally, we
explore the expansion problem by bringing in an adaptive algorithm that is found to
work well with the improved version that we have come up with. We evaluate our
methods across multiple domains, using publicly available datasets with labelled,
ground-truth communities.
Keywords: Seed set expansion, Ground-truth communities
ACKNOWLEDGEMENTS
We would like to thank our project guide Dr. E. Sivasankar, Assistant Professor,
Department of Computer Science and Engineering, for his constant guidance,
encouragement and help during the entire duration of the project. His enthusiasm has
been a driving force for our efforts through the course of this project.
We would also like to offer our sincere thanks to Dr. (Mrs) R. Leela Velusamy, Head
of the Department, Computer Science and Engineering, National Institute of
Technology, Trichy who provided us with the necessary environment, tools and
feedback for the implementation of the project.
TABLE OF CONTENTS
Title Page Number
ABSTRACT……………………………………………………………… i
ACKNOWLEDGEMENTS …………………………………………….. ii
TABLE OF CONTENTS…………………………………………………. iii
LIST OF FIGURES ……………………………………………………… v
NOTATIONS …………………………………………………………….. vi
CHAPTER 1: INTRODUCTION
1.1 Motivation …………………………………………………….. 1
1.2 Community ……………………………………………………. 1
1.3 Community detection …………………………………………. 2
1.4 Practical Applications of Clustering, Community Detection… 3
1.5 Purpose Of Community Detection ……………………………... 4
1.5.1 Is it necessary to extract groups based on network topology? 5
1.5.2 Importance of network interaction ………………………. 5
1.6 Challenges ……………………………………………………… 5
1.7 Thesis Overview………………………………………………… 7
CHAPTER 2: LITERATURE SURVEY
2.1 Neighbour Counting Algorithm …………………………………. 8
2.1.1 Community Discovery Methods …………………………. 8
2.1.1.1 Graph Partition Techniques ………………………… 8
2.1.1.2 Hierarchical Clustering …………………………….. 9
2.1.2 Expanding an existing community………………………… 10
2.2 Greedy algorithm ……………………………………………….. 14
2.3 Drawbacks of Greedy algorithm ……………………………….. 16
2.4 Drawbacks of PageRank algorithm …………………………….. 16
CHAPTER 3: IMPLEMENTATION
3.1 Finding potential members ………………………………………… 17
3.2 Improved PageRank ……………………………………………….. 19
CHAPTER 4: PERFORMANCE ANALYSIS
4.1 Neighbour Counting ………………………………………………. 25
4.2 Greedy Algorithm …………………………………………………. 26
4.3 PageRank algorithm ………………………………………………. 27
4.4 Comparing performances of PageRank and the proposed Improved
PageRank ………………………………………………………….. 28
4.5 Inference …………………………………………………………... 31
CONCLUSION AND FUTURE WORK …………………………………..... 32
REFERENCES ……………………………………………………………….. 33
BIBLIOGRAPHY …………………………………………………………… 34
APPENDICES
Appendix A Snippets of Codes ……………………………………... 36
Appendix B Glossary of Terms …………………………………….. 38
LIST OF FIGURES
Figure Page No
1.1 Community Structure …..........................................................................… 2
1.2 Categorization of various search engines …............................................... 4
1.3 Typical life cycle of a social media network …………….......................... 4
2.1 Neighbour counting representation ….......................................................... 14
3.1 Finding potential neighbour ….................................................................... 17
3.2 Flowchart depicting neighbour counting algorithm ….................................. 19
3.3 Flowchart depicting improved PageRank algorithm …............................... 21
3.4 Input WebGraph …........................................................................................ 22
3.5 Existing community members …................................................................. 22
3.6 Calculating PageRank using the existing code …......................................... 23
3.7 Calculating PageRank using improved algorithm ….................................... 23
3.8 Potential members are determined …............................................................ 24
3.9 Graph showing analysis between PageRank and Improved page rank …….. 24
4.1 Neighbour counting algorithm steps ………………………………………… 25
4.2 Greedy algorithm steps ……………………………………………………… 26
4.3 Proposed Improved PageRank Algorithm steps ……………………………. 27
4.4 Graphs showing iteration of PageRank ……………………………………… 28
4.5 Modularity Comparison between the three algorithms ……………………… 30
4.6 Outwardness Comparison between the three algorithms ……………………. 31
NOTATIONS
Q Modularity
eij Edge directed from node ‘i’ to node ‘j’
Pr[n,k] Probability that an entity would have at least ‘k’ of its ‘n’
neighbours from the group by chance
P Fraction of known group members to network nodes
R Local modularity
Bij Adjacency matrix comprising only those edges with one or more end
points in community
Min Number of edges internal to community
Mout Number of edges external to community
Ov(C) Outwardness of vertex ‘v’ in community ‘C’
Kv Degree of vertex ‘v’
Kv^out Number of neighbours outside community
Kv^in Number of neighbours inside community
PR[A] PageRank of node ‘A’
C[A] Total number of outgoing links on ‘A’
N Total number of nodes in the graph
d Damping factor
CHAPTER 1
INTRODUCTION
1.1 Motivation
Networks are omnipresent on the Web. The most profound Web network is the Web
itself comprising billions of pages as vertices and their hyperlinks to each other as
edges. Moreover, collecting and processing the input of Web users (e.g. queries,
clicks) results in other forms of networks, such as the query graph. Finally, the
widespread use of Social Media applications, such as Bibsonomy, IMDB, Flickr and
YouTube, is responsible for the creation of even more networks, ranging from
folksonomy networks to rich media social networks. Not only is it possible by
analyzing such networks to gain insights into the social phenomena and processes that
take place in the world, but one can also extract actionable knowledge that can be
beneficial in several information management and retrieval tasks, such as online
content navigation and recommendation. However, the analysis of such networks
poses serious challenges to data mining methods, since these networks are almost
invariably characterized by huge scales and a highly dynamic nature.
A valuable tool in the analysis of large complex networks is community detection.
The problem that community detection attempts to solve is the identification of
groups of vertices that are more densely connected to each other than to the rest of the
network. Detecting and analyzing the community structure of networks has led to
important findings in a wide range of domains, ranging from biology to social
sciences and the Web. Such studies have shown that communities constitute
meaningful units of organization and that they provide new insights in the structure
and function of the whole network under study. Recently, there has been increasing
interest in applying community detection on Social Media networks not only as a
means of understanding the underlying phenomena taking place in such systems, but
also to exploit its results in a wide range of intelligent services and applications, e.g.
recommendation engines, automatic event detection in Social Media content.
1.2 Community
A community is formed by individuals such that those within a group interact with
each other more frequently than with those outside the group. A network community
(also sometimes referred to as a module or cluster) is typically thought of as a group
of nodes with more and/or better interactions amongst its members than between its
members and the remainder of the network. Figure 1.1 shows a sample community
structure with three interlinked communities.
Figure 1.1: Community structure
1.3 Community Detection
Several attempts have been made to provide a formal definition for this generally
described community detection concept in networks. A strong community was defined
as a group of nodes for which each node of the community has more edges to other
nodes of the same community than to nodes outside the community. This is a
relatively strict definition, in the sense that it does not allow for overlapping
communities and creates a hierarchical community structure since the entire graph can
be a community itself. A weak community was later defined as a subgraph in which
the sum of all node degrees within the community is larger than the sum of all node
degrees toward the rest of the graph [6].
Community detection is the process of discovering groups in a network where
individuals’ group memberships are not explicitly given. The problem of cluster or
community detection in real-world graphs involving large social networks, web
graphs and biological networks is of considerable practical interest and has received a lot of
attention recently. To extract such sets of nodes one typically chooses an objective
function that captures the above intuition of a community as a set of nodes with better
internal connectivity than external connectivity. Then, since the objective is typically
NP-hard to optimize exactly, one employs heuristics or approximation algorithms to
find sets of nodes that approximately optimize the objective function and that can be
understood or interpreted as real communities. Alternatively, one might define
communities operationally to be the output of a community detection procedure,
hoping they bear some relationship to the intuition as to what it means for a set of
nodes to be a good community. Once extracted, such clusters of nodes are often
interpreted as organizational units in social networks, functional units in biochemical
networks, ecological niches in food web networks, or scientific disciplines in citation
and collaboration networks.
1.4 Practical applications of clustering, community detection
 Recommendation tools for forming on-line groups have the potential to collect a
few initial suggestions from a user and then produce a longer list of recommended
group members.
 Similarly, a marketer may want to expand a set of a few interested consumers of a
product into a longer list of people who might also be interested in the product.
 Seed set expansion has also been used to infer missing attributes in user profile
data [3] and to detect e-mail addresses of spammers.
 Simplifies visualization, analysis on complex graphs
 Search engines – Categorization. Figure 1.2 depicts a dendrogram of various
search engines. A similarity threshold is employed to isolate clusters of similar
quality.
Figure 1.2: Categorization of various search engines
 Social networks - Useful for tracking group dynamics. Social media network’s
typical life cycle has been depicted in figure 1.3. Raw social media network is
formulated by clustering recorded transactions. It is further simplified to form a clean
network.
Figure 1.3: Typical life cycle of a social media network.
 Neural networks - Tracks functional units
One major challenge in neuroscience is to identify the functional modules from
multichannel, multiple subjects’ recordings. Most research on community detection
has focused on finding the association matrix based on functional connectivity,
instead of effective connectivity, thus not capturing the causality in the network.
 Food webs - helps isolate co-dependent groups of organisms
1.5 Purpose of community detection
 Understanding the interactions between people.
 Visualising and navigating huge networks.
 Forming the basis for other tasks such as data mining.
 Social networks often include community groups based on common location,
interests, occupation, etc. Communities are present in metabolic networks
based on functional groupings. Communities are formed in citation networks
based on research topic. Identifying these sub-structures within a network can
provide knowledge of how network function and topology affect each other.
1.5.1 Is it necessary to extract groups based on network topology?
 Not all social media websites provide a community platform.
 Not all people want to make the effort to join groups.
 Through community extraction communities can be suggested to people based
on their interests.
 Groups in the real world change dynamically.
 Besides social media websites it is essential to extract communities in other
networks such as citation networks, World Wide Web, metabolism networks
for various practical purposes.
1.5.2 Importance of network interaction
 Rich information about the relationship between users can be obtained through
analysing network interaction which can complement other kinds of
information, e.g. user profile.
 It provides basic information that is essential for other tasks, e.g.
recommendation.
 Analysing network interaction helps in network visualization and navigation.
1.6 Challenges
The major challenges usually encountered in the problem of community detection in
networks are highlighted below:
 Scalability
The amount of online media content over the internet is rising every day at a
tremendous rate. Currently, the sizes of such networks are on the scale of billions of
nodes and connections. As the network expands, both the space required to store the
network and the time needed to process it grow enormously.
This imposes a great challenge to conventional community detection algorithms,
which typically deal with at most thousands of nodes.
 Heterogeneity
Raw media networks comprise multiple types of edges and vertices. Usually, they are
represented as hypergraphs or k-partite graphs. The majority of community detection
algorithms are not applicable to such graphs. For that reason, it is
common practice to extract simplified network forms that depict partial aspects of the
complex interactions of the original network.
 Evolution
Due to the highly dynamic nature of social media data, the evolving nature of the
network should be taken into account in network analysis applications. So far, the
discussion on community detection has progressed under the tacit assumption that the
network under consideration is static. Time awareness should be incorporated in the
community detection approaches.
 Evaluation
The lack of reliable ground-truth makes the evaluation extremely difficult [7].
Currently the performance of community detection methods is evaluated by manual
inspection. Such anecdotal evaluation procedures require extensive manual effort, are
non-comprehensive and limited to small networks.
 Privacy
Privacy is a big concern in social media; Facebook and Google often appear in
debates about privacy. Simple anonymity does not necessarily protect privacy. As
private information is involved, a secure and trustworthy system is critical. Hence, a
lot of valuable information is not made available due to security concerns.
1.7 Thesis Overview
The remainder of the thesis is organized as follows. In the next chapter, the background
of the algorithms researched and their pros and cons are discussed. The subsequent
parts deal with the development of an improved version of the existing PageRank
algorithm and how this can be used to solve the problem of community expansion.
The third chapter explains the implementation of the proposed improved PageRank
algorithm. The final chapter deals with the comparison between the three algorithms
and a brief performance analysis of the improved algorithm. The thesis ends with a
short conclusion and notes on future scope of these algorithms.
CHAPTER 2
LITERATURE SURVEY
2.1 Neighbour Counting Algorithm
The algorithm works in two phases: the community discovery phase and the
expanding phase. Discovery is concerned with finding a group of entities that are
members of a community, while expanding seeks to identify further members of a
community given part of its membership.
2.1.1 Community Discovery Methods
Members of natural groups in a network will tend to have a high density of
connections between them, with lower connectivity between different groups.
Discovering communities is typically viewed as a clustering problem, with specific
techniques being more applicable to social networks. A large class of methods deal on
a global scale, where every single vertex is assigned to a single community. An
overview of these methods follows.
2.1.1.1 Graph Partition Techniques
Bisection techniques attempt to partition the network into two relatively separate
subgraphs. Several methods are effective at identifying a single bisection, but work
less well on graphs containing many distinct communities. An external decision must
be made to indicate when to stop bisecting, that is, how many communities exist
in the graph. Methods include:
 Max Flow/Min Cut. These methods can produce good bisections, but make
no guarantees about keeping both groups of similar size. Flake et al. give a
min-cut algorithm based on min-cut trees which is able to produce an arbitrary
number of clusters, and can be expanded to produce a hierarchical clustering.
 Spectral Bisection. Spectral bisection techniques partition a graph based on
the eigenvectors of its Laplacian. The Laplacian Q of a graph G is defined as
Q = D − A, where D is an n×n diagonal matrix with Dv,v = d(v) and A is the
adjacency matrix of G. The spectral bisection method finds the eigenvector
corresponding to the second smallest eigenvalue λ2, and bisects the graph on
whether the eigenvector entry for a vertex is positive or negative. λ2 is also
called the algebraic connectivity of the graph; a smaller value indicates a better
split into two groups [5].
 Kernighan-Lin Algorithm. This heuristic algorithm attempts to greedily
minimize the “external cost” of a partition, which is the sum of the cost of
inter-partition edges. It starts with an initial (possibly random) partition, and
determines the pair of vertices whose swap would produce the largest decrease
in cost. This gives a sequence of vertex swaps which is then scanned to find
the minimum. The procedure is then repeated with the new partition as the
starting point, until convergence on a local minimum is achieved.
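As an illustrative sketch of the spectral method above (not code from this thesis), the Fiedler-vector bisection can be written with NumPy; the example graph used in practice would be the input adjacency matrix.

```python
import numpy as np

def spectral_bisection(adj):
    """Bisect a graph by the sign of the Fiedler vector of its Laplacian Q = D - A."""
    adj = np.asarray(adj, dtype=float)
    deg = np.diag(adj.sum(axis=1))       # D: diagonal degree matrix
    lap = deg - adj                      # the Laplacian
    vals, vecs = np.linalg.eigh(lap)     # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]                 # eigenvector of the second-smallest eigenvalue
    part_a = [v for v in range(len(adj)) if fiedler[v] >= 0]
    part_b = [v for v in range(len(adj)) if fiedler[v] < 0]
    return part_a, part_b, vals[1]       # vals[1] is the algebraic connectivity
```

On two triangles joined by a single bridge edge, the signs of the Fiedler vector separate the two triangles.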
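A single selection step of the Kernighan-Lin heuristic can be sketched as follows (illustrative only): it scans all cross-partition pairs for the swap with the largest reduction in external cost.

```python
import numpy as np

def best_swap(adj, part_a, part_b):
    """One Kernighan-Lin selection step: find the vertex pair (a, b) whose
    swap most reduces the external cost of the partition (part_a, part_b)."""
    adj = np.asarray(adj, dtype=float)

    def d_value(v, own, other):
        # D-value of v: external cost minus internal cost
        return sum(adj[v, u] for u in other) - sum(adj[v, u] for u in own)

    best_gain, best_pair = float("-inf"), None
    for a in part_a:
        for b in part_b:
            # swapping a and b reduces the external cost by this gain
            gain = d_value(a, part_a, part_b) + d_value(b, part_b, part_a) - 2 * adj[a, b]
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
    return best_pair, best_gain
```

Given a bad initial partition of two bridged triangles, the step identifies the one swap that restores the natural bisection.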
2.1.1.2 Hierarchical Clustering
Hierarchical clustering techniques are driven by an application-specific similarity
measure between the groups of vertices of a network [Scott 2000]. Techniques
include:
 Agglomerative: In this bottom-up approach, each vertex initially belongs to its
own cluster. Clusters are merged incrementally in order of increasing cost. In
single linkage clustering, the cost of merging two clusters depends upon the
closest vertex pair spanning them. In complete linkage clustering, the cost
depends upon the farthest vertex pair spanning the clusters. Newman gives
an algorithm based on modularity Q. Given a partition of the vertices, define a
matrix e where eij is the fraction of edges in G between components i and j.
Then Q is defined as
Q = Σi (eii − ai²), where ai = Σj eij
At each step choose to merge the two clusters that cause the greatest increase
in Q. Agglomerative clustering methods do not find peripheral members
reliably. An additional level of processing is needed to determine at which
level the hierarchy defines the most meaningful communities.
 Divisive: In divisive hierarchical clustering, the entire graph G begins as one
cluster. Edges are removed to partition the cluster into smaller ones, as
opposed to agglomerative where clusters are joined to larger clusters. Girvan
and Newman gave an algorithm based on edge betweenness centrality. The
edge with the highest betweenness centrality is repeatedly removed from the
graph until no edges remain. Edge betweenness can be calculated in O(mn),
giving a total computation time of O(m²n).
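The modularity Q used in Newman's agglomerative method can be computed directly from the matrix e described above. The following is an illustrative sketch (not code from this thesis); the e matrix is assumed to be precomputed.

```python
import numpy as np

def modularity(e):
    """Newman's modularity Q = sum_i (e_ii - a_i^2), where e[i][j] is the
    fraction of edges between components i and j and a_i = sum_j e[i][j]."""
    e = np.asarray(e, dtype=float)
    a = e.sum(axis=1)                      # fraction of edge ends in each component
    return float(np.trace(e) - np.sum(a ** 2))
```

For example, a two-community partition with e = [[0.4, 0.1], [0.1, 0.4]] yields Q = 0.8 − 0.5 = 0.3.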
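The edge betweenness underlying the divisive Girvan-Newman step can be computed by a Brandes-style accumulation over BFS trees. This sketch (not thesis code) assumes an unweighted, undirected graph stored as an adjacency dictionary.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph given as
    {node: set_of_neighbours}. Returns {frozenset({u, v}): score}."""
    scores = defaultdict(float)
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        dist = {s: 0}
        sigma = defaultdict(float)
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate edge dependencies from the leaves back toward s
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                contrib = sigma[v] / sigma[w] * (1.0 + delta[w])
                scores[frozenset((v, w))] += contrib
                delta[v] += contrib
    # each undirected edge is counted once from each endpoint's BFS
    return {edge: c / 2.0 for edge, c in scores.items()}
```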
Clauset et al. [2] state that hierarchical structure is actually a defining component of
social networks; sufficient to explain power-law degree distributions, high clustering
coefficients, and short path lengths (the small-world phenomenon). The hierarchical
random graph model is a dendrogram with probabilities at internal nodes. The
probability of an edge between two leaves is equal to the value in their lowest
common ancestor. This model produces networks exhibiting the properties of small-
world networks. They also give a statistics-based algorithm for inferring the most
likely hierarchical random graph model from a given network.
2.1.2 Expanding an existing community
The essential function of a community expansion method is to identify the
most promising next member to add to the community. This is achieved by assigning
a score to all entities in the network, and selecting the highest-scoring outside vertex
to join the community. Given below is a description of several different possible
scoring criteria to rank the selection:
 Neighbour Count: The most obvious candidates for incorporation have many
neighbours in the community. Basketball players tend to be associated with
other basketball players, musicians with other musicians, etc.
 Juxtaposition Count: One drawback of using a simple neighbour count
criterion is that each neighbour is given the same weight, regardless of the
strength of the relation. The edge weights defining the network are
co-occurrence frequencies of the given entity pair. Using such juxtaposition
weights assigns more importance to neighbours that are more frequently
associated in the text with in-community members.
 Neighbour Ratio: A failing of such counting scores is that the status of
ubiquitous entities gets artificially elevated. A frequent entity like “George
Bush” has over a thousand neighbours in the graph, and hence will have
neighbours from many communities. Say six of these neighbours are chemists.
The raw neighbour count score would identify George Bush as more likely to
be a chemist than John Dalton, an entity that has only 8 neighbours (5 of
which are chemists). But if the vertex degree is factored in and a ratio used,
Dalton becomes promoted to the most likely chemist.
 Juxtaposition Ratio: The bias to ubiquitous entities is also present in
juxtaposition counts. Edges to “George Bush” tend to have high weight,
simply because of the total frequency of the entity. Using a ratio helps control
for high-frequency vertices.
 Binomial Probability: Using ratios has the problem of artificially elevating
the importance of infrequent entities. An entity with 100 neighbours, 60 of
which are chemists, would have a neighbour ratio of 0.6. But an entity with a
single neighbour who happened to be a chemist would have a ratio of 1.
Normalize for this by computing the probability Pr[n, k] that an entity would
happen to have at least k of its n neighbours from the group by chance:
Pr[n, k] = Σi=k..n C(n, i) p^i (1 − p)^(n−i)
where p is the fraction of known-group members to network nodes. When
Pr[n, k] is extremely low for an observed k in-group neighbours, then it can be
reasoned that the entity must be a member of the community.
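The probability Pr[n, k] is a binomial tail and can be evaluated directly; a minimal sketch:

```python
from math import comb

def tail_prob(n, k, p):
    """Pr[n, k]: probability that at least k of n neighbours fall in the
    group by chance, where p is the fraction of known group members."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

For instance, with p = 0.5 the chance that at least one of two neighbours is in-group is 0.75.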
Assume the community C, and the set of nodes adjacent to the community, B (each
has at least one neighbour in C). At each step, one or more nodes from B are chosen
and agglomerated into C. Then B is updated to include any newly discovered nodes.
This continues until an appropriate stopping criterion is satisfied. When the
algorithms begin, C = {s} and B contains the neighbours of s: B = {n(s)}.
The Clauset algorithm [2] focuses on nodes inside C that form a “border” with B:
each has at least one neighbour in B. Denoting this set Cborder and focusing on
incident edges, Clauset defines the following local modularity:
R = Σij Bij [i ∈ C][j ∈ C] / Σij Bij
where Bij is the adjacency matrix comprising only those edges with one or more
endpoints in Cborder and [P] = 1 if proposition P is true, and zero otherwise. Each
node in B that can be agglomerated into C will cause a change in R, ∆R, which may
be computed efficiently. At each step, the node with the largest ∆R is agglomerated.
This modularity R lies on the interval 0 ≤ R ≤ 1 (defining R = 1 when |Cborder| = 0) and
local maxima indicate good community separation. For a network of average degree
d, the cost to agglomerate |C| nodes is O(|C|²d) [6].
The LWP algorithm defines a different local modularity, which is closely related to the
idea of a weak community [9]. Define the number of edges internal and external to C
as Min and Mout, respectively:
Min = (1/2) Σi,j Aij [i ∈ C][j ∈ C] and Mout = Σi,j Aij [i ∈ C][j ∉ C]
The LWP local modularity Mf is then:
Mf = Min / Mout
When Mf > 1/2, C is a weak community. The algorithm consists of agglomerating
every node in B that would cause an increase in Mf (∆Mf > 0), then removing every
node from C whose removal would also lead to ∆Mf > 0, so long as the node’s
removal does not disconnect the subgraph induced by C. (Removed nodes are not
returned to B; they are never re-agglomerated.) Finally B is updated and the process
repeats until a step where the net number of agglomerations is zero. The algorithm
returns a community if Mf > 1 and s ∈ C. Similar to the Clauset method [2], the cost
of agglomerating |C| nodes is O(|C|²d).
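The quantities Min, Mout and Mf can be computed with a direct scan over the community's incident edges; a sketch, assuming the graph is stored as an adjacency dictionary:

```python
def local_modularity_f(community, adj):
    """LWP local modularity M_f = M_in / M_out for community C, where
    adj maps each node to its set of neighbours."""
    m_in = 0   # internal edge endpoints, counted once from each side
    m_out = 0  # edges leaving the community
    for v in community:
        for u in adj[v]:
            if u in community:
                m_in += 1
            else:
                m_out += 1
    return (m_in / 2) / m_out if m_out else float("inf")
```

On a triangle {0, 1, 2} with one pendant edge 2-3, Min = 3 and Mout = 1, so Mf = 3, well above the weak-community threshold of 1/2.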
A number of approaches evaluated nodes based on the number of neighbours
they had in and out of the community, adding nodes to the community when they
optimized a function of a specific quantity. Bagrow [1] did this for a measure called
outwardness, defined as the degree-normalized difference between neighbours inside
and outside the community.
The “outwardness” Ωv(C) of node v ∈ B from community C:
Ωv(C) = (1/kv) Σi∈n(v) ([i ∉ C] − [i ∈ C]) = (kv^out − kv^in) / kv
where n(v) are the neighbours of v. In other words, the outwardness of a node is the
number of neighbours outside the community minus the number inside, normalized by
the degree. Thus, Ωv has a minimum value of −1 if all neighbours of v are inside C,
and a maximum value of 1 − 2/kv, since any v ∈ B must have at least one neighbour
in C. Since finding a community corresponds to maximizing its internal edges while
minimizing external ones, agglomerate the node with the smallest Ω at each step,
breaking ties at random.
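The outwardness score and the smallest-Ω agglomeration choice can be sketched as follows. This is illustrative only (a hypothetical adjacency dictionary; random tie-breaking is omitted).

```python
def outwardness(node, community, adj):
    """Bagrow's outwardness: (neighbours outside C - neighbours inside C),
    normalized by the node's degree."""
    neighbours = adj[node]
    k_out = sum(1 for u in neighbours if u not in community)
    k_in = len(neighbours) - k_out
    return (k_out - k_in) / len(neighbours)

def next_member(boundary, community, adj):
    """Agglomerate the boundary node with the smallest outwardness."""
    return min(boundary, key=lambda v: outwardness(v, community, adj))
```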
Figure 2.1 a: The community C is surrounded by a boundary of explored nodes B.
This exploration implies an additional layer of nodes that are known only due to their
adjacencies with B.
Figure 2.1 b: Two nodes i and j in B, with Ωi = 2/3 and Ωj = −1. Moving node j into C
will give improved community structure, compared to moving i.
Figure 2.1(a) and (b): Neighbour counting explanation.
2.2 Greedy Algorithm
Algorithm 1: Greedy algorithm for maximising modularity
Input: graph G = (V, E)
Output: clustering C of G
C ← singletons
initialize matrix ∆
while |C| > 1 do
    find {i, j} such that ∆i,j is the maximum entry in ∆
    merge clusters i and j
    update ∆
return clustering with highest modularity
The greedy algorithm starts with the singleton clustering and iteratively merges those
two clusters that yield a clustering with the best modularity, i.e., the largest increase
or the smallest decrease is chosen. After n−1 merges the clustering that achieved the
highest modularity is returned. The algorithm maintains a symmetric matrix ∆ with
entries ∆i,j := q(Ci,j) − q(C), where C is the current clustering and Ci,j is obtained
from C by merging clusters Ci and Cj. Note that there can be several pairs i and j
such that ∆i,j is the maximum; in these cases the algorithm selects an arbitrary pair.
The pseudo-code for the greedy algorithm is given in Algorithm 1.
An efficient implementation using sophisticated data structures requires O(n² log n)
runtime [1]. Note that n−1 iterations is an upper bound, and one can terminate the
algorithm when the matrix ∆ contains only non-positive entries. This property is
called single-peakedness. Since it is NP-hard to maximize modularity [1] in general
graphs, it is unlikely that this greedy algorithm is optimal. In fact, for a certain graph
family, the above greedy algorithm has an approximation factor of 2, asymptotically.
Furthermore, there are instances where a specific way of breaking ties of merges
yields a clustering with modularity of 0, while the optimum clustering has a strictly
positive score. Modularity is defined such that it takes values in the interval [−1/2, 1]
for any graph and any clustering. In particular, the modularity of a trivial clustering
placing all vertices into a single cluster has a value of 0. This technical peculiarity
shows that the greedy algorithm has an unbounded approximation ratio.
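For concreteness, here is a compact sketch of the greedy merge procedure (an illustration, not the thesis's implementation, and without the sophisticated data structures mentioned above): it maintains the matrix e of inter-cluster edge fractions, repeatedly merges the pair with the largest gain 2(eij − ai·aj), and returns the best clustering seen.

```python
import numpy as np

def greedy_modularity(adj):
    """Greedy agglomeration: start from singletons, merge the pair of
    clusters with the largest modularity gain, keep the best clustering."""
    adj = np.asarray(adj, dtype=float)
    m = adj.sum() / 2.0                  # number of edges
    e = adj / (2.0 * m)                  # e[i][j]: fraction of edge ends between i and j
    clusters = [[v] for v in range(len(adj))]

    def q(mat):
        a = mat.sum(axis=1)
        return float(np.trace(mat) - np.sum(a ** 2))

    best_q = q(e)
    best = [list(c) for c in clusters]
    while len(clusters) > 1:
        a = e.sum(axis=1)
        gain = 2.0 * (e - np.outer(a, a))        # delta-Q for merging each pair
        np.fill_diagonal(gain, -np.inf)
        i, j = divmod(int(gain.argmax()), len(clusters))   # i < j by row-major order
        e[i, :] += e[j, :]                       # fold cluster j's rows/columns into i
        e[:, i] += e[:, j]
        e = np.delete(np.delete(e, j, axis=0), j, axis=1)
        clusters[i].extend(clusters[j])
        del clusters[j]
        new_q = q(e)
        if new_q > best_q:
            best_q = new_q
            best = [list(c) for c in clusters]
    return best, best_q
```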
2.3 Drawbacks of Greedy Algorithm
 Asymptotic growth of the value of a metric implies a strong dependence on the
size of the network and the number of modules the network contains [6].
 The resolution limit is a problem where communities below a certain small size
are merged into larger ones [6]. A classic example where modularity cannot
identify communities of small size is a cycle of m cliques. Here maximum
modularity is obtained if two neighbouring cliques are merged [4].
 Degeneracy of solutions is a problem where a community scoring function (e.g.
modularity) admits multiple distinct high-scoring solutions and typically lacks
a clear global maximum, thereby requiring tie-breaking [6].
2.4 Drawbacks of PageRank Algorithm
Modularity is a property of the network that measures when a division is good, in
the sense that there are many edges within the communities and only a few between
them. In modularity-based algorithms, each node of the graph is considered as an
individual community and the communities are joined iteratively based on the
increase in modularity caused by their joining. The pairs producing the maximum
change in modularity are joined. There are a few drawbacks associated with
modularity-based methods: they require information regarding the entire structure of
the network, which is not possible to determine in the case of vast real-world
networks. Also, modularity optimization methods are not able to determine
overlapping communities [1]. In order to detect overlapping communities, clique
percolation can be used. Clique percolation is based on the assumption that a
community consists of fully connected subgraphs and detects overlapping
communities by searching for adjacent cliques. But it is a hard method to implement
well due to the difficulty of producing intermediate representations [4] of percolating
structures.
CHAPTER 3
IMPLEMENTATION
3.1 Finding potential members
The main aim is finding all the potential members that can be added into the
community. The algorithm basically considers all the neighbours of the given
community .All the neighbours are extracted into ADJ[] array. The algorithm grows in
a dynamic fashion. The member with the highest pagerank is added into the
community. Then a fresh set of neighbour is extracted into the ADJ[] array
considering the new member added to the community and the same process is
repeated. A flag array CHECK_ARR[] is given to check if the neighbouring member
has the same set of interest as the community members.
Figure 3.1(a): Finding potential neighbour step one
Figure 3.1(b): Finding potential neighbour step two
1N: 1-hop neighbour
2N: 2-hop neighbour
C: Community
Figures 3.1(a) and 3.1(b) are snapshots of a very large graph (the input webgraph). The
initial set of neighbours is (P, Q, R, S). Assume that PageRank(R) is the highest
among all the neighbours; the node R is therefore added to the community. Now the next
list of 1N (one-hop) neighbours is considered, that is, the neighbours (P, Q, T, S).
The algorithm thus proceeds in a dynamic fashion. At every step it checks whether
the neighbour has the same set of interests as the community. If yes, the
neighbour node is added to the ADJ[] array; otherwise the algorithm passes on to the
next-hop neighbour.
Figure 3.2 depicts a flowchart that shows how the expansion algorithm executes to
find potential members and add them to the community.
Figure 3.2: Flowchart depicting the algorithm
3.2 Improved PageRank
The proposed algorithm is based on the mean value of the PageRanks of all web pages,
with performance advantages over the traditional PageRank algorithm. It is a novel
approach for reducing the number of iterations performed by the PageRank algorithm
before reaching its convergence point:
• Initially assume the PageRank of all web pages to be any value, say 1.
• Calculate the PageRank of all pages by the following formula:
PR(A) = 0.15/N + 0.85 (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
o T1 through Tn are the pages providing incoming links to page A
o PR(T1) is the PageRank of T1
o PR(Tn) is the PageRank of Tn
o C(Tn) is the total number of outgoing links on Tn
o N is the total number of nodes in the graph.
• Calculate the mean value of all PageRanks by the following formula:
o mean = (sum of the PageRanks of all web pages) / (number of web pages)
• Then normalize the PageRank of each page:
o NormPR(A) = PR(A) / mean
o where NormPR(A) is the normalized PageRank of page A and PR(A) is the
PageRank of page A
• Assign PR(A) = NormPR(A)
• Repeat steps 2 to 4 until the PageRank values of two consecutive iterations are the same.
o The pages with the highest PageRank are the most significant pages.
When the original PageRank algorithm runs, the individual PageRank values of the
web pages keep oscillating about their final values. Saturation is reached after a
number of iterations, when each value converges to a single value according to a
convergence factor.
In the proposed improved PageRank algorithm, this oscillation is minimized by
normalizing the PageRank values after every iteration, thereby bringing the
current values closer to the saturation point each time. The procedure for
executing the proposed improved PageRank algorithm is depicted as a
flowchart in figure 3.3.
Figure 3.3: Flowchart depicting Improved PageRank Algorithm
Input:
• A web graph with 50000 vertices (numbered from 0 to 49999) and 50000 edges.
• The nodes that are already part of the community (around 100 nodes).
Figure 3.4: Input webgraph (50000 vertices and 50000 edges)
Figure 3.5: Existing community members
Output:
• The PageRank of every node, calculated using the improved PageRank
algorithm.
• The potential members that can be added to the community, decided using the
PageRank values obtained above.
Figure 3.6: Calculating PageRank using the existing code.
Figure 3.7: Calculating PageRank using improved algorithm.
Figure 3.8: Using the PageRank values got above, the potential members that can
be added to the community are determined.
Performance improvement graphs – Traditional PageRank vs Improved PageRank
Figure 3.9: Number of iterations to find PageRank versus number of nodes
(millions), for traditional PageRank and Improved PageRank.
CHAPTER 4
PERFORMANCE ANALYSIS
4.1 Neighbour Counting algorithm
[Figure 4.1 step values: Q = 0.1152 (O(5) = 0.75, O(6) = 0.25) → Q = 0.1308
(O(5) = 0.20, O(8) = 0.00) → Q = 0.1309 (O(5) = −0.167, O(7) = 0.00) →
Q = 0.2051 (O(7) = −0.2857) → Q = 0.2871]
Figure 4.1: Neighbour counting algorithm steps
Figure 4.1 shows the process of employing the neighbour counting algorithm to expand a
given community; the modularity of the new community is calculated at each step. The next
possible addition to the community is chosen by the property of outwardness (O) [1]:
the immediate neighbour with the least outwardness is the most potent candidate for the
community.
4.2 Greedy algorithm
[Figure 4.2 step values: Q = 0.1152 (O(5) = 0.75, O(6) = 0.25) → Q = 0.1367
(O(6) = 0.20, O(7) = 0.00, O(8) = 0.40) → Q = 0.1758 (O(6) = 0.167, O(8) = 0.00) →
Q = 0.2207 (O(6) = −0.143) → Q = 0.2871]
Figure 4.2: Greedy algorithm steps
In the greedy algorithm, adding more members to the community is decided based on the
modularity value of the attained subgraph. The expanded community at different
iterations is depicted in figure 4.2, and the outwardness (O) of the candidate options is
listed.
4.3 PageRank Algorithm
[Figure 4.3 step values: Q = 0.1152 (O(5) = 0.75, O(6) = 0.25) → Q = 0.1308
(O(5) = 0.20, O(8) = 0.00) → Q = 0.1309 (O(5) = −0.167, O(7) = 0.00) →
Q = 0.2480 (O(5) = −0.4285) → Q = 0.2871]
Figure 4.3: Proposed improved PageRank algorithm steps
The network shown in figure 4.3 depicts the expansion process using the proposed
improved PageRank algorithm.
4.4 Comparing performances of PageRank and the proposed Improved
PageRank
The existing PageRank algorithm and the proposed Improved PageRank algorithm
were executed to find the PageRank of all the nodes in the previous graph. Each graph
from figure 4.4(a) to 4.4(e) represents an iteration of the process.
Figure 4.4(a): Graph showing iteration 1 PageRank values
Figure 4.4(b): Graph showing iteration 2 PageRank values
Figure 4.4(c): Graph showing iteration 3 PageRank values
Figure 4.4(d): Graph showing iteration 5 PageRank values
Figure 4.4(e): Graph showing the final iteration's PageRank values
Figures 4.5 and 4.6 show a comparison between the three algorithms,
which were tested on two different characteristics: modularity and
outwardness.
Figure 4.5: Modularity comparison between the three algorithms.
Figure 4.6: Outwardness comparison between the three algorithms.
4.5 Inference
Modularity is found to increase continually for all three algorithms. Since the
greedy structural optimization method adds members to the group by
maximizing modularity, it shows better results than neighbour counting.
Neighbour counting proceeds by adding the node with the least outwardness value, and
hence the resulting community has lower outwardness than those of greedy
structural optimization and the PageRank algorithm.
Finally, it has been found that in the long run the PageRank algorithm produces a
community with higher modularity and outwardness comparable to both greedy
structural optimization and the neighbour counting algorithm. This makes the final result
more efficient, letting the system add more potent neighbours to the given
community.
The proposed improved PageRank algorithm reduces the number of iterations taken to
reach saturation in comparison with the traditional PageRank algorithm. This reduces the
time taken to propose new members for the community, thereby improving
the efficiency of expanding a given community using seed sets. The effect is more
pronounced when the data set is of the order of one lakh (100,000) nodes.
CONCLUSION AND FUTURE WORK
The seed set expansion problem has its roots in a number of overlapping areas,
including the problem of identifying central nodes in social networks [3] and finding
related and/or important Web pages from an initial set of query results [3]. In
particular, the PageRank algorithm broadened from its initial focus on Web search [9]
to also include methods for finding nodes “similar” to an initial root, by starting short
random walks from the root and seeing which other nodes were likely to be reached
[3].
The seed set expansion problem has been gaining visibility as a general-purpose
framework for identifying members of a networked community from a small set of
initial examples. But subtle trade-offs in the formulation and the underlying methods can
have a significant impact on how this process works. In this project, several
such principles have been identified concerning the relative power of different expansion
heuristics and the structural properties of the initial seed set. The investigations have
involved analyses of datasets across diverse domains as well as theoretical trade-offs
between different problem formulations. There are a number of interesting directions
for further work.
In particular, the power of PageRank-based methods raises the question of whether
these are indeed the "right" algorithms for seed set expansion, or whether they should
be viewed as proxies for a richer set of probabilistic approaches that could yield
strong performance. Second, the damping factor, which is assumed here to be a
constant, could be varied; over different seed sets, varying the damping factor
could lead to anomalies and special cases that need to be studied carefully. A richer
understanding of the seed sets that lead to the most effective expansions to a larger
community could provide useful insights for the application of these methods. And
finally, as noted earlier, nodes in a network tend to belong to multiple communities
simultaneously, and a robust way of expanding several overlapping communities
together is a natural question for further study.
REFERENCES
[1] James P. Bagrow. "Evaluating local community methods in networks". Journal
of Statistical Mechanics: Theory and Experiment, pages 15-19, 2008.
[2] Aaron Clauset. "Finding local community structure in networks". Physical
Review E, 72(2):026132, 2005.
[3] Isabel M. Kloumann and Jon M. Kleinberg. "Community membership
identification from small seed sets". In KDD, 2014.
[4] Andrew Mehler and Steven Skiena. "Expanding network communities from
representative examples". ACM Transactions on Knowledge Discovery from Data
(TKDD), pages 14-19, 2009.
[5] Jaewon Yang and Jure Leskovec. "Defining and evaluating network
communities based on ground-truth". In MDS '12, page 3. ACM, 2012.
[6] J. Leskovec, K. J. Lang, and M. Mahoney. "Empirical comparison of
algorithms for network community detection". In WWW, pages 631-640, New York,
USA, 2010.
[7] Reid Andersen, Fan Chung, and Kevin Lang. "Local graph partitioning using
PageRank vectors". In Foundations of Computer Science, pages 475-486, 2006.
[8] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel.
"You are who you know: inferring user profiles in online social networks". In
WSDM '10, pages 251-260. ACM, 2010.
[9] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The
PageRank citation ranking: Bringing order to the web". Technical report, 1999.
BIBLIOGRAPHY
[1] Community Detection in Social Media – www.slideshare.com
[2] Empirical Comparison of Algorithms in Network Community
Detection– dl.acm.org
[3] Louvain Method for Community Detection– perso.uclouvain.be
[4] Applications of Community Detection – royalsocietypublishing.org
[5] IEEE Xplore – ieeexplore.ieee.org
[6] Wikipedia – wikipedia.org
[7] SNAP – Stanford Database Collection
APPENDICES
APPENDIX A
SNIPPETS OF CODES
• Improved PageRank algorithm

void Table::pagerank() {
    vector<size_t>::iterator ci;   // current incoming link
    double diff = 1;
    size_t i;
    double sum_pr;       // sum of current PageRank vector elements
    double dangling_pr;  // PageRank mass held by dangling nodes
    unsigned long num_iterations = 0;
    vector<double> old_pr;

    size_t num_rows = rows.size();
    if (num_rows == 0) {
        return;
    }
    pr.resize(num_rows);
    pr[0] = 1;
    if (trace) {
        print_pagerank();
    }
    while (diff > convergence && num_iterations < max_iterations) {
        sum_pr = 0;
        dangling_pr = 0;
        for (size_t k = 0; k < pr.size(); k++) {
            double cpr = pr[k];
            sum_pr += cpr;
            if (num_outgoing[k] == 0) {
                dangling_pr += cpr;
            }
        }
        if (num_iterations == 0) {
            old_pr = pr;
        } else {
            // normalize the previous iteration's values by their sum
            for (i = 0; i < pr.size(); i++) {
                old_pr[i] = pr[i] / sum_pr;
            }
        }
        sum_pr = 1;
        double one_Av = alpha * dangling_pr / num_rows;   // dangling-node share
        double one_Iv = (1 - alpha) * sum_pr / num_rows;  // teleportation share
        diff = 0;
        for (i = 0; i < num_rows; i++) {
            double h = 0.0;
            for (ci = rows[i].begin(); ci != rows[i].end(); ci++) {
                double h_v = (num_outgoing[*ci]) ? 1.0 / num_outgoing[*ci] : 0.0;
                if (num_iterations == 0 && trace) {
                    cout << "h[" << i << "," << *ci << "]=" << h_v << endl;
                }
                h += h_v * old_pr[*ci];
            }
            h *= alpha;
            pr[i] = h + one_Av + one_Iv;
            diff += fabs(pr[i] - old_pr[i]);
        }
        num_iterations++;
        if (trace) {
            cout << num_iterations << ": ";
            print_pagerank();
        }
    }
    cout << "\n.......num_iterations:" << num_iterations << "\n";
}
• Adding potential neighbours

struct pair1 {
    int vertex;
    double pagerank;
};

while (infile_proc1 >> a >> b >> c) {
    if (check_arr[atoi(a)]) {
        list<int>::iterator i;
        for (i = adj[atoi(a)].begin(); i != adj[atoi(a)].end(); ++i) {
            if (!vis[*i] && !check_arr[*i]) {
                v[n].vertex = *i;
                v[n].pagerank = value[*i];
                vis[*i] = 1;
                total_sum += value[*i];
                n++;
            }
        }
    }
}
// all-pairs exchange sort: orders candidates by descending PageRank
for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++) {
        if (v[j].pagerank > v[k].pagerank) {
            pair1 temp = v[j];
            v[j] = v[k];
            v[k] = temp;
        }
    }
APPENDIX B
GLOSSARY OF TERMS
[1] Conductance
In graph theory the conductance of a graph G=(V,E) measures how "well-knit" the
graph is: it controls how fast a random walk on G converges to a uniform distribution.
The conductance of a graph is often called the Cheeger constant of a graph as the
analog of its counterpart in spectral geometry [8].
[2] Modularity
Modularity is one measure of the structure of networks or graphs. It was designed to
measure the strength of division of a network into modules (also called groups,
clusters or communities). Networks with high modularity have dense connections
between the nodes within modules but sparse connections between nodes in different
modules. Modularity is often used in optimization methods for detecting community
structure in networks.
[3] Ground-Truth Community
Generally, after communities are identified in a given network, the essential next step
is to interpret them by identifying a common external property [5] that all the
members share and around which the community organizes. Thus, the goal of
network community detection is to identify sets of nodes with a common (often
external/latent/unobserved) property based only on the network connectivity structure. A
"common property" can be a common attribute, affiliation, role, or function.
A distinction is made between network communities and groups. A community is
defined structurally (i.e., a set of nodes extracted by the community detection
algorithm), while a group is defined based on nodes sharing a property around which
the nodes organize in the network (e.g., belonging to a common interest-based group,
or sharing a common affiliation).
Using ground-truth communities allows for quantitative, large-scale
evaluation [5] and comparison of different community detection methods. This
ability represents a significant step forward, as the field can move beyond the current
standard of anecdotal evaluation of communities to comprehensive evaluation of the
performance of community detection methods. Ground-truth communities are
structurally most similar to the communities discovered by random-walk methods [5].
[4] Ego Networks
Ego networks consist of a focal node ("ego") and the nodes to whom the ego is directly
connected (called "alters"), plus the ties, if any, among the alters. Of
course, each alter in an ego network has his or her own ego network, and all ego
networks interlock to form the human social network. The denser the ties in an ego
network, the stronger the ties, and the more insular and homogeneous the ego network.
Typical measures:
• Homophily
• Size
• Average strength of ties
• Heterogeneity
• Density
• Composition (e.g., % women, % whites, etc.)
• Range: substantively defined as potential access to social resources, often
measured as the diversity of alters based on the weak-ties argument. Density is
thought of as an inverse measure of range; size and heterogeneity are also seen as
measures of range.
[5] Outwardness
The outwardness of a node is the number of its neighbours outside the community minus
the number inside, normalized by its degree [6]. Thus, Ωv has a minimum value of −1 if
all neighbours of v are inside C, and a maximum value of 1 − 2/kv, since any v ∈ B
must have at least one neighbour in C [4]. Since finding a community corresponds to
maximizing its internal edges while minimizing external ones, the node with the
smallest Ω is agglomerated at each step, breaking ties at random.
  • 1. IMPROVING EFFICIENCY OF COMMUNITY MEMBER IDENTIFICATION USING SEED SET EXPANSION A thesis submitted in partial fulfilment of the requirements for the award of the degree of B. Tech In Computer Science and Engineering By Abishek Prasanna (106111002) R Sibi (106111068) Rahul R (106111070) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING NATIONAL INSTITUTEOF TECHNOLOGY TIRUCHIRAPALLI-620015 MAY 2015
  • 2. BONAFIDE CERTIFICATE This is to certify that the project titled IMPROVING EFFICIENCY OF COMMUNITY MEMBER IDENTIFICATION USING SEED SET EXPANSION is a bonafide record of the work done by Abishek Prasanna (106111002) R Sibi (106111068) Rahul R (106111070) in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering of the NATIONAL INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI, during the year 20014- 2015. Dr. E. Sivasankar Dr. (Mrs) R. Leela Velusamy Guide Head of the Department Project Viva-voce held on _____________________________ Internal Examiner External Examiner
  • 3. i ABSTRACT In many applications, network of people are involved and would like to identify the members of an interesting but unlabelled group or community. Start with a small number of exemplar group members – they may be followers of a political ideology or fans of a music genre – and use those examples to discover the additional members. This problem gives rise to the seed expansion problem in community detection: given example community members, how can the social graph be used to predict the identities of remaining, hidden community members? In contrast with global community detection (graph partitioning or covering), seed expansion is best suited for identifying communities locally concentrated around nodes of interest. A growing body of work has used seed expansion as a scalable means of detecting overlapping communities. Yet despite growing interest in seed expansion, there are divergent approaches in the literature and there still isn’t a systematic understanding of which approaches work best in different domains. Here several variants are evaluated and subtle trade-offs between different approaches have been uncovered. The different ideas in the algorithms that give room for performance gains, focusing on heuristics that one can control in practice are explored. As a consequence of this systematic understanding, several opportunities for performance gains were discovered. We have thereby managed to develop our own modification to the PageRank algorithm and have shown its higher performance in comparison with the existing ones. This leads to interesting connections and contrasts with active learning and the trade-offs of exploration and exploitation. Finally, we explore the expansion problem by bringing in an adaptive algorithm that is found to work well with the improved version that we have come up with. We evaluate our methods across multiple domains, using publicly available datasets with labelled, ground-truth communities. 
Keywords: Seed set expansion, Ground-truth communities
  • 4. ii ACKNOWLEDGEMENTS We would like to thank our project guide Dr.E.Sivasankar , Assistant Professor, Department of Computer Science and Engineering , for his constant guidance , encouragement and help during the entire duration of the project. His enthusiasm has been a driving force for our efforts through the course of this project. We would also like to offer our sincere thanks to Dr. (Mrs).R.Leela Velusamy, Head of the Department, Computer Science and Engineering, National Institute of Technology, Trichy who provided us with the necessary environment, tools and feedback for the implementation of the project.
  • 5. iii TABLE OF CONTENTS Title Page Number ABSTRACT……………………………………………………………… i ACKNOWLEDGEMENTS …………………………………………….. ii TABLE OF CONTENTS…………………………………………………. iii LIST OF FIGURES ……………………………………………………… v NOTATIONS …………………………………………………………….. vi CHAPTER 1: INTRODUCTION 1.1 Motivation …………………………………………………….. 1 1.2 Community ……………………………………………………. 1 1.3 Community detection …………………………………………. 2 1.4 Practical Applications of Clustering , Community Detection… 3 1.5 Purpose Of Community Detection ……………………………... 4 1.5.1 Is it necessary to extract groups based on network topology? 5 1.5.2 Importance of network interaction ………………………. 5 1.6 Challenges ……………………………………………………… 5 1.7 Thesis Overview………………………………………………… 7 CHAPTER 2: LITERATURE SURVEY 2.1 Neighbour Counting Algorithm …………………………………. 8 2.1.1 Community Discovery Methods …………………………. 8 2.1.1.1 Graph Partition Techniques ………………………… 8 2.1.1.2 Hierarchical Clustering …………………………….. 9 2.1.2 Expanding an existing community………………………… 10 2.2 Greedy algorithm ……………………………………………….. 14 2.2 Drawbacks of Greedy algorithm ……………………………….. 16 2.3 Drawbacks of PageRank algorithm …………………………….. 16
  • 6. iv CHAPTER 3: IMPLEMENTATION 3.1 Finding potential members ………………………………………… 17 3.2 Improved PageRank ……………………………………………….. 19 CHAPTER 4: PERFORMANCE ANALYSIS 4.1 Neighbour Counting ………………………………………………. 25 4.2 Greedy Algorithm …………………………………………………. 26 4.3 PageRank algorithm ………………………………………………. 27 4.4 Comparing performances of PageRank and the proposed Improved PageRank ………………………………………………………….. 28 4.5 Inference …………………………………………………………... 31 CONCLUSION AND FUTURE WORK …………………………………..... 32 REFERENCES ……………………………………………………………….. 33 BIBLIOGRAPHY …………………………………………………………… 34 APPENDICES Appendix A Snippets of Codes ……………………………………... 36 Appendix B Glossary of Terms …………………………………….. 38
  • 7. v LIST OF FIGURES Figure Page No 1.1 Community Structure …..........................................................................… 2 1.2 Categorization of various search engines …............................................... 4 1.3 Typical life cycle of a social media network …………….......................... 4 2.1 Neighbour counting representation ….......................................................... 14 3.1 Finding potential neighbour ….................................................................... 17 3.2 Flowchart depicting neighbour counting algorithm ….................................. 19 3.3 Flowchart depicting improved PageRank algorithm …............................... 21 3.4 Input WebGraph …........................................................................................ 22 3.5 Existing community members …................................................................. 22 3.6 Calculating PageRank using the existing code …......................................... 23 3.7 Calculating PageRank using improved algorithm ….................................... 23 3.8 Potential members are determined …............................................................ 24 3.9 Graph showing analysis between PageRank and Improved page rank …….. 24 4.1 Neighbour counting algorithm steps ………………………………………… 25 4.2 Greedy algorithm steps ……………………………………………………… 26 4.3 Proposed Improved PageRank Algorithm steps ……………………………. 27 4.4 Graphs showing iteration of PageRank ……………………………………… 28 4.5 Modularity Comparison between the three algorithms ……………………… 30 4.6 Outwardness Comparison between the three algorithms ……………………. 31
NOTATIONS
Q - Modularity
eij - Edge directed from node 'i' to node 'j'
Pr[n, k] - Probability that an entity would happen to have at least 'k' of its 'n' neighbours from the group by chance
p - Fraction of known group members to network nodes
R - Local modularity
Bij - Adjacency matrix comprising only those edges with one or more endpoints in the community
Min - Number of edges internal to the community
Mout - Number of edges external to the community
Ωv(C) - Outwardness of vertex 'v' in community 'C'
kv - Degree of vertex 'v'
kv_out - Number of neighbours outside the community
kv_in - Number of neighbours inside the community
PR(A) - PageRank of node 'A'
C(A) - Total number of outgoing links on 'A'
N - Total number of nodes in the graph
d - Damping factor
CHAPTER 1
INTRODUCTION

1.1 Motivation

Networks are omnipresent on the Web. The most prominent Web network is the Web itself, comprising billions of pages as vertices and their hyperlinks to each other as edges. Moreover, collecting and processing the input of Web users (e.g. queries, clicks) results in other forms of networks, such as the query graph. Finally, the widespread use of Social Media applications, such as Bibsonomy, IMDB, Flickr and YouTube, is responsible for the creation of even more networks, ranging from folksonomy networks to rich media social networks. Analyzing such networks not only yields insights into the social phenomena and processes that take place in the world, but also produces actionable knowledge that can be beneficial in several information management and retrieval tasks, such as online content navigation and recommendation. However, the analysis of such networks poses serious challenges to data mining methods, since these networks are almost invariably characterized by huge scale and a highly dynamic nature.

A valuable tool in the analysis of large complex networks is community detection. The problem that community detection attempts to solve is the identification of groups of vertices that are more densely connected to each other than to the rest of the network. Detecting and analyzing the community structure of networks has led to important findings in a wide range of domains, ranging from biology to the social sciences and the Web. Such studies have shown that communities constitute meaningful units of organization and that they provide new insights into the structure and function of the whole network under study. Recently, there has been increasing interest in applying community detection to Social Media networks, not only as a means of understanding the underlying phenomena taking place in such systems, but also to exploit its results in a wide range of intelligent services and applications, e.g.
recommendation engines, automatic event detection in Social Media content.
1.2 Community

A community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group. A network community (also sometimes referred to as a module or cluster) is typically thought of as a group of nodes with more and/or better interactions amongst its members than between its members and the remainder of the network. Figure 1.1 shows a sample community structure with three interlinked communities.

Figure 1.1: Community structure

1.3 Community Detection

Several attempts have been made to provide a formal definition for this generally described concept of community detection in networks. A strong community was defined as a group of nodes for which each node of the community has more edges to other nodes of the same community than to nodes outside the community. This is a relatively strict definition, in the sense that it does not allow for overlapping communities, and it creates a hierarchical community structure since the entire graph can itself be a community. A weak community was later defined as a subgraph in which the sum of all node degrees within the community is larger than the sum of all node degrees toward the rest of the graph [6].
Community detection is the process of discovering groups in a network where individuals' group memberships are not explicitly given. The problem of cluster or community detection in real-world graphs involving large social networks, web graphs and biological networks is of considerable practical interest and has received much attention recently. To extract such sets of nodes, one typically chooses an objective function that captures the above intuition of a community as a set of nodes with better internal connectivity than external connectivity. Then, since the objective is typically NP-hard to optimize exactly, one employs heuristics or approximation algorithms to find sets of nodes that approximately optimize the objective function and that can be understood or interpreted as real communities. Alternatively, one might define communities operationally as the output of a community detection procedure, hoping they bear some relationship to the intuition of what it means for a set of nodes to be a good community. Once extracted, such clusters of nodes are often interpreted as organizational units in social networks, functional units in biochemical networks, ecological niches in food web networks, or scientific disciplines in citation and collaboration networks.

1.4 Practical applications of clustering and community detection

• Recommendation tools for forming on-line groups have the potential to collect a few initial suggestions from a user and then produce a longer list of recommended group members.
• Similarly, a marketer may want to expand a set of a few interested consumers of a product into a longer list of people who might also be interested in the product.
• Seed set expansion has also been used to infer missing attributes in user profile data [3] and to detect e-mail addresses of spammers.
• It simplifies visualization and analysis of complex graphs.
• Search engines – categorization. Figure 1.2 depicts a dendrogram of various search engines.
A similarity threshold is employed to isolate the clusters of similar qualities.
Figure 1.2: Categorization of various search engines. (The clustering is obtained by cutting the dendrogram at a desired level; each connected component then forms a cluster, and a similarity threshold is used to obtain clusters of the desired quality.)

• Social networks - useful for tracking group dynamics. A social media network's typical life cycle is depicted in figure 1.3: a raw social media network is formulated by clustering recorded transactions, and it is further simplified to form a clean network.

Figure 1.3: Typical life cycle of a social media network (recorded transactions → raw social media network → clean simplified network).

• Neural networks - tracks functional units. One major challenge in neuroscience is to identify the functional modules from multichannel, multi-subject recordings. Most research on community detection has focused on finding the association matrix based on functional connectivity instead of effective connectivity, thus failing to capture causality in the network.

• Food webs - helps isolate co-dependent groups of organisms.
1.5 Purpose of community detection

• Understanding the interactions between people.
• Visualising and navigating huge networks.
• Forming the basis for other tasks such as data mining.
• Social networks often include community groups based on common location, interests, occupation, etc. Communities are present in metabolic networks based on functional groupings, and they are formed in citation networks based on research topic. Identifying these sub-structures within a network can provide knowledge about how network function and topology affect each other.

1.5.1 Is it necessary to extract groups based on network topology?

• Not all social media websites provide a community platform.
• Not all people want to make the effort to join groups.
• Through community extraction, communities can be suggested to people based on their interests.
• Groups in the real world change dynamically.
• Besides social media websites, it is essential to extract communities in other networks, such as citation networks, the World Wide Web, and metabolic networks, for various practical purposes.

1.5.2 Importance of network interaction

• Rich information about the relationships between users can be obtained by analysing network interaction, which can complement other kinds of information, e.g. user profiles.
• It provides basic information that is essential for other tasks, e.g. recommendation.
• Analysing network interaction helps in network visualization and navigation.
1.6 Challenges

The major challenges usually encountered in the problem of community detection in networks are highlighted below:

• Scalability
The amount of online media content on the internet is rising every day at a tremendous rate. Currently, the sizes of such networks are on the scale of billions of nodes and connections. As the network expands, both the space required to store the network and the time complexity to process it increase dramatically. This imposes a great challenge on conventional community detection algorithms, which are traditionally designed to handle networks of at most a few thousand nodes.

• Heterogeneity
Raw media networks comprise multiple types of edges and vertices. Usually, they are represented as hypergraphs or k-partite graphs. The majority of community detection algorithms are not applicable to hypergraphs or k-partite graphs. For that reason, it is common practice to extract simplified network forms that depict partial aspects of the complex interactions of the original network.

• Evolution
Due to the highly dynamic nature of social media data, the evolving nature of the network should be taken into account in network analysis applications. So far, the discussion on community detection has progressed under the tacit assumption that the network under consideration is static. Time awareness should be incorporated into community detection approaches.
• Evaluation
The lack of reliable ground truth makes evaluation extremely difficult [7]. Currently, the performance of community detection methods is evaluated by manual inspection. Such anecdotal evaluation procedures require extensive manual effort, are non-comprehensive, and are limited to small networks.

• Privacy
Privacy is a big concern in social media; Facebook and Google often appear in debates about privacy. Simple anonymity does not necessarily protect privacy. As private information is involved, a secure and trustworthy system is critical. Hence, a lot of valuable information is not made available due to security concerns.

1.7 Thesis Overview

The remainder of the thesis is organized as follows. In the next chapter, a background of the algorithms researched and their pros and cons are discussed. The subsequent parts deal with the development of an improved version of the existing PageRank algorithm and how it can be used to solve the problem of community expansion. The third chapter explains the implementation of the proposed improved PageRank algorithm. The final chapter presents a comparison between the three algorithms and a brief performance analysis of the improved algorithm. The thesis ends with a short conclusion and notes on the future scope of these algorithms.
CHAPTER 2
LITERATURE SURVEY

2.1 Neighbour Counting Algorithm

The algorithm works in two phases: the community discovery phase and the expanding phase. Discovery is concerned with finding a group of entities that are members of a community, while expanding seeks to identify further members of a community given part of its membership.

2.1.1 Community Discovery Methods

Members of natural groups in a network tend to have a high density of connections between them, with lower connectivity between different groups. Discovering communities is typically viewed as a clustering problem, with specific techniques being more applicable to social networks. A large class of methods operates on a global scale, where every vertex is assigned to a single community. An overview of these methods follows.

2.1.1.1 Graph Partition Techniques

Bisection techniques attempt to partition the network into two relatively separate subgraphs. Several methods are effective at identifying a single bisection but work less well on graphs containing many distinct communities. An external decision must be made to indicate when to stop bisecting, that is, how many communities exist in the graph. Methods include:

• Max Flow/Min Cut. These methods can produce good bisections, but make no guarantees about keeping both groups of similar size. Flake et al. give a min-cut algorithm based on min-cut trees which is able to produce an arbitrary number of clusters, and it can be expanded to produce a hierarchical clustering.

• Spectral Bisection. Spectral bisection techniques partition a graph based on the eigenvectors of its Laplacian. The Laplacian Q of a graph G is defined as
Q = D − A, where D is an n×n diagonal matrix with Dv,v = d(v) and A is the adjacency matrix of G. The spectral bisection method finds the eigenvector corresponding to the second smallest eigenvalue λ2 and bisects the graph according to whether the eigenvector entry for a vertex is positive or negative. λ2 is also called the algebraic connectivity of a graph; a smaller value indicates a better split into two groups [5].

• Kernighan-Lin Algorithm. This heuristic algorithm attempts to greedily minimize the "external cost" of a partition, which is the sum of the cost of inter-partition edges. It starts with an initial (possibly random) partition and determines the pair of vertices whose swap would produce the largest decrease in cost. This gives a sequence of vertex swaps, which is then scanned to find the minimum. The procedure is then repeated with the new partition as the starting point, until convergence on a local minimum is achieved.

2.1.1.2 Hierarchical Clustering

Hierarchical clustering techniques are driven by an application-specific similarity measure between the groups of vertices of a network [Scott 2000]. Techniques include:

• Agglomerative: In this bottom-up approach, each vertex initially belongs to its own cluster. Clusters are merged incrementally in order of increasing cost. In single linkage clustering, the cost of merging two clusters depends on the closest vertex pair spanning them. In complete linkage clustering, the cost is the sum of the distances of all vertex pairs spanning the clusters. Newman gives an algorithm based on modularity Q. Given a partition of the vertices, define a matrix e where eij is the fraction of edges in G between components i and j. Then Q is defined as

Q = Σi (eii − ai²), where ai = Σj eij.

At each step, choose to merge the two clusters that cause the greatest increase in Q. Agglomerative clustering methods do not find peripheral members
reliably. An additional level of processing is needed to determine at which level the hierarchy defines the most meaningful communities.

• Divisive: In divisive hierarchical clustering, the entire graph G begins as one cluster. Edges are removed to partition the cluster into smaller ones, as opposed to agglomerative clustering, where clusters are joined into larger ones. Girvan and Newman gave an algorithm based on edge betweenness centrality: the edge with the highest betweenness centrality is repeatedly removed from the graph until no edges remain. Edge betweenness can be calculated in O(mn), giving a total computation time of O(m^2 n).

Clauset et al. [2] state that hierarchical structure is actually a defining component of social networks, sufficient to explain power-law degree distributions, high clustering coefficients, and short path lengths (the small-world phenomenon). The hierarchical random graph model is a dendrogram with probabilities at internal nodes; the probability of an edge between two leaves is equal to the value at their lowest common ancestor. This model produces networks exhibiting the properties of small-world networks. They also give a statistics-based algorithm for inferring the most likely hierarchical random graph model from a given network.
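As a concrete illustration of the modularity measure these agglomerative methods optimize, the following minimal sketch computes Q from the e-matrix fractions for a toy graph (the graph, function name, and partitions are illustrative, not taken from the thesis):

```python
def modularity(edges, communities):
    """Newman modularity Q = sum_i (e_ii - a_i^2), where e_ij is the
    fraction of edges between communities i and j and a_i = sum_j e_ij."""
    m = len(edges)
    label = {v: i for i, comm in enumerate(communities) for v in comm}
    e_ii = [0.0] * len(communities)  # fraction of edges inside community i
    a = [0.0] * len(communities)     # fraction of edge endpoints in community i
    for u, v in edges:
        a[label[u]] += 0.5 / m
        a[label[v]] += 0.5 / m
        if label[u] == label[v]:
            e_ii[label[u]] += 1.0 / m
    return sum(e - ai * ai for e, ai in zip(e_ii, a))

# Two triangles joined by a single bridge edge: the natural split scores
# much higher than the trivial one-cluster partition, whose Q is 0.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(round(modularity(edges, [{0, 1, 2}, {3, 4, 5}]), 4))  # -> 0.3571
print(round(modularity(edges, [{0, 1, 2, 3, 4, 5}]), 4))    # -> 0.0
```

A greedy agglomerative method would evaluate exactly this quantity before and after each candidate merge.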
 Juxtaposition Count: One drawback of using a simple neighbour count criterion is that each neighbour is given the same weight, regardless of the strength of the relation. The edge weights defining the network are co-
occurrence frequencies of the given entity pair. Using such juxtaposition weights assigns more importance to neighbours that are more frequently associated in the text with in-community members.

• Neighbour Ratio: A failing of such counting scores is that the status of ubiquitous entities gets artificially elevated. A frequent entity like "George Bush" has over a thousand neighbours in the graph, and hence will have neighbours from many communities. Say six of these neighbours are chemists. The raw neighbour count score would identify George Bush as more likely to be a chemist than John Dalton, an entity that has only 8 neighbours (5 of which are chemists). But if the vertex degree is factored in and a ratio used, Dalton is promoted to the most likely chemist.

• Juxtaposition Ratio: The bias toward ubiquitous entities is also present in juxtaposition counts. Edges to "George Bush" tend to have high weight, simply because of the total frequency of the entity. Using a ratio helps control for high-frequency vertices.

• Binomial Probability: Using ratios has the problem of artificially elevating the importance of infrequent entities. An entity with 100 neighbours, 60 of which are chemists, would have a neighbour ratio of 0.6, but an entity with a single neighbour who happened to be a chemist would have a ratio of 1. Normalize for this by computing the probability Pr[n, k] that an entity would happen to have at least k of its n neighbours from the group by chance:

Pr[n, k] = Σ (i = k to n) C(n, i) p^i (1 − p)^(n − i)

where p is the fraction of known group members to network nodes. When Pr[n, k] is extremely low for an observed k in-group neighbours, then it can be reasoned that the entity must be a member of the community.

Assume the community C and the set of nodes adjacent to the community, B (each of which has at least one neighbour in C). At each step, one or more nodes from B are chosen and agglomerated into C. Then B is updated to include any newly discovered nodes.
This continues until an appropriate stopping criterion is satisfied. When the algorithm begins, C = {s} and B contains the neighbours of s: B = n(s). The Clauset algorithm [2] focuses on nodes inside C that form a "border" with B: each has at least one neighbour in B. Denoting this set Cborder and focusing on incident edges, Clauset defines the following local modularity:

R = Σij Bij [i ∈ C][j ∈ C] / Σij Bij

where Bij is the adjacency matrix comprising only those edges with one or more endpoints in Cborder, and [P] = 1 if proposition P is true, and zero otherwise. Each node in B that can be agglomerated into C will cause a change in R, ∆R, which may be computed efficiently. At each step, the node with the largest ∆R is agglomerated. This modularity R lies in the interval 0 ≤ R ≤ 1 (defining R = 1 when |Cborder| = 0), and local maxima indicate good community separation. For a network of average degree d, the cost to agglomerate |C| nodes is O(|C|^2 d) [6].

The LWP algorithm defines a different local modularity, which is closely related to the idea of a weak community [9]. Define the number of edges internal and external to C as Min and Mout, respectively:

Min = (1/2) Σij Aij [i ∈ C][j ∈ C]        Mout = Σij Aij [i ∈ C][j ∉ C]

The LWP local modularity Mf is then:

Mf = Min / Mout

When Mf > 1/2, C is a weak community. The algorithm consists of agglomerating every node in B that would cause an increase in Mf, ∆Mf > 0, then
removing every node from C whose removal would also lead to ∆Mf > 0, so long as the node's removal does not disconnect the subgraph induced by C. (Removed nodes are not returned to B; they are never re-agglomerated.) Finally, B is updated and the process repeats until a step where the net number of agglomerations is zero. The algorithm returns a community if Mf > 1 and s ∈ C. Similar to the Clauset method [2], the cost of agglomerating |C| nodes is O(|C|^2 d).

A number of approaches evaluate nodes based on the number of neighbours they have in and out of the community, adding nodes to the community when they optimize a function of a specific quantity. Bagrow [1] did this for a measure called outwardness, defined as the degree-normalized difference between neighbours outside and inside the community. The outwardness Ωv(C) of node v ∈ B from community C is:

Ωv(C) = (1/kv) Σ (i ∈ n(v)) ([i ∉ C] − [i ∈ C]) = (kv_out − kv_in) / kv

where n(v) are the neighbours of v. In other words, the outwardness of a node is the number of neighbours outside the community minus the number inside, normalized by the degree. Thus, Ωv has a minimum value of −1 if all neighbours of v are inside C, and a maximum value of 1 − 2/kv, since any v ∈ B must have at least one neighbour in C. Since finding a community corresponds to maximizing its internal edges while minimizing external ones, the node with the smallest Ω is agglomerated at each step, breaking ties at random.

Figure 2.1(a): The community C is surrounded by a boundary of explored nodes B; this exploration implies an additional layer of nodes that are known only through their adjacencies with B. Figure 2.1(b): Two nodes i and j in B, with Ωi = 2/3 and Ωj = −1; moving node j into C gives better community structure than moving i.
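Bagrow's outwardness criterion is straightforward to compute. The sketch below (toy adjacency dict and all names are illustrative) scores the boundary nodes of a community and picks the one with the smallest Ω:

```python
def outwardness(adj, v, community):
    """Ω_v(C): neighbours of v outside C minus neighbours inside C,
    normalized by the degree of v (Bagrow's measure, per the text)."""
    neighbours = adj[v]
    k_out = sum(1 for u in neighbours if u not in community)
    k_in = len(neighbours) - k_out
    return (k_out - k_in) / len(neighbours)

# Toy graph as an adjacency dict (illustrative only).
adj = {
    1: {2, 3}, 2: {1, 3, 7}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}, 7: {2},
}
C = {1, 2, 3}
boundary = {u for v in C for u in adj[v]} - C  # nodes adjacent to C
# Agglomerate the boundary node with the smallest outwardness.
best = min(boundary, key=lambda u: outwardness(adj, u, C))
print(best, outwardness(adj, best, C))  # -> 7 -1.0
```

Node 7 has its only neighbour inside C (Ω = −1), so it is preferred over node 4, which has two of its three neighbours outside C (Ω = 1/3).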
Figure 2.1(a) and (b): Neighbour counting explanation.

2.2 Greedy Algorithm

Greedy algorithm for maximising modularity
Input: graph G = (V, E)
Output: clustering C of G
  C ← singletons
  initialize matrix ∆
  while |C| > 1 do
    find {i, j} such that ∆i,j is the maximum entry in the matrix ∆
    merge clusters i and j
    update ∆
  return clustering with highest modularity
The greedy algorithm starts with the singleton clustering and iteratively merges the two clusters that yield a clustering with the best modularity, i.e., the largest increase (or smallest decrease) is chosen. After n−1 merges, the clustering that achieved the highest modularity is returned. The algorithm maintains a symmetric matrix ∆ with entries ∆i,j := q(Ci,j) − q(C), where C is the current clustering and Ci,j is obtained from C by merging clusters Ci and Cj. Note that there can be several pairs i and j for which ∆i,j is the maximum; in these cases the algorithm selects an arbitrary pair. The pseudo-code for the greedy algorithm is given in Algorithm 1. An efficient implementation using sophisticated data structures requires O(n^2 log n) runtime [1]. Note that n−1 iterations is an upper bound; one can terminate the algorithm once the matrix ∆ contains only non-positive entries. This property is called single-peakedness. Since it is NP-hard to maximize modularity [1] in general graphs, it is unlikely that this greedy algorithm is optimal. In fact, on certain graph families the greedy algorithm has an approximation factor of 2, asymptotically. Furthermore, there are instances where a specific way of breaking ties among merges yields a clustering with modularity 0, while the optimum clustering has a strictly positive score. Modularity is defined such that it takes values in the interval [−1/2, 1] for any graph and any clustering; in particular, the modularity of the trivial clustering placing all vertices into a single cluster is 0. This technical peculiarity can be used to show that the greedy algorithm has an unbounded approximation ratio.
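The merge loop above can be sketched in a few lines. For brevity this version recomputes modularity for every candidate merge instead of maintaining the ∆ matrix, so it is far slower than the O(n² log n) implementation cited; the toy graph and all names are illustrative:

```python
def modularity(edges, comms):
    # Q = sum_i (e_ii - a_i^2) over communities (Newman's definition).
    m = len(edges)
    label = {v: i for i, c in enumerate(comms) for v in c}
    e = [0.0] * len(comms)
    a = [0.0] * len(comms)
    for u, v in edges:
        a[label[u]] += 0.5 / m
        a[label[v]] += 0.5 / m
        if label[u] == label[v]:
            e[label[u]] += 1.0 / m
    return sum(ei - ai * ai for ei, ai in zip(e, a))

def greedy_modularity(nodes, edges):
    comms = [{v} for v in nodes]  # start from the singleton clustering
    best_q, best = modularity(edges, comms), [set(c) for c in comms]
    while len(comms) > 1:
        # Try every pairwise merge and keep the one with highest modularity.
        candidates = []
        for i in range(len(comms)):
            for j in range(i + 1, len(comms)):
                merged = [c for k, c in enumerate(comms) if k not in (i, j)]
                merged.append(comms[i] | comms[j])
                candidates.append((modularity(edges, merged), merged))
        q, comms = max(candidates, key=lambda t: t[0])
        if q > best_q:  # remember the best clustering seen so far
            best_q, best = q, [set(c) for c in comms]
    return best_q, best

# Two triangles joined by a bridge: greedy recovers the natural split.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q, comms = greedy_modularity(range(6), edges)
print(round(q, 4), sorted(sorted(c) for c in comms))
```

On this symmetric example the algorithm finds the partition {0, 1, 2}, {3, 4, 5} with Q ≈ 0.3571; on adversarial instances, as the text notes, tie-breaking can degrade the result badly.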
2.3 Drawbacks of Greedy Algorithm

• Asymptotic growth of the value of the metric implies a strong dependence on the size of the network and the number of modules the network contains [6].
• The resolution limit is a problem where communities below a certain size are merged into larger ones [6]. A classic example where modularity cannot identify small communities is a cycle of m cliques: maximum modularity is obtained when two neighbouring cliques are merged [4].
• Degeneracy of solutions is a problem where a community scoring function (e.g. modularity) admits multiple distinct high-scoring solutions and typically lacks a clear global maximum, thereby requiring tie-breaking [6].

2.4 Drawbacks of PageRank Algorithm

Modularity is a property of the network that measures when a division is good, in the sense that there are many edges within communities and only a few between them. In modularity-based algorithms, each node of the graph is initially considered an individual community, and communities are joined iteratively based on the increase in modularity caused by their joining; the pair producing the maximum change in modularity is joined. There are a few drawbacks associated with modularity-based methods: they require information regarding the entire structure of the network, which is not feasible to determine for vast real-world networks, and modularity optimization methods are not able to determine overlapping communities [1]. In order to detect overlapping communities, clique percolation can be used. Clique percolation is based on the assumption that a community consists of fully connected subgraphs and detects overlapping communities by searching for adjacent cliques. But it is a hard method to implement well, due to the difficulty of producing intermediate representations [4] of percolating structures.
CHAPTER 3
IMPLEMENTATION

3.1 Finding potential members

The main aim is to find all the potential members that can be added to the community. The algorithm considers all the neighbours of the given community; all the neighbours are extracted into the ADJ[] array. The algorithm grows in a dynamic fashion: the member with the highest PageRank is added to the community, then a fresh set of neighbours is extracted into the ADJ[] array taking the newly added member into account, and the same process is repeated. A flag array CHECK_ARR[] is used to check whether a neighbouring member has the same set of interests as the community members.

Figure 3.1(a): Finding potential neighbour step one
Figure 3.1(b): Finding potential neighbour step two

1N: 1-hop neighbour, 2N: 2-hop neighbour, C: Community

Figures 3.1(a) and 3.1(b) are snapshots of a very large graph (the input, a webgraph). The initial set of neighbours is (P, Q, R, S). Assume that PageRank(R) is the highest among all the neighbours, so node R is added to the community. Then the next list of 1N (one-hop) neighbours, (P, Q, T, S), is considered. Thus the algorithm proceeds in a dynamic fashion. At every step it checks whether the neighbour has the same set of interests as the community: if yes, the neighbour node is added to the ADJ[] array, else the algorithm passes on to the next-hop neighbour. Figure 3.2 depicts a flowchart that shows how the expansion algorithm executes to find potential members and add them to the community.
Figure 3.2: Flowchart depicting the algorithm. (Steps: construct the neighbour set of the given community; find the neighbour with maximum PageRank, making sure it has the same set of interests as the community, else consider the next one-hop neighbour; add the neighbour to the community; construct a fresh neighbour set considering the neighbours of the newly added member; repeat until the stopping criterion is met.)

3.2 Improved PageRank

The proposed algorithm is based on the mean value of the PageRanks of all web pages, with performance advantages over the traditional PageRank algorithm. It is a novel approach for reducing the number of iterations performed by the PageRank algorithm to reach the convergence point:

• Initially assume the PageRank of all web pages to be some value, say 1.
• Calculate the PageRank of every page by the formula
PR(A) = 0.15/N + 0.85 (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
o T1 through Tn are pages providing incoming links to page A
o PR(T1) is the PageRank of T1
o PR(Tn) is the PageRank of Tn
o C(Tn) is the total number of outgoing links on Tn
o N is the total number of nodes in the graph.
• Calculate the mean value of all PageRanks:
o mean = (summation of the PageRanks of all web pages) / (number of web pages).
• Then normalize the PageRank of each page:
o NormPR(A) = PR(A) / mean
o where NormPR(A) is the normalized PageRank of page A and PR(A) is the PageRank of page A.
• Assign PR(A) = NormPR(A).
• Repeat steps 2 to 4 until the PageRank values of two consecutive iterations are the same.
o The pages which have the highest PageRank are the more significant pages.

When running the original PageRank algorithm, the individual PageRank values of web pages keep oscillating about their final values. Saturation is reached after a number of iterations, when each value converges to a single value according to a convergence factor. In the proposed improved PageRank algorithm, this oscillation is minimized by normalizing the PageRank values after every iteration, thereby bringing the current value closer to the saturation point each time. The procedure for the execution of the proposed improved PageRank algorithm is depicted as a flowchart in figure 3.3.
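The steps above can be sketched as follows; the tiny web graph, the iteration cap, and the tolerance are illustrative assumptions, and convergence is tested with a small tolerance rather than exact equality of consecutive iterations:

```python
def improved_pagerank(links, d=0.85, tol=1e-10):
    """PageRank with the mean-normalization step described above.
    `links` maps each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    incoming = {v: [u for u in nodes if v in links[u]] for v in nodes}
    pr = {v: 1.0 for v in nodes}  # step 1: start every rank at 1
    for _ in range(1000):         # safety cap on iterations
        # Step 2: standard PageRank update, PR(A) = (1-d)/N + d * sum(PR(T)/C(T)).
        new = {
            v: (1 - d) / n + d * sum(pr[u] / len(links[u]) for u in incoming[v])
            for v in nodes
        }
        # Steps 3-4: normalize every rank by the mean rank.
        mean = sum(new.values()) / n
        new = {v: r / mean for v, r in new.items()}
        done = max(abs(new[v] - pr[v]) for v in nodes) < tol
        pr = new
        if done:
            break
    return pr

# Tiny illustrative web graph: A -> B, C; B -> C; C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = improved_pagerank(links)
print(max(ranks, key=ranks.get))  # -> C (the most linked-to page)
```

Because of the normalization, the returned ranks have mean 1, and the relative ordering of pages matches the intuition that C, which receives links from both A and B, is the most significant page.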
Figure 3.3: Flowchart depicting the Improved PageRank Algorithm

Input:
• Web graph with 50000 vertices (numbered from 0 to 49999) and 50000 edges.
• Nodes that are part of the community (around 100 nodes).
Figure 3.4: Input webgraph (50000 vertices and 50000 edges)
Figure 3.5: Existing community members

Output:
• PageRank of each and every node, calculated using the improved PageRank algorithm.
• Potential members that can be added to the community, decided using the PageRank values obtained above.
Figure 3.6: Calculating PageRank using the existing code.
Figure 3.7: Calculating PageRank using the improved algorithm.
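Putting Sections 3.1 and 3.2 together, the expansion loop itself is small. The sketch below assumes the PageRank scores have already been computed and uses a `same_interest` predicate standing in for the CHECK_ARR[] flag test; the toy graph, scores, and all names are illustrative:

```python
def expand_community(adj, pagerank, community, same_interest, steps=10):
    """Repeatedly add the eligible neighbour with the highest PageRank,
    refreshing the neighbour set after each addition (Section 3.1)."""
    community = set(community)
    for _ in range(steps):
        # ADJ[]: neighbours of the current community that pass the interest check.
        candidates = {u for v in community for u in adj[v]
                      if u not in community and same_interest(u)}
        if not candidates:
            break
        community.add(max(candidates, key=lambda u: pagerank[u]))
    return community

# Toy graph and pre-computed (illustrative) PageRank scores.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 4}, 3: {1, 4}, 4: {2, 3, 5}, 5: {4}}
pagerank = {0: 0.9, 1: 1.2, 2: 1.1, 3: 0.8, 4: 1.3, 5: 0.4}
grown = expand_community(adj, pagerank, {0, 1},
                         same_interest=lambda u: u != 5, steps=2)
print(sorted(grown))  # -> [0, 1, 2, 4]
```

First node 2 is added (highest PageRank among the seed's neighbours), the neighbour set is refreshed, and then node 4 is added; node 5 is never considered because it fails the interest check.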
Figure 3.8: Using the PageRank values obtained above, the potential members that can be added to the community are determined.

c) Performance improvement graphs – Traditional PageRank vs Improved PageRank

Figure 3.9: Graph showing analysis between PageRank and Improved PageRank (number of iterations to find the PageRank vs number of nodes, in millions).
CHAPTER 4
PERFORMANCE ANALYSIS

4.1 Neighbour Counting algorithm

Figure 4.1: Neighbour counting algorithm steps. Per-step values shown in the figure:
Step 1: Q = 0.1152, O(5) = 0.75, O(6) = 0.25
Step 2: Q = 0.1308, O(5) = 0.20, O(8) = 0.00
Step 3: Q = 0.1309, O(5) = −0.167, O(7) = 0.00
Step 4: Q = 0.2051, O(7) = −0.2857
Step 5: Q = 0.2871
Figure 4.1 shows the process of employing the neighbour counting algorithm to expand a given community; the modularity of the new community is calculated at each step. The next possible addition to the community is chosen by the property of outwardness (O) [1]: the immediate neighbour with the least outwardness is the most potent member for the community.

4.2 Greedy algorithm

Figure 4.2: Greedy algorithm steps. Per-step values shown in the figure:
Step 1: Q = 0.1152, O(5) = 0.75, O(6) = 0.25
Step 2: Q = 0.1367, O(6) = 0.20, O(7) = 0.00, O(8) = 0.40
Step 3: Q = 0.1758, O(6) = 0.167, O(8) = 0.00
Step 4: Q = 0.2207, O(6) = −0.143
Step 5: Q = 0.2871
Figure 4.2 depicts the community expanded by the greedy algorithm across iterations. Adding more members to the community is decided based on the modularity value of the attained subgraph; the outwardness (O) of the candidate options is listed.

4.3 PageRank Algorithm

Figure 4.3: Proposed improved PageRank algorithm steps. Per-step values shown in the figure:
Step 1: Q = 0.1152, O(5) = 0.75, O(6) = 0.25
Step 2: Q = 0.1308, O(5) = 0.20, O(8) = 0.00
Step 3: Q = 0.1309, O(5) = −0.167, O(7) = 0.00
Step 4: Q = 0.2480, O(5) = −0.4285
Step 5: Q = 0.2871
  • 36. 28 The network shown in figure 4.3 depicts the expansion process using the proposed improved PageRank algorithm. 4.4 Comparing performances of PageRank and the proposed Improved PageRank The existing PageRank algorithm and the proposed Improved PageRank algorithm were executed to find the PageRank of all the nodes in the previous graph. Each graph from figure 4.4(a) to 4.4(e) represents an iteration of the process. Figure 4.4(a): Graph showing iteration 1 PageRank values Figure 4.4(b): Graph showing iteration 2 PageRank values
  • 37. 29 Figure 4.4(c): Graph showing iteration 3 PageRank values Figure 4.4(d): Graph showing iteration 5 PageRank values
  • 38. 30 Figure 4.4(e): Graph showing iteration 2 PageRank values Figures 4.5 and 4.6 compare the three algorithms on two different characteristics: modularity and outwardness. Figure 4.5: Modularity – comparison between the three algorithms.
  • 39. 31 Figure 4.6: Outwardness – comparison between the three algorithms. 4.5 Inference Modularity is found to increase continually for all three algorithms. Since the greedy structural optimization method adds members to the group so as to maximize modularity, it shows better results than neighbour counting. Neighbour counting proceeds by adding the node with the least outwardness value, so the resulting community has lower outwardness than those produced by greedy structural optimization and the PageRank algorithm. In the longer run, however, the PageRank algorithm yields a community with higher modularity and outwardness comparable to both greedy structural optimization and neighbour counting. This makes the final result more efficient, letting the system add more potent neighbours to the given community. The proposed improved PageRank algorithm reduces the number of iterations taken to reach saturation compared to the traditional PageRank algorithm. This reduces the time taken to propose new members for the community, thereby improving the efficiency of expanding a given community using seed sets. The effect is more pronounced when the data set is of the order of one lakh (100,000) nodes.
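The saturation behaviour discussed above is the convergence of PageRank's power iteration. As a point of reference, a minimal sketch of the traditional algorithm, counting iterations until the rank vector stops changing, might look like the following; the graph, damping factor, tolerance, and the function name `pagerank` are illustrative assumptions, not the thesis implementation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Traditional PageRank by power iteration. out[i] lists the nodes that
// node i links to. Writes the iteration count to *iters if non-null.
std::vector<double> pagerank(const std::vector<std::vector<int>>& out,
                             double alpha, double tol, int* iters) {
    const std::size_t n = out.size();
    std::vector<double> pr(n, 1.0 / n);
    int it = 0;
    double diff = 1.0;
    while (diff > tol && it < 1000) {
        std::vector<double> next(n, 0.0);
        double dangling = 0.0;   // rank held by nodes with no out-links
        for (std::size_t i = 0; i < n; i++) {
            if (out[i].empty()) { dangling += pr[i]; continue; }
            double share = pr[i] / out[i].size();
            for (int j : out[i]) next[j] += share;
        }
        diff = 0.0;
        for (std::size_t i = 0; i < n; i++) {
            // damping plus uniform redistribution of dangling mass
            next[i] = alpha * (next[i] + dangling / n) + (1.0 - alpha) / n;
            diff += std::fabs(next[i] - pr[i]);
        }
        pr = next;
        it++;
    }
    if (iters) *iters = it;
    return pr;
}
```

The improved algorithm in Appendix A aims to lower the iteration count this loop reports before `diff` falls under the convergence tolerance.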
  • 40. 32 CONCLUSION AND FUTURE WORK The seed set expansion problem has its roots in a number of overlapping areas, including the problem of identifying central nodes in social networks [3] and that of finding related and/or important Web pages from an initial set of query results [3]. In particular, the PageRank algorithm broadened from its initial focus on Web search [9] to also include methods for finding nodes “similar” to an initial root, by starting short random walks from the root and seeing which other nodes were likely to be reached [3]. The seed set expansion problem has been gaining visibility as a general-purpose framework for identifying members of a networked community from a small set of initial examples. But subtle trade-offs in the formulation and underlying methods can have a significant impact on how this process works, and in this project several such principles have been identified about the relative power of different expansion heuristics and the structural properties of the initial seed set. The investigations have involved analyses of datasets across diverse domains as well as theoretical trade-offs between different problem formulations. There are a number of interesting directions for further work. First, the power of PageRank-based methods raises the question of whether these are indeed the “right” algorithms for seed set expansion, or whether they should be viewed as proxies for a richer set of probabilistic approaches that could yield strong performance. Second, the damping factor, assumed here to be a constant, can be varied; over different seed sets, varying the damping factor could lead to anomalies and special cases that need to be studied carefully. A richer understanding of the seed sets that lead to the most effective expansions to a larger community could provide useful insights for the application of these methods.
And finally, as noted earlier, nodes in a network tend to belong to multiple communities simultaneously, and a robust way of expanding several overlapping communities together is a natural question for further study.
  • 41. 33 REFERENCES [1] James P. Bagrow. “Evaluating local community methods in networks”. Journal of Statistical Mechanics: Theory and Experiment, pages 15-19, 2008. [2] Aaron Clauset. “Finding local community structure in networks”. Physical Review E, 72(2):026132, 2005. [3] Isabel M. Kloumann and Jon M. Kleinberg. “Community membership identification from small seed sets”. In KDD, 2014. [4] Andrew Mehler and Steven Skiena. “Expanding network communities from representative examples”. ACM Transactions on Knowledge Discovery from Data (TKDD), pages 14-19, 2009. [5] Jaewon Yang and Jure Leskovec. “Defining and evaluating network communities based on ground-truth”. In MDS ’12, page 3. ACM, 2012. [6] J. Leskovec, K. J. Lang, and M. Mahoney. “Empirical comparison of algorithms for network community detection”. In WWW, pages 631-640, New York, USA, 2010. [7] Reid Andersen, Fan Chung, and Kevin Lang. “Local graph partitioning using pagerank vectors”. In Foundations of Computer Science, pages 475-486, 2006. [8] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. “You are who you know: inferring user profiles in online social networks”. In WSDM ’10, pages 251-260. ACM, 2010. [9] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. “The PageRank citation ranking: Bringing order to the web”. Technical report, Stanford InfoLab, 1999.
  • 42. 34 BIBLIOGRAPHY [1] Community Detection in Social Media – www.slideshare.com [2] Empirical Comparison of Algorithms in Network Community Detection– dl.acm.org [3] Louvain Method for Community Detection– perso.uclouvain.be [4] Applications of Community Detection – royalsocietypublishing.org [5] IEEE Xplore – ieeexplore.ieee.org [6] Wikipedia – wikipedia.org [7] SNAP – Stanford Database Collection
  • 44. 36 APPENDIX A SNIPPETS OF CODE

• Improved PageRank algorithm

void Table::pagerank() {
    vector<size_t>::iterator ci;   // current incoming link
    double diff = 1;
    size_t i;
    double sum_pr;        // sum of current pagerank vector elements
    double dangling_pr;   // sum of current pagerank vector elements for dangling nodes
    unsigned long num_iterations = 0;
    vector<double> old_pr;

    size_t num_rows = rows.size();
    if (num_rows == 0) {
        return;
    }

    pr.resize(num_rows);
    pr[0] = 1;

    if (trace) {
        print_pagerank();
    }

    while (diff > convergence && num_iterations < max_iterations) {
        sum_pr = 0;
        dangling_pr = 0;
        for (size_t k = 0; k < pr.size(); k++) {
            double cpr = pr[k];
            sum_pr += cpr;
            if (num_outgoing[k] == 0) {
                dangling_pr += cpr;
            }
        }

        if (num_iterations == 0) {
            old_pr = pr;
        } else {
            // normalize the previous rank vector
            for (i = 0; i < pr.size(); i++) {
                old_pr[i] = pr[i] / sum_pr;
            }
        }
        sum_pr = 1;

        double one_Av = alpha * dangling_pr / num_rows;   // dangling-node share
        double one_Iv = (1 - alpha) * sum_pr / num_rows;  // teleportation share

        diff = 0;
        for (i = 0; i < num_rows; i++) {
            double h = 0.0;
            for (ci = rows[i].begin(); ci != rows[i].end(); ci++) {
                // hyperlink matrix entry: 1/outdegree of the incoming node
                double h_v = (num_outgoing[*ci]) ? 1.0 / num_outgoing[*ci] : 0.0;
                if (num_iterations == 0 && trace) {
                    cout << "h[" << i << "," << *ci << "]=" << h_v << endl;
                }
                h += h_v * old_pr[*ci];
            }
            h *= alpha;
            pr[i] = h + one_Av + one_Iv;
            diff += fabs(pr[i] - old_pr[i]);
        }

        num_iterations++;
        if (trace) {
            cout << num_iterations << ": ";
            print_pagerank();
        }
    }
    cout << "\n.......num_iterations: " << num_iterations << "\n";
}

• Adding potential neighbours

struct pair1 {
    int vertex;
    double pagerank;
};

while (infile_proc1 >> a >> b >> c) {
    if (check_arr[atoi(a)]) {
        list<int>::iterator i;
        for (i = adj[atoi(a)].begin(); i != adj[atoi(a)].end(); ++i) {
            if (!vis[*i] && !check_arr[*i]) {
                v[n].vertex = *i;
                v[n].pagerank = value[*i];
                vis[*i] = 1;
                total_sum += value[*i];
                n++;
            }
        }
    }
}

// order candidates by descending PageRank
for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++) {
        if (v[j].pagerank > v[k].pagerank) {
            pair1 temp = v[j];
            v[j] = v[k];
            v[k] = temp;
        }
    }
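The double loop that orders candidates in the "Adding potential neighbours" snippet is a quadratic swap sort; a sketch of an idiomatic alternative with `std::sort` follows. The `Candidate` struct mirrors the `pair1` struct above but is renamed here, and the function name `rankCandidates` is an assumption for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A candidate neighbour and its PageRank score, mirroring pair1 above.
struct Candidate {
    int vertex;
    double pagerank;
};

// Rank candidates by descending PageRank in O(n log n).
void rankCandidates(std::vector<Candidate>& v) {
    std::sort(v.begin(), v.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.pagerank > b.pagerank;
              });
}
```

The comparator encodes the same ordering as the original loop (highest PageRank first), so the top of the vector holds the strongest candidates to propose to the community.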
  • 46. 38 APPENDIX B GLOSSARY OF TERMS [1] Conductance In graph theory, the conductance of a graph G = (V, E) measures how “well-knit” the graph is: it controls how fast a random walk on G converges to the uniform distribution. The conductance of a graph is often called its Cheeger constant, as the analog of its counterpart in spectral geometry [8]. [2] Modularity Modularity is one measure of the structure of networks or graphs. It was designed to measure the strength of division of a network into modules (also called groups, clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. [3] Ground-Truth Community Generally, after communities are identified in a given network, the essential next step is to interpret them by identifying a common external property [5] that all the members share and around which the community organizes. Thus, the goal of network community detection is to identify sets of nodes with a common (often external/latent/unobserved) property based only on the network connectivity structure. A “common property” can be a common attribute, affiliation, role, or function. A distinction is made between network communities and groups. A community is defined structurally (i.e., a set of nodes extracted by the community detection
  • 47. 39 algorithm), while a group is defined based on nodes sharing a property around which they organize in the network (e.g., belonging to a common interest-based group, or sharing a common affiliation). Using ground-truth communities allows for quantitative and large-scale evaluation [5] and comparison of different community detection methods. This ability represents a significant step forward, as the field can move beyond the current standard of anecdotal evaluation of communities to comprehensive evaluation of the performance of community detection methods. Ground-truth communities are structurally most similar to the communities discovered by the random walk method [5]. [4] Ego Networks Ego networks consist of a focal node (“ego”) and the nodes to whom the ego is directly connected (called “alters”), plus the ties, if any, among the alters. Of course, each alter in an ego network has his/her own ego network, and all ego networks interlock to form the human social network. The denser the ties in an ego network, the stronger they are, and the more insular and homogeneous the ego network. Typical measures: • Homophily • Size • Average strength of ties • Heterogeneity • Density • Composition (e.g., % women, % whites, etc.) • Range: substantively defined as potential access to social resources, and often operationalized as the diversity of alters. Following the weak-ties argument, density is thought of as an inverse measure of range; size and heterogeneity are also seen as measures of range.
  • 48. 40 [5] Outwardness The outwardness of a node is the number of its neighbours outside the community minus the number inside, normalized by its degree [6]. Thus, Ωv has a minimum value of −1 if all neighbours of v are inside C, and a maximum value of 1 − 2/kv, since any v ∈ B (the boundary of C) must have at least one neighbour in C [4]. Since finding a community corresponds to maximizing its internal edges while minimizing external ones, the algorithm agglomerates the node with the smallest Ω at each step, breaking ties at random.
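The definition above can be sketched in code as follows. This is an illustrative example rather than the thesis implementation; the small graph and the names `outwardness` and `bestCandidate` are assumptions made for this sketch.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <set>

// Outwardness of v w.r.t. community C: (neighbours outside C minus
// neighbours inside C) divided by the degree of v.
double outwardness(int v, const std::map<int, std::set<int>>& adj,
                   const std::set<int>& community) {
    const std::set<int>& nbrs = adj.at(v);
    int inside = 0, outside = 0;
    for (int u : nbrs) {
        if (community.count(u)) inside++;
        else outside++;
    }
    return static_cast<double>(outside - inside) / nbrs.size();
}

// One agglomeration step: among nodes adjacent to the community,
// return the one with the smallest outwardness (or -1 if none).
int bestCandidate(const std::map<int, std::set<int>>& adj,
                  const std::set<int>& community) {
    int best = -1;
    double bestOmega = std::numeric_limits<double>::max();
    for (const auto& [v, nbrs] : adj) {
        if (community.count(v)) continue;
        bool onBoundary = false;
        for (int u : nbrs) {
            if (community.count(u)) { onBoundary = true; break; }
        }
        if (!onBoundary) continue;
        double omega = outwardness(v, adj, community);
        if (omega < bestOmega) { bestOmega = omega; best = v; }
    }
    return best;
}
```

On a small graph where node 3 has two links into the community {1, 2} and one link out, node 3 has negative outwardness and is chosen first, matching the least-outwardness rule illustrated in figure 4.1.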