Data mining.pptx

Presented by
Mousumi Dhara
Department of Computer Engineering
Indian Institute of Technology
(Banaras Hindu University)
Varanasi-221005, India
October 2012

Main Contributions of the thesis
 Aspiration Criteria Based Graph Clustering with Greedy Initialization
algorithm, which improves on Restricted Neighbourhood Search Clustering
(RNSC) (2004)
 New Overlapping Community Detection Algorithm, which improves on
Density-based Shrinkage (DenShrink) (2010)
 Applications of the new algorithms to show performance improvement on:
 Synthetic random graphs and LFR (Lancichinetti–Fortunato–Radicchi)
benchmark weighted networks
 Bioinformatics (protein-protein interaction graphs)
 Community detection in social networks
 Overlapping community detection in social networks
 Medical image segmentation

Outline of today’s Presentation
 Introduction
 Brief Review of Literature
 Critical Observations
 Objective of the present Investigation
 Algorithmic Description and Experimentation on Random graphs
 Application of the New Algorithm to Protein-Protein Interaction and
Social Networks
 Application to Medical Image Segmentation
 Concluding remarks and Scope of Future Work
 References

Introduction
 Cluster Analysis is the mathematical study of methods for recognizing natural groups
within a class of entities.
 A cluster comprises a number of similar objects grouped together. The following
definitions of a cluster have been documented:
 “A cluster is a set of entities which are alike, and entities from different clusters are not
alike.”
 “A cluster is an aggregation of points in the test space such that the distance between any
two points in the cluster is less than the distance between any point in the cluster and any
point not in it.”
 “Clusters may be described as connected regions of a multi-dimensional space
containing a relatively high density of points, separated from other such regions by a
region containing a relatively low density of points.”
 Clustering and classification have been useful and active areas of machine learning
research that promise to help us cope with the problem of complex networks.
 Basically, the goal of clustering is to separate a given group of data items (the data set)
into groups called clusters such that items in the same cluster are similar to each other
and dissimilar to the items in other clusters.

Contd..
 Graphs are important and effective mathematical constructs for
modelling relationships and structural information.
 In the case of graph clustering, we want to find a decomposition of
the vertex set into natural subsets that are highly intra-connected
and sparsely inter-connected.
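The idea of "highly intra-connected and sparsely inter-connected" subsets can be made concrete by counting edges. The sketch below is only an illustration: the function name `edge_split` and the toy graph (two triangles joined by one bridge edge) are assumptions, not part of the thesis.

```python
def edge_split(edges, cluster_of):
    """Count intra-cluster and inter-cluster edges of an undirected graph."""
    intra = inter = 0
    for u, v in edges:
        if cluster_of[u] == cluster_of[v]:
            intra += 1
        else:
            inter += 1
    return intra, inter


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
NATURAL = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}    # one cluster per triangle
ARBITRARY = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}  # alternating assignment

print(edge_split(EDGES, NATURAL))    # many intra-cluster edges, few cross edges
print(edge_split(EDGES, ARBITRARY))  # the reverse
```

A good decomposition keeps the first count high and the second count low.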

Contd..
 Graphs are structures formed by a set of vertices
(also called nodes) and a set of edges that are
connections between pairs of vertices.
 Graph clustering is the task of grouping the vertices
of the graph into vertex groups taking into
consideration the edge structure of the graph in
such a way that there should be many edges within
each cluster and relatively few between the
clusters.
 Graph-clustering methods are also studied in the
large research area labelled pattern recognition.

Contd..
[Figure: a random geometric graph with 700 nodes]
A random geometric graph (RGG) is denoted G(n, r), where n is the number of
nodes. The graph is constructed by placing n points uniformly at random on the
unit square (or on the unit disk) and connecting two points if their Euclidean
distance is at most the radius r(n).
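The construction of G(n, r) on the unit square can be sketched in a few lines. The function name `random_geometric_graph` and the `seed` parameter are illustrative assumptions; the quadratic pairwise loop is chosen for clarity, not speed.

```python
import math
import random


def random_geometric_graph(n, r, seed=None):
    """Place n points uniformly at random on the unit square and connect
    every pair whose Euclidean distance is at most r."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= r:
                edges.append((i, j))
    return points, edges


points, edges = random_geometric_graph(700, 0.05, seed=1)
print(len(points), len(edges))
```

For large n a spatial grid or k-d tree would avoid comparing all pairs.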

Contd..
Community Detection Issues
 Graphs representing real systems are not regular in the way that,
e.g., lattices are.
 In a random graph, the distribution of edges among the vertices
is highly homogeneous.
 Real networks are not random graphs: they display large
inhomogeneities, revealing a high level of order and
organization.
 Communities, also called clusters or modules, are groups of
vertices which probably share common properties and/or play
similar roles within the graph.

Contd..
 Identifying modules and their boundaries allows for a classification of
vertices, according to their structural position in the modules.
 So, vertices with a central position in their clusters, i.e. sharing a large
number of edges with the other group partners, may have an important
function of control and stability within the group; vertices lying at the
boundaries between modules play an important role of mediation and
lead the relationships and exchanges between different communities.
 Another important aspect related to community structure is the
hierarchical organization displayed by most networked systems in the
real world.

Review of Literature
 Graph clustering Related Algorithms
 Community Detection Algorithms
 Finding Overlapping community structure Related Algorithms
 Region Adjacency Graph Based Image Segmentation

Contd..
Graph Clustering Algorithms
 A. D. King (2004)
Proposed RNSC, a cost-based clustering method that performs local search
iteratively to find a low-cost clustering efficiently. RNSC is a stochastic
technique that uses the restricted neighbourhood search concept.
 Fred Glover (1995)
Tabu search concept was first proposed by Glover. It is a meta-heuristic, one that guides
local search heuristics. The idea behind it is to allow cost-based local search algorithms
to enter, then leave local minima by preventing the search from retracing its steps and
settling in a local minimum.
 S. M. van Dongen, (2002)
A graph-clustering algorithm (MCL) incorporating the idea of performing a random walk
on the graph to identify the more densely connected subgraphs is presented. MCL is an
efficient clustering method in weighted graphs, based on the prototype of stochastic flow
simulation technique.

Contd..
 Harel and Koren (2001)
The idea of random walks is also used by Harel and Koren, but only for
clustering geometric data.
 Brandes et al. (2003)
Using unweighted graphs, the authors concentrate on indices and algorithms
that focus on the relation between the number of intracluster and
intercluster edges.
 Vempala et al. (2000)
Some indices measuring the quality of a graph clustering are discussed.
Conductance, an index concentrating on the intracluster edges, is introduced,
and a clustering algorithm that repeatedly separates the graph is presented.
 Hartuv and Shamir (2000)
A purely graph-theoretic approach is the recursive minimum cut approach
presented by Hartuv and Shamir, who, among others, proposed a clustering
model based on high cluster connectivity.

Contd..
Community Detection Algorithm
 Newman and Girvan (2004)
Finding and evaluating community structure in networks.
The physics community presented techniques based on centralities and
statistical properties. For example, an algorithm that iteratively prunes edges based on
betweenness centrality was introduced as a clustering technique.
 Clauset et al., (2004)
A related quality measure named modularity was presented. It evaluates the significance of
clustering with respect to the graph structure by considering a random rewiring of the edge
set.
 Nascimento and Eades (2001)
They applied simulated annealing to graph clustering, but the focus of their work was the
integration of user participation in graph clustering algorithms.
 Hoos and Stützle (1999)
Hoos and Stützle compare systematic search algorithms with stochastic local search
algorithms for 3-SAT, the propositional satisfiability problem with three literals in each
clause.
 Derényi et al. (2005)
Clique percolation in random networks.
 Palla et al. (2006)
The critical point of k-clique percolation in the Erdős–Rényi graph.
 P. Pollner et al. (2006)
Preferential attachment of communities.

Clique Percolation Method by Palla et al., 2005
Illustration of the k-clique communities at k = 4.

Contd..
 Vincent D. Blondel et al. (2008)
Fast unfolding of communities in large networks.
 A. Lancichinetti et al. (2009)
Detecting the overlapping and hierarchical community structure in complex networks.
 S. Gregory (2010)
Finding overlapping communities in networks by label propagation.
 C. Lee et al. (2010)
Detecting highly overlapping community structure by greedy clique expansion.
 A. Lancichinetti et al. (2011)
Finding statistically significant communities in networks.
 M. Rosvall, C. Bergstrom (2008)
Maps of random walks on complex networks reveal community structure.
 Jianbin Huang et al. (2011)
Density-based shrinkage for revealing hierarchical and overlapping community structure in
networks.

Contd..
Finding Overlapping community structure Related Algorithms
 Zhihao Wu et al. (2011)
Efficient overlapping community detection in huge real-world networks.
 Huawei Shen et al. (2008)
Detect overlapping and hierarchical community structure in networks.
 Di Jin et al. (2011)
A Markov random walk under constraint for discovering overlapping communities in
complex networks.
 S. Gregory (2007)
An algorithm to find overlapping community structure in networks
 S. Zhang et al. (2007)
Identification of overlapping community structure in complex networks using fuzzy c-
means clustering.
 H.-W. Shen et al. (2009)
Quantifying and identifying the overlapping community structure in networks.

Contd..
Region Adjacency Graph Based Image Segmentation
 Abraham Duarte et al. (2006)
Improving image segmentation quality through effective region merging using a hierarchical
social metaheuristic.
 K. Haris, et al. (1998)
Hybrid Image Segmentation Using Watersheds and Fast Region Merging.
 S.E Hernandez, K.E. Barner (2000)
Joint region merging criteria for watershed-based image segmentation.
 Sarkar et al. (2000)
A simple unsupervised MRF model based image segmentation approach
 M. Sonka et al., (1999)
Image processing, analysis and machine vision.
 Gothandaraman (2004)
Hierarchical image segmentation using the watershed algorithm with a streaming
implementation.
 Duarte et al. (2004)
Top-Down Evolutionary Image Segmentation using a Hierarchical Social Metaheuristic.
 F. Fernandez et al. (2003)
A software pipelining method based on a hierarchical social algorithm.

Critical Observations
Graph clustering Algorithm
 RNSC (Graph Clustering with Restricted Neighbourhood Search):
1. RNSC works on undirected and unweighted graphs.
2. The candidate list is completely regenerated before each move.
3. In the unweighted cost numerator, cross edges and absent in-cluster
edges are both considered equally bad.
4. The cost evaluation is not exact: the induced effects of a move on the
source and target clusters are not considered.
5. The scaled cost computation is complex.
6. The memory requirement for RNSC is O(n^2).
7. The complexity of evaluating a move under the naive cost function is O(n),
which is the size of the restricted neighbourhood of a move M.
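Observation 3 can be illustrated with a small cost function that charges one unit for every cross edge and one unit for every absent in-cluster edge. This is a sketch in the spirit of the unweighted cost numerator, not RNSC's exact definition (which may differ, for example in how it counts per vertex); the name `unweighted_cost` and the toy graph are assumptions.

```python
from itertools import combinations


def unweighted_cost(nodes, edges, cluster_of):
    """Cross edges plus absent in-cluster edges, each counted once."""
    edge_set = {frozenset(e) for e in edges}
    cross = sum(1 for u, v in edges if cluster_of[u] != cluster_of[v])
    absent = sum(
        1
        for u, v in combinations(nodes, 2)
        if cluster_of[u] == cluster_of[v] and frozenset((u, v)) not in edge_set
    )
    return cross + absent


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
NODES = range(6)
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {v: 0 for v in NODES}

print(unweighted_cost(NODES, EDGES, two_clusters))  # only the bridge is penalized
print(unweighted_cost(NODES, EDGES, one_cluster))   # every missing pair is penalized
```

Because both kinds of error cost the same, a single missing edge inside a cluster is penalized exactly as much as a single edge leaving it.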

Critical Observations
 MCL(Markov Clustering):
1. Lack of scalability: MCL is slow, as data mining researchers have
noted before. The Expand step, which involves matrix
multiplication, is very time consuming in the first few iterations, when many
entries in the flow matrix have not yet been pruned, and is the main component
of the overall running time.
2. Fragmentation of output: MCL tends to produce too many
clusters. For example, on the yeast protein-protein interaction network of 4741
nodes, MCL outputs 1416 clusters.
3. The expansion step of MCL has complexity O(n^3).
4. The inflation step has complexity O(n^2).
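The two complexity figures follow directly from the shape of the two steps. Below is a minimal dense, pure-Python sketch of one expansion and one inflation on a column-stochastic flow matrix; the function names, the inflation exponent `r = 2.0` and the 2×2 example matrix are illustrative assumptions, and real MCL implementations use sparse matrices with pruning.

```python
def expand(m):
    """Expansion: square the flow matrix. Three nested loops give O(n^3)."""
    n = len(m)
    return [
        [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(m, r=2.0):
    """Inflation: raise each entry to the power r, then rescale every column
    so that it sums to 1 again. Each entry is touched a constant number of
    times, giving O(n^2)."""
    n = len(m)
    powered = [[m[i][j] ** r for j in range(n)] for i in range(n)]
    col_sums = [sum(powered[i][j] for i in range(n)) for j in range(n)]
    return [
        [powered[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


flow = [[0.75, 0.5],
        [0.25, 0.5]]
print(expand(flow))
print(inflate(flow))
```

Inflation strengthens the larger entries of each column, which is what gradually separates the flow into clusters.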

Contd..
• Multiple cut:
1. The restriction on partition sizes in the Kernighan-Lin
approach can lead to poor performance.
2. The number of clusters k must be chosen, for example by
selecting the value of k that provides the largest
eigengap.
• Local Search Techniques:
1. Local search (Stochastic) techniques cannot enumerate the
clusterings of a graph (which, in any case, are prohibitively large in
number) and they are therefore useless for proving nonexistence of
clusterings with a given property.

Contd..
• Tabu Search:
1. If tabu states are clusterings, then there is some computational
difficulty in checking that a state is not tabu. Of course, it would be
unreasonable to store data on every clustering, as evidenced by the size of
the Bell numbers.
2. While central to the tabu search method, tabus are sometimes
too powerful. They may prohibit attractive moves, even when there is no
danger of cycling, or they may lead to an overall stagnation of the search
process.
3. It is thus necessary to use algorithmic devices that will allow
one to revoke (cancel) tabus. These are called aspiration criteria.

Contd..
 Genetic Algorithms:
1. The application of this algorithm was very slow (e.g. over an hour
for a graph with 153 vertices and 103 directed edges).
2. The implementation was likely not optimized to exploit
properties of the graph clustering problem.
 Simulated Annealing:
1. Simulated annealing is not without its shortcomings. The algorithm
requires many moves to be made, and progress is intentionally made very
slowly. Further, the algorithm is, in its general form, memoryless. That is, it does
not learn from its previous work.

Contd..
 Palla et al.
1. Overlapping communities have received a lot of attention. However,
there is still no consensus about a quantitative definition of the concept
of overlapping community, and most definitions depend on the method
adopted.
2. Intuitively, one would expect that clusters share vertices lying at their
borders, and this idea has inspired most algorithms.
3. However, clusters detected with the Clique Percolation Method often
share central vertices of the clusters, which makes sense in specific
instances, especially in social networks. So, it is still unclear how to
characterize overlapping vertices.
 BGLL
The limitation of the method for the experiments that we performed was the
storage of the network in main memory rather than the computation time.

 Density-based shrinkage for revealing hierarchical and
overlapping community structure in networks
Lack of overlapping community structure formation in large
complex networks.
 Clauset et al. on hierarchical random graphs
1. The information given by a dendrogram may become
redundant and confusing when the graph is large, as there is then a
large number of partitions.
2. The algorithm should allow researchers to analyze even
larger networks, with millions of vertices and tens of millions of
edges, using current computing resources.

Objective of the present Investigation
 To alleviate some of the major problems listed above for both clustering and
community detection algorithms by designing an effective technique, based on an
improved mathematical model that addresses those problems, so that a robust
clustering algorithm is obtained.
 To compare the performance of the proposed algorithm with existing algorithms.
 To apply the designed algorithm to the complex network problems
specified below.
 Random Networks (Erdős–Rényi, Scale-free, Random Geometric Graph)
 LFR benchmark weighted networks
 Bioinformatics (PPI Network)
 Social Networks
 Medical Image Segmentation

Preliminaries and Definitions
Tabu search
 Tabu search (TS) (F. Glover, 1989, 1990), a meta-
heuristic technique, guides a local heuristic search
method in exploring the solution space beyond local
optimality.
 Tabu search rests on the premise that problem solving,
in order to qualify as intelligent, must
incorporate adaptive memory and responsive
exploration.
 The basis for tabu search may be described as follows.
Given a function f(x) to be optimized over a set X,
TS begins in the same way as ordinary local search,
proceeding iteratively from one point (solution) to
another until a chosen termination criterion is
satisfied.

Short Term Memory and its Accompaniments
 Recency-based memory is a short-term memory that keeps track of
the solution attributes that have changed during the recent past.
 Recency-based memory is exploited by assigning a tabu-active
designation to selected attributes that have occurred in solutions
recently visited.
 Solutions that contain tabu-active elements, or particular
combinations of these attributes, are those that become tabu.

Aspiration Levels
 Aspiration levels introduce flexibility into the restrictions
imposed by tabu search. The tabu status of a solution is not
absolute: it can be overruled if certain conditions are met,
expressed in the form of aspiration levels.
 In effect, these aspiration levels provide thresholds of
attractiveness that govern whether a solution may be considered
acceptable in spite of being classified tabu.
 Clearly, a solution better than any previously seen deserves to be
considered acceptable.

Candidate List Strategies
 For situations where the neighbourhood N*(x) is large or its elements are
expensive to evaluate, candidate list strategies are used to
restrict the number of solutions examined on a given iteration.
 Because TS places importance on selecting elements judiciously,
effective rules for generating and evaluating good candidates are
critical to the search process.

What is a Move?
 A move transfers a single vertex from one cluster to another, possibly
emptying a cluster or creating a singleton cluster in the process.
RNSC uses a constant number of clusters, and the clusters may be
empty.
 Therefore, in a graph G = (V, E), the neighbourhood of a clustering C,
denoted N(C), consists of |V|·(N_C − 1) moves.
 N_C is the number of clusters. This is because there are |V|
vertices, each of which can be moved to any cluster that it does not
occupy, i.e. one of N_C − 1 clusters.
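The neighbourhood size can be checked by enumerating the moves directly. In this sketch a move is represented as a `(vertex, target_cluster)` pair and clusters are numbered `0 .. num_clusters - 1`; both conventions, and the function name, are assumptions made for illustration.

```python
def neighbourhood_moves(cluster_of, num_clusters):
    """All single-vertex moves: each vertex to every cluster it does not occupy."""
    return [
        (v, target)
        for v, current in cluster_of.items()
        for target in range(num_clusters)
        if target != current
    ]


clustering = {0: 0, 1: 0, 2: 1, 3: 2}   # |V| = 4 vertices, 3 clusters
moves = neighbourhood_moves(clustering, num_clusters=3)
print(len(moves))                        # |V| * (N_C - 1)
```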

Move Types
 Global Move: A global move aims to improve the clustering by reducing
its cost. It acts like a random move with near-optimal cost change, i.e. close
to the maximum decrease in the current clustering's cost obtainable from an
available move. The change in clustering cost resulting from a move lies in
[1 − |V|, |V| − 1] for both cost functions (naive, scaled).
 Diversification (Random) Move: To avoid getting stuck in a poor local
minimum, some random moves are made during clustering in the hope of
escaping it and reaching a lower minimum.
 One important variant of this move type is shuffling diversification.
In each such move, a random vertex is placed in a random cluster,
where both the vertex and the cluster are selected from a uniform
distribution.
 Intensification Move: Intensification narrows the search for a good move
to a restricted neighbourhood (a recommended subset of the current
clustering's neighbourhood in the search space), allowing only moves within
that subset of the possible moves.
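Shuffling diversification, as described above, can be sketched as a loop of uniformly random reassignments. The function name and the `num_moves` parameter are assumptions; note that in this simple version a vertex may be drawn more than once or land in its current cluster again.

```python
import random


def shuffle_diversify(cluster_of, num_clusters, num_moves, rng=random):
    """Return a copy of the clustering after num_moves random moves.
    Vertex and destination cluster are both drawn uniformly at random."""
    shaken = dict(cluster_of)
    vertices = list(shaken)
    for _ in range(num_moves):
        v = rng.choice(vertices)
        shaken[v] = rng.randrange(num_clusters)
    return shaken


current = {0: 0, 1: 0, 2: 1, 3: 1}
print(shuffle_diversify(current, num_clusters=3, num_moves=2,
                        rng=random.Random(0)))
```

Returning a copy keeps the best clustering found so far untouched while the search explores from the shuffled state.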

Restricted Neighborhood search clustering (RNSC)
 RNSC (King et al., 2004) is a local search meta-heuristic technique that
minimizes the cost of a clustering in the solution space.
 Following Stijn van Dongen, the performance criterion for
clustering unweighted graphs is defined vertex-wise, as the sum of a
coverage measure taken on each vertex.
 In RNSC, a simple integer-valued cost function (called the naive cost
function) is used as a pre-processor to produce an initial clustering of the
graph.
 The resulting low-cost clustering is then refined with a more expressive (but
less efficient) real-valued cost function (called the scaled cost
function).
 The scaled cost function refines the output of the naive cost function,
aiming for the globally optimal solution.

Description of proposed algorithm ACOGCT
 The proposed algorithm builds on the concepts of tabu search.
 It uses some advanced tabu search concepts to design a
more effective technique that produces better
clustering results.
 The proposed algorithm is a modified form of RNSC: it changes
the cost function, the generation and update of the move list, and the
move selection criteria. The following slides give a brief overview
of the algorithm and highlight its key features.

Comparative features of RNSC and proposed
algorithm
 A few features are pointed out here that make the proposed
algorithm an improvement over RNSC.
 Scaled cost evaluation is O(n) in RNSC. It can easily be done
in O(1) time if the information about the current node and its cluster
contribution is pre-computed.
 RNSC might make some very good moves tabu because of its tabu
criteria. In the proposed algorithm, aspiration criteria
allow the tabu status to be overridden (based on the relative
cost of the best non-tabu move).
 RNSC regenerates all possible moves each time, before selecting
and applying the best move.

Comparative features of RNSC and proposed algorithm
 In the proposed algorithm, moves are considered only if the target
node has neighbouring nodes in the destination cluster (moves to an
empty cluster are the only exception to this rule).
 The cost change that RNSC computes for a move is not exact under
either cost scheme: it ignores the effect of the move on nodes
other than the target node.

Features retained in our proposed algorithm
 Short-term memory considerations using Tabu criteria are
actively used.
 As in the case of RNSC, diversification moves are applied when
in the recent past no good solution was found.

Create an initial clustering solution
The proposed algorithm uses a greedy initial clustering instead of a
random clustering.
With this initialization, most nodes are placed so that some of their
neighbours are likely to reside in the same cluster.
 Select the unassigned node with the highest degree.
 Add this node to a new cluster, together with its unassigned
neighbours.
 If any node is still unassigned, go back to the first step.
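The three steps above can be sketched as follows. Two details are not fixed by the slide and are assumptions here: "degree" is taken as the node's total degree in the graph (not its degree among unassigned nodes), and ties are broken by the smallest node id. The adjacency-dict input and the function name are also illustrative.

```python
def greedy_initial_clustering(adj):
    """adj maps each node to an iterable of its neighbours.
    Returns a dict node -> cluster index."""
    cluster_of = {}
    unassigned = set(adj)
    next_cluster = 0
    while unassigned:
        # Step 1: unassigned node of highest degree (first maximum in id order).
        seed = max(sorted(unassigned), key=lambda v: len(adj[v]))
        # Step 2: the seed and its still-unassigned neighbours form a new cluster.
        members = {seed} | (set(adj[seed]) & unassigned)
        for v in members:
            cluster_of[v] = next_cluster
        unassigned -= members
        next_cluster += 1
        # Step 3: the loop repeats while any node is unassigned.
    return cluster_of


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(greedy_initial_clustering(ADJ))
```

On this toy graph node 3 is absorbed into node 2's cluster, which shows why the initial clustering is only a starting point that later moves must correct.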

Generation of Move List
 The move list (candidate list) is a list of neighbourhood
moves, i.e. moves to states that differ in the position of just one
node.
 The proposed algorithm greatly reduces the size of the neighbourhood
by excluding moves that would place a node in a cluster
containing none of its neighbouring nodes (except
for moves to an empty cluster).
 Moves are also excluded for nodes whose total
edge weight is above a certain minimum value and that already
share a good percentage of it with neighbours in their
own cluster.
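A minimal sketch of the first pruning rule is given below: a node is only offered clusters that already contain one of its neighbours, plus an optional empty cluster. The second rule (skipping nodes whose edge weight is mostly inside their own cluster) is left out because the slide gives no threshold values. The function name, the `(node, target_cluster)` move representation and the `empty_cluster` parameter are assumptions.

```python
def candidate_moves(adj, cluster_of, empty_cluster=None):
    """adj maps node -> iterable of neighbours; returns (node, target) pairs."""
    moves = []
    for v, neighbours in adj.items():
        # Clusters holding at least one neighbour of v, excluding v's own.
        targets = {cluster_of[u] for u in neighbours} - {cluster_of[v]}
        if empty_cluster is not None:
            targets.add(empty_cluster)  # the one permitted exception
        moves.extend((v, c) for c in sorted(targets))
    return moves


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
ADJ = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
CLUSTERS = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
print(candidate_moves(ADJ, CLUSTERS))
```

With two clusters the full neighbourhood of this clustering has 6 moves; the pruned list keeps only the moves that can place a node next to a neighbour.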

Move selection
 Move selection is similar to the
technique used in RNSC, where the type of move is decided
based on the previous clustering costs or improvements.
 A diversification move is executed when there has been no
improvement in the best clustering cost over the last
specified interval; otherwise a normal move (in our
case a tabu move) is applied.
 When run, diversification shuffles the current clustering
according to the specified diversification period and
frequency, even if this means a significant increase in the
current cost.
 This helps the search escape any local minimum in which it
might be stuck and explore new possible
clusterings.

Move selection
 If there is no need for diversification, the best move from the
candidate list is selected if it is not on the tabu list (i.e. the target
node was not moved in the recent past).
 The proposed algorithm applies aspiration criteria, whereas
RNSC does not. Aspiration criteria allow a tabu move to be
selected when the best non-tabu move incurs a cost that is
much higher than that of the tabu move.
 The basic idea is that, if the best move is tabu, it is not simply
ignored: it is checked against the best non-tabu move. If the
difference between the cost of the best (tabu) move and the cost of
the best non-tabu move is less than the aspiration level, the non-tabu
move is selected; otherwise the best move is selected.
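The selection rule with an aspiration level can be sketched as below. Candidates are assumed to be `(cost_change, node, target)` tuples where a lower cost change is better, and `tabu` is assumed to be a set of tabu nodes; these representations, the function name, and the fallback when every candidate is tabu are illustrative assumptions.

```python
def select_move(candidates, tabu, aspiration_level):
    """Pick the best move, overriding tabu status only when the tabu move
    beats the best non-tabu move by at least aspiration_level."""
    if not candidates:
        return None
    best = min(candidates)
    if best[1] not in tabu:
        return best
    free = [m for m in candidates if m[1] not in tabu]
    if not free:
        return best  # assumption: with no non-tabu move, take the best move
    best_free = min(free)
    if best_free[0] - best[0] < aspiration_level:
        return best_free  # tabu move is not better by enough: respect the tabu
    return best           # aspiration met: the tabu is revoked


moves = [(-5, 1, 0), (-4, 2, 1), (3, 3, 0)]
print(select_move(moves, tabu={1}, aspiration_level=2))    # gap of 1 is too small
print(select_move(moves, tabu={1}, aspiration_level=0.5))  # gap of 1 is enough
```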

Application of a MOVE
 Each affected node updates, based on the applied move, its number
of edge connections and its total edge
weight shared with neighbouring nodes in its cluster.
 These values later allow the scaled cost associated with the
node to be computed in O(1).
 After the updates on nodes and clusters have been performed, the
tabu list is informed about the changes that have occurred.
 The tabu list then marks the last target node as tabu, with a
duration that depends on the previous tabu duration value
associated with that node.
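A tabu list with per-node durations might look like the sketch below. The slide does not say how the new duration depends on the previous one, so the rule used here (start at `base_tenure`, then grow by one up to `max_tenure` each time the same node becomes tabu again) is purely an assumption, as are the class and method names.

```python
class TabuList:
    """Short-term memory of recently moved nodes."""

    def __init__(self, base_tenure=3, max_tenure=10):
        self.base_tenure = base_tenure
        self.max_tenure = max_tenure
        self.iteration = 0
        self.expires_at = {}   # node -> iteration at which its tabu status ends
        self.last_tenure = {}  # node -> duration used the previous time

    def tick(self):
        """Advance the search by one iteration."""
        self.iteration += 1

    def is_tabu(self, node):
        return self.expires_at.get(node, 0) > self.iteration

    def make_tabu(self, node):
        """Mark node as tabu; the duration depends on its previous duration
        (assumed rule: previous + 1, capped at max_tenure)."""
        previous = self.last_tenure.get(node)
        if previous is None:
            tenure = self.base_tenure
        else:
            tenure = min(previous + 1, self.max_tenure)
        self.last_tenure[node] = tenure
        self.expires_at[node] = self.iteration + tenure
        return tenure


tabu = TabuList(base_tenure=2)
tabu.make_tabu(7)
print(tabu.is_tabu(7))
```

Storing an expiry iteration per node makes the tabu check O(1), and no entries need to be removed as the search advances.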

Check
 If the current cost is less than the best cost, set best
cost equal to current cost and save the current cluster
configuration.

Cost Estimation
 The cost change caused by a move is the sum of the changes
in the cost associated with the nodes of the source and the
destination cluster.
 Direct cost is the change in the cost of the target node (the moving
node) itself.
 Induced cost is the sum of the changes in cost of the nodes belonging
to the source or destination cluster, other than the target node.
 A move consists of two basic steps:
 Remove effect: first, the node moves from the source cluster to an
empty cluster.
 Add effect: second, the node moves from the empty cluster to the
destination cluster.

Logical view of cost changes with move Evaluations
 Let M be the maximum possible edge weight and N_c the size
of cluster c, and let t be the target node.
 Intra-cluster weight = (N_c − 1)·M − (sum of all edge weights of
t within cluster c).
 W_t is the total edge weight of t. W_{c,t} is the total
weight of the edges within cluster c that have t as one of
their vertices.
 E_t is the number of edges of node t.
 E_{c,t} is the number of edges within cluster c that have t as one of
their vertices.
α = (W_t − W_{c,t}) + (M·(N_c − 1) − W_{c,t})
β = E_t + (N_c − 1) − E_{c,t}
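The two quantities can be computed directly from a weighted adjacency structure. The sketch below assumes that t is counted as a member of cluster c (so the cluster has N_c − 1 other nodes) and that the N in the α formula denotes N_c, matching the intra-cluster weight definition above; the function name and the dict-of-dicts graph representation are illustrative.

```python
def alpha_beta(t, cluster, adj, max_weight):
    """adj[u][v] is the weight of edge (u, v); cluster is the set of nodes
    of cluster c, including t. Returns (alpha, beta) for node t."""
    others = set(cluster) - {t}
    n_c = len(others) + 1                                    # N_c
    w_t = sum(adj[t].values())                               # W_t
    e_t = len(adj[t])                                        # E_t
    w_ct = sum(w for u, w in adj[t].items() if u in others)  # W_{c,t}
    e_ct = sum(1 for u in adj[t] if u in others)             # E_{c,t}
    alpha = (w_t - w_ct) + (max_weight * (n_c - 1) - w_ct)
    beta = e_t + (n_c - 1) - e_ct
    return alpha, beta


# Hypothetical weighted neighbourhood of node 2 in cluster {0, 1, 2}.
ADJ_W = {2: {0: 0.5, 1: 1.0, 3: 0.25}}
print(alpha_beta(2, {0, 1, 2}, ADJ_W, max_weight=1.0))
```

The first term of α is the weight t sends outside its cluster, and the second is the weight missing inside the cluster relative to the maximum M per pair.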

Logical view of cost changes with move Evaluations
 Moving a node from one cluster to another also changes the
costs associated with the other nodes in the source and destination
clusters. This in turn affects any move whose source or
destination cluster equals the source or destination cluster of the
last applied move. So for any move there are two associated
costs.
 Direct cost: cost change on the target node "T".
 Induced cost: cost change on the nodes in the source and destination
clusters other than "T".
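The split into direct and induced cost can be checked by brute force on a small graph. The per-node cost used here (cross edges plus absent in-cluster neighbours, unweighted) is an assumed stand-in for the thesis's cost function, and the function names and toy graph are illustrative.

```python
def node_cost(v, adj, cluster_of):
    """Assumed per-node cost: cross edges of v plus cluster mates not adjacent to v."""
    mates = [u for u in cluster_of if u != v and cluster_of[u] == cluster_of[v]]
    cross = sum(1 for u in adj[v] if cluster_of[u] != cluster_of[v])
    absent = sum(1 for u in mates if u not in adj[v])
    return cross + absent


def move_cost_change(t, dest, adj, cluster_of):
    """Return (direct, induced) cost change of moving node t to cluster dest."""
    src = cluster_of[t]
    after = dict(cluster_of)
    after[t] = dest
    direct = node_cost(t, adj, after) - node_cost(t, adj, cluster_of)
    affected = [v for v in cluster_of if v != t and cluster_of[v] in (src, dest)]
    induced = sum(
        node_cost(v, adj, after) - node_cost(v, adj, cluster_of) for v in affected
    )
    return direct, induced


# Hypothetical toy graph: two triangles joined by the bridge edge (2, 3).
ADJ = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
CLUSTERS = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
print(move_cost_change(3, 1, ADJ, CLUSTERS))
```

Here half of the total improvement comes from nodes other than the target, which is exactly the part that is lost when induced cost is ignored.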

Visualization Of BGLL’s Clustering on Email Network

Visualization Of Denshrink-RCONA’s Clustering on
Email Network

Visualization Of BGLL-RCONA’s Clustering on Email
Network

Visualization Of Denshrink-CONA’s Clustering on LFR
Network

Visualization Of BGLL-RCONA’s Clustering on LFR
Network

Concluding Remarks
 The developed algorithm applies to a range of real-world
problems and solves them efficiently.
 In gene ontology analysis, it can help to explore patterns of
enhanced functional homogeneity.
 Biological research will inevitably continue to be hypothesis
driven; however, computational methods such as the present
method, ACOGCT, are likely to become essential for their ability
to methodically recognize areas of significance at a much
lower cost.
 Results on several synthetic and real benchmark power-law
graphs highlight the utility of our approach compared
with RNSC and MCL in terms of robustness and optimality.

Concluding Remarks
 It is shown that the developed algorithm can reliably and sensitively
extract lower-cost clusters from artificially generated random
geometric graphs. The modified algorithm is also relevant
to the real-world situations presented in this work.
 Visualizations of the resulting clusterings, produced by our
developed algorithm and some baseline methods, confirm that the
developed algorithm generates more expressive and significant
clusters than the baseline methods.
 In social network analysis, the proposed algorithm
performs better in terms of cost and NMI than the MCL,
RMCL and MLRMCL algorithms.
 The proposed algorithm can be further extended with a parallel move
technique, which should improve its run time.

Concluding Remarks
 The proposed algorithm offers a significant advantage in medical
image segmentation.
 It produces clusters faster than FCM (fuzzy c-means)
clustering.
 Information about the produced clusters can help in medical
diagnosis, and can be recorded and maintained as statistical
information.
 The proposed algorithm finds clusters in the medical image
deterministically.
 The proposed algorithm can shrink or merge the clusters
produced by the FCM clustering algorithm on a medical image.
 The final clusters are more clinically significant than
FCM's clustering results.
