This presentation was given by Aditya Karnam on 18 April 2018 at UMSL as the thesis defense for the ERBGA dissertation, in partial fulfillment of the requirements for the Master's degree in Computer Science.
4. What is clustering?
“Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.”
- Wikipedia
5. “Meaningful” Clustering
▷ Preserve context - clusters mean what
they should.
▷ Technique involved affects the nature of
clusters.
▷ Clusters - circular?
▷ Scalability
6. Single Nucleotide Polymorphisms (SNPs)
▷ DNA is complex.
▷ DNA is made up of a chain of building blocks called nucleotides (A, C, G, T).
▷ A SNP is a mutation of a single nucleotide
(e.g., a change from a C to a T).
▷ SNPs are genomic markers revealing
susceptibility to complex diseases like
Alzheimer's Disease (AD).
7. Constructing Biological Networks
▷ SNPs are sequenced for AD cases.
▷ There are millions of these markers (SNPs) in a dataset.
▷ Complex diseases result from combinations of markers.
▷ SNPs are tested pairwise for correlation to find associations.
▷ Pairwise testing has had limited success; testing triples or higher-order combinations is intractable.
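To see why higher-order testing becomes intractable, the number of marker combinations can be counted directly. This is a small sketch; the marker count of one million is an assumed figure for illustration (the slides only say "millions"), and `num_combinations` is a hypothetical helper name.

```python
from math import comb


def num_combinations(num_markers: int, order: int) -> int:
    """Number of distinct marker subsets of the given size."""
    return comb(num_markers, order)


n = 1_000_000  # assumed marker count, for illustration only
pairs = num_combinations(n, 2)
triples = num_combinations(n, 3)

# Going from pairs to triples multiplies the work by roughly n / 3.
print(f"pairs:   {pairs:.3e}")
print(f"triples: {triples:.3e}")
```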
8. Constructing Biological Networks
▷ SNP markers are modeled as networks.
▷ Each node represents a marker.
▷ An edge between two nodes (markers) indicates correlation.
▷ The Pearson Correlation Coefficient (PCC) or the Custom Correlation Coefficient (CCC) is used to identify correlated markers. (CCC is used for the AD networks in the results.)
▷ Varying the correlation threshold yields different networks.
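The construction can be sketched as follows. This is an illustration only: it uses PCC rather than CCC (which the slides do not define), the genotype vectors are made-up toy data, and `pearson` and `build_network` are hypothetical names.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def build_network(markers, threshold):
    """Return an edge for every marker pair whose PCC reaches the threshold."""
    edges = []
    for a, b in combinations(sorted(markers), 2):
        if pearson(markers[a], markers[b]) >= threshold:
            edges.append((a, b))
    return edges


# Toy genotype vectors (one value per sample); not real data.
markers = {
    "snp1": [0, 1, 2, 1, 0, 2],
    "snp2": [0, 1, 2, 1, 0, 2],
    "snp3": [2, 1, 0, 1, 2, 0],
    "snp4": [0, 2, 1, 0, 2, 1],
}

strict_network = build_network(markers, 0.9)
loose_network = build_network(markers, 0.0)
```

Lowering the threshold admits more edges, which is how different networks arise from the same markers.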
9. Properties of Biological Networks
▷ Highly clustered, with small inter-node distances.
▷ Some datasets are highly sparse.
▷ Highly dense components are common.
▷ Snel (2002) devised a way to obtain protein networks.
▷ Many singletons and doubletons are observed.
10. Why cluster Biological Networks?
▷ Clustering reveals underlying patterns.
▷ It helps in understanding gene regulation.
▷ It mines useful information from noisy data.
▷ Understanding highly correlated markers helps avoid noise.
11. Motivation
▷ Various clustering techniques.
▷ Which method would yield meaningful
information?
▷ Optimally solving clustering objectives is NP-hard; known exact methods need exponential computation time.
▷ Approximation approaches are necessary for such optimization problems.
12. Clustering - Related Works
▷ DBSCAN (Ester et al., 1996).
○ Algorithm based, unspecified objective.
▷ K-Means Clustering (2010)
○ Objective based, assumes sphericity of clusters.
○ Minimizes distances from each centroid to its assigned nodes.
○ Random centroid initialization leads to inconsistent results.
▷ Modularity (Newman 2006)
○ Objective based, no shape assumption, bias against singletons and doubletons.
○ Maximizes the fraction of intra-cluster edges minus the fraction expected for the same clusters if edges were placed at random.
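As a concrete reference for the modularity objective, here is a minimal sketch of the standard Newman modularity for an undirected, unweighted network, written per community as (intra-cluster edges / m) minus (total degree / 2m) squared. The function name and the toy graph are illustrative and not taken from the thesis.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    intra_edges = defaultdict(int)  # edges with both endpoints in the community
    degree_sum = defaultdict(int)   # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra_edges[cu] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single bridge edge (toy example).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {node: 0 for node in range(6)}

q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
```

Splitting the toy graph at the bridge scores higher than keeping everything in one cluster, which always scores zero.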
13. Genetic Algorithms (GAs)
▷ Randomized search technique.
▷ Typically designed to explore large
solution spaces.
▷ Each solution is encoded as a chromosome, typically a string or a vector of real values.
▷ Involves a population of solutions.
▷ Selection drives the process.
▷ Crossover and Mutation operators for
breeding solutions.
16. An Early Clustering GA
▷ Tasgin and Bingol 2006.
▷ Each individual in the population labels every node with its community.
▷ Labels were generally numbers.
▷ Standard one-point crossover and mutation were used.
▷ Evaluating solutions was the most computationally intensive task.
17. Recent Research
▷ Modularity based Improved GA (MIGA).
○ Developed by Shang, Bai, Jiao, & Jin (2013).
○ Uses the number of classes (clusters) in the network as prior knowledge.
○ Not efficient for completely new datasets where the classes are not known.
○ Fails to act as a generalized solution for directed/undirected networks.
19. Problems - Existing methods
This label-based representation is clearly inefficient: permuting the K cluster labels yields chromosomes that encode the same clustering, so the search space expands by a factor of K!, where K is the number of clusters.
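The K! redundancy can be checked on a small example. In this sketch a chromosome is a tuple of cluster labels, one per node; `equivalent_encodings` and `canonical` are hypothetical helper names.

```python
from itertools import permutations


def canonical(labels):
    """Relabel clusters by order of first appearance (one form per partition)."""
    mapping = {}
    return tuple(mapping.setdefault(label, len(mapping)) for label in labels)


def equivalent_encodings(labels):
    """All label strings obtained by permuting the cluster labels."""
    distinct = sorted(set(labels))
    encodings = set()
    for perm in permutations(distinct):
        relabel = dict(zip(distinct, perm))
        encodings.add(tuple(relabel[label] for label in labels))
    return encodings


labels = (0, 0, 1, 1, 2)  # five nodes, K = 3 clusters
encodings = equivalent_encodings(labels)

# Every encoding describes the same partition of the nodes.
partitions = {canonical(e) for e in encodings}
```

With K = 3 there are 3! = 6 label strings, and all of them collapse to a single partition.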
20. Problems - Existing methods
▷ Existing methods use one-point crossover, which introduces an unfounded bias towards maintaining the linearity of clusters.
▷ The inefficient representation of the solution space introduces unwanted or redundant solutions.
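One reading of the linearity bias is that one-point crossover always passes on contiguous runs of genes, so genes that happen to sit next to each other on the chromosome tend to stay together even though chromosome position carries no meaning for clustering. The sketch below contrasts the two operators on list chromosomes; it is an illustration, not the operators from any of the cited methods.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Child takes a prefix from one parent and the suffix from the other."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def uniform_crossover(parent_a, parent_b, rng):
    """Child takes each gene independently from either parent."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def switches(child):
    """Number of adjacent positions where the source parent changes."""
    return sum(child[i] != child[i + 1] for i in range(len(child) - 1))


rng = random.Random(0)
parent_a = [0] * 10  # genes marked 0 come from parent A
parent_b = [1] * 10  # genes marked 1 come from parent B

child_one_point = one_point_crossover(parent_a, parent_b, rng)
child_uniform = uniform_crossover(parent_a, parent_b, rng)
```

A one-point child always has exactly one switch between parents, whereas a uniform child can switch at any position.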
21. Thesis - Proposal
▷ A Genetic Algorithm for clustering.
▷ Efficient Data representation.
▷ Novel GA Operator.
▷ A method that is flexible with respect to the community detection objective, i.e., the fitness function.
22. Summary
▷ Clustering is an important tool.
▷ Biological networks have unusual
properties.
▷ Recent GA-based clustering methods are inefficient.
▷ Community detection objectives are NP-hard problems.
▷ GAs can be optimized further.
24. Datasets
▷ Experiments were run on 4 benchmark datasets.
○ American College Football Network
○ Zachary’s Karate Club
○ Dolphin Network
○ Books on US Politics
▷ In addition, we used two real-world biological datasets from research work in collaboration with WashU.
○ Named 660k and Omni.
25. Nature of datasets
Dataset     Nodes   Edges
Karate         34      78
Dolphin        62     159
Polbooks      105     441
Football      115     613
660k          962    6672
Omni         2749   57488
26. ERBGA
▷ Generational GA.
▷ Breeding phases/operators
○ Selection
■ Tournament Selection
○ Uniform Crossover
○ Mutation
○ Gene Repair
■ Helps repair solutions based on node degree.
○ Elitism
▷ The algorithm is flexible and accepts any objective.
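Tournament selection, the selection operator named above, can be sketched as follows: draw a pool of individuals at random and keep the fittest. The function signature is hypothetical, and drawing the pool without replacement is an assumption; the tuning parameters give a pool size (TPool) of 7.

```python
import random


def tournament_select(population, fitness, pool_size, rng):
    """Pick pool_size distinct individuals at random and return the fittest."""
    contenders = rng.sample(range(len(population)), pool_size)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


rng = random.Random(42)
population = ["a", "b", "c", "d"]  # stand-ins for chromosomes
fitness = [0.1, 0.9, 0.4, 0.3]     # higher is better

# When the pool covers the whole population, the fittest always wins.
winner_full_pool = tournament_select(population, fitness, 4, rng)
winner_small_pool = tournament_select(population, fitness, 2, rng)
```

Because the fitness values are passed in, the same operator works with any objective, in line with the flexibility claimed for the algorithm.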
27. Tuning Parameters
Parameter                                  Symbol       Value
Number of Generations                      Gensize      1000-5000
Random Population Rate                     Prate        0.85
Population Size                            Psize        250
Elitism Rate                               Erate        0.2
Number of individuals in Tournament Pool   TPool        7
Mutation rate                              Mrate        0.1
Gene Repair Rate                           GRrate       0.1
Gene Repair chance                         GRchance     0.05
Gene Repair Size                           GRSize       |E| * GRrate
Number of GA Islands                       IslandSize   5-25
28. Population Representation
▷ Example network: 9 nodes, 11 edges.
▷ Chromosome bit state
○ ‘0’ - Edge Present
○ ‘1’ - Edge Removed
▷ EdgeList: eid = φ(u, v) = V * u + v, where V is the number of nodes.
▷ φ’(eid) = (eid / V, eid % V), using integer division.
▷ Unique ‘phenotypic’ representation.
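The edge-id mapping and the bit semantics can be sketched as below. The function names are hypothetical, nodes are assumed to be numbered 0 to V-1, and the three-edge list is a made-up fragment rather than the 11-edge example network from the slide.

```python
def edge_id(u, v, num_nodes):
    """phi(u, v) = V * u + v."""
    return num_nodes * u + v


def edge_from_id(eid, num_nodes):
    """Inverse mapping: (eid // V, eid % V)."""
    return divmod(eid, num_nodes)


def kept_edges(edge_ids, chromosome, num_nodes):
    """Edges whose chromosome bit is 0 (edge present)."""
    return [
        edge_from_id(eid, num_nodes)
        for eid, bit in zip(edge_ids, chromosome)
        if bit == 0
    ]


V = 9
edge_ids = [edge_id(0, 1, V), edge_id(1, 2, V), edge_id(2, 3, V)]
chromosome = [0, 1, 0]  # the middle edge is removed
remaining = kept_edges(edge_ids, chromosome, V)
```

Since each bit refers to one specific edge, two different bit strings always describe two different edge sets.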
29. Initialization
▷ Random bit strings of length equal to
number of edges.
▷ Randomness is controlled by the Random Population Rate (Prate).
▷ Prate sets the minimum percentage of 1’s in the chromosome.
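One way to honor the Prate minimum is sketched below. The slides do not say how the remaining bits are drawn, so the fair coin flip for them is an assumption, as is the function name.

```python
import math
import random


def random_chromosome(num_edges, prate, rng):
    """Random bit string with at least ceil(prate * num_edges) ones."""
    min_ones = math.ceil(prate * num_edges)
    chromosome = [0] * num_edges
    # Force the minimum number of 1's (removed edges) at random positions.
    for i in rng.sample(range(num_edges), min_ones):
        chromosome[i] = 1
    # Assumption: every remaining position is set by a fair coin flip.
    for i in range(num_edges):
        if chromosome[i] == 0 and rng.random() < 0.5:
            chromosome[i] = 1
    return chromosome


rng = random.Random(1)
num_edges = 78  # edge count of the Karate network
chromosome = random_chromosome(num_edges, 0.85, rng)
```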
30. Elitism
▷ The fittest solutions of a generation are termed ‘elite’.
▷ These ‘elite’ individuals are cloned into the next generation; they also take part in the selection phase and may undergo mutation and crossover.
▷ The solution quality of the fittest individuals does not decrease from one generation to the next.
▷ The diversity of the remainder of the population is preserved.
▷ The number of elite individuals is Erate * Psize.
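The elite step can be sketched as below, assuming higher fitness is better and chromosomes are lists of bits; `select_elite` is a hypothetical name. With the tuning values Erate = 0.2 and Psize = 250, the elite count is 50.

```python
def select_elite(population, fitness, erate):
    """Return copies of the top erate * len(population) individuals."""
    n_elite = round(erate * len(population))
    ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    # Copy each chromosome so later mutation of the originals cannot alter the clones.
    return [list(population[i]) for i in ranked[:n_elite]]


population = [[0, 0, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 0]]
fitness = [0.1, 0.5, 0.3, 0.9, 0.2]
elite = select_elite(population, fitness, 0.4)  # 0.4 * 5 = 2 individuals

# Changing an original afterwards leaves its clone untouched.
population[3][0] = 0
```

Because the clones are carried over unchanged, the best fitness in the next generation cannot fall below the current best.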
33. Gene Repair
▷ Edges are removed regardless of node properties.
▷ A high-degree node potentially lies inside a cluster.
▷ The Gene Repair operator adds edges back to the high-degree nodes in the network.
▷ This overcomes the problem of a high-degree node ending up in the wrong cluster.
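A possible shape for such an operator is sketched below. The exact rule used in ERBGA is not given in the slides, so the ranking (removed edges ordered by the larger degree of their two endpoints) and the `repair_size` cap are assumptions; the tuning table defines the repair size as |E| * GRrate.

```python
from collections import Counter


def gene_repair(chromosome, edges, repair_size):
    """Restore up to repair_size removed edges, favoring high-degree nodes."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    removed = [i for i, bit in enumerate(chromosome) if bit == 1]
    # Assumption: rank removed edges by the larger degree of their endpoints.
    removed.sort(
        key=lambda i: max(degree[edges[i][0]], degree[edges[i][1]]),
        reverse=True,
    )
    repaired = list(chromosome)
    for i in removed[:repair_size]:
        repaired[i] = 0  # bit 0 means the edge is present again
    return repaired


# Node 0 is a hub with three edges; edge (3, 4) hangs off as a tail.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
chromosome = [1, 1, 1, 1]  # every edge removed
repaired = gene_repair(chromosome, edges, 3)
```

In the toy network the three edges of the hub node are restored first, and the tail edge stays removed.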
35. Islands
▷ We run several independent island populations for more diversity.
▷ This helps mitigate the problem of a bad initial population.
▷ Number of islands: 5-25.
36. Efficiency
▷ Complexity for one population
○ O(|V| lg |V| + |E|)
▷ Implemented in C++. No external libraries used.
▷ To save memory, the chromosomes are held in a 3D bit array.
▷ For Omni, the bit array reduces memory by 87.5%.
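The 87.5% figure is consistent with storing each gene in one bit instead of one byte (1 - 1/8 = 0.875); that baseline is an assumption, since the slides do not state what the bit array is compared against. A minimal packing sketch, not the author's C++ code:

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a bytearray, 8 genes per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i >> 3] |= 1 << (i & 7)
    return packed


def get_bit(packed, i):
    """Read gene i back from the packed array."""
    return (packed[i >> 3] >> (i & 7)) & 1


num_edges = 57488  # edge count of the Omni network
bits = [1 if i % 3 == 0 else 0 for i in range(num_edges)]
packed = pack_bits(bits)

# Saving relative to one byte per gene.
saving = 1 - len(packed) / num_edges
```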
37. 3D Bit Array
▷ Yellow blocks hold ‘1’ and grey blocks hold ‘0’.
▷ The 1st and 2nd layers (Gen_old, Gen_new) swap after each generation.
38. Fitness Function - “Sieve”ing Qs
▷ In-house fitness function.
▷ No bias against singletons and doubletons.
▷ Novel two-step method: BFS and objective optimization.
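The Qs objective itself is not defined in the slides, so only the first step is sketched here, under the assumption that BFS is used to find the connected components of the retained edges and that each component is one cluster. The function name is hypothetical.

```python
from collections import defaultdict, deque


def clusters_by_bfs(num_nodes, kept_edges):
    """Label each node with the index of its connected component."""
    adjacency = defaultdict(list)
    for u, v in kept_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    label = [-1] * num_nodes
    next_label = 0
    for start in range(num_nodes):
        if label[start] != -1:
            continue  # already reached from an earlier start node
        label[start] = next_label
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if label[neighbor] == -1:
                    label[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return label


# Six nodes: a path 0-1-2, a pair 3-4 (doubleton) and node 5 (singleton).
labels = clusters_by_bfs(6, [(0, 1), (1, 2), (3, 4)])
```

The resulting labels can then be handed to whichever objective is being optimized, for example modularity.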
39. Summary
▷ Novel two-level chromosome representation.
▷ 3-D bit storage for chromosomes.
▷ Two objectives optimized: Modularity and Qs.
▷ Uniform crossover eliminates linearity.
▷ The unique Gene Repair operator minimizes damage to solutions.
41. Computation Setup
▷ Experiments were run on a Linux machine with a 2.1 GHz i7 and 8 GB RAM.
▷ Additional Qs trials were performed on the Lewis HPC Cluster at Mizzou.
▷ Experiments were run for 48 hours.
▷ Lewis Cluster node configuration: exclusive node with 20 GB RAM.
47. Key observations
▷ Meaningful and unique chromosomal
representation.
▷ Previous efforts expanded the search space by a factor of K!.
▷ One-to-one mapping from genotype to phenotype.
▷ Uniform crossover breaks linearity.
48. Issues to Tackle - Future
▷ Dense networks can remain a single cluster even after many edges are removed.
○ Football, Books on US Politics, and Omni showed this behaviour.
○ Contextual removal of edges is needed.
○ Gene Repair partially helps.
▷ Mutations make slow progress near optimal solutions.
49. Future Work
▷ Improve the method so that evolution starts early.
▷ Prioritized removal of edges, instead of random removal.
▷ Validate the randomized initialization.
▷ Biased selection that gives elite solutions a higher chance.
▷ Possibly use a combination of selection operators.