Efficient Reduced BIAS
Genetic Algorithm for
Generic Community Detection
Objectives
Aditya Karnam
Outline
● Introduction to Clustering, Biological Networks
● Methods
● Results
● Conclusion
1.
Introduction
Clustering, Biological Networks
What is clustering?
“Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other groups
(clusters). It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including
machine learning, pattern recognition, image analysis, information
retrieval, bioinformatics, data compression, and computer graphics.”
- Wikipedia
“Meaningful” Clustering
▷ Preserve context - clusters mean what
they should.
▷ Technique involved affects the nature of
clusters.
▷ Cluster shape - must clusters be circular?
▷ Scalability
Single Nucleotide Polymorphisms
(SNPs)
▷ DNA is complex.
▷ DNA is made up of a
chain of building
blocks called
nucleotides (A, C, G,
T).
▷ A SNP is a mutation of a single nucleotide
(e.g., a change from a C to a T).
▷ SNPs are genomic markers revealing
susceptibility to complex diseases like
Alzheimer's Disease (AD).
Constructing Biological
Networks
▷ SNPs are sequenced for AD cases.
▷ There are millions of these markers
(SNPs) in a dataset.
▷ Complex diseases result from
combinations of markers.
▷ These SNPs are tested pairwise for
correlation to find associations.
▷ Pairwise testing has had limited success;
testing triples or higher orders is
intractable.
Constructing Biological
Networks
▷ SNP markers are modeled into networks.
▷ Each node represents a marker.
▷ An edge between two nodes (markers)
indicates correlation.
▷ The Pearson Correlation Coefficient (PCC) or
the Custom Correlation Coefficient (CCC)
is used to identify correlated markers.
(CCC is used for the AD networks in the results.)
▷ Varying the correlation threshold yields
different networks.
Properties of Bio Networks
▷ Highly clustered, with small inter-node
distances.
▷ Some datasets are highly sparse.
▷ Highly dense components are
common.
▷ Snel (2002) devised a way to obtain
protein networks.
▷ Many singletons and doubletons are
observed.
Why cluster Biological
Networks?
▷ Clustering reveals underlying patterns.
▷ Helpful in understanding gene
regulations.
▷ Mining useful information from noisy
data.
▷ Being able to understand highly
correlated markers to avoid noise.
Motivation
▷ Various clustering techniques.
▷ Which method would yield meaningful
information?
▷ Optimally solving many clustering objectives
is NP-hard; no polynomial-time exact
algorithm is known.
▷ Approximate (heuristic) approaches are
therefore necessary in practice.
Clustering - Related Works
▷ DBSCAN (Ester et al. 1996).
○ Algorithm based, unspecified objective.
▷ K-Means Clustering (2010).
○ Objective based, assumes sphericity of clusters.
○ Minimizes distances from each centroid to
the points assigned to it.
○ Random centroid initialization leads to
inconsistent results.
▷ Modularity (Newman 2006)
○ Objective based, no shape assumption, bias
against singletons and doubletons.
○ Maximizes the fraction of intra-cluster edges minus
the fraction expected in a random network with
the same node degrees.
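The modularity objective can be made concrete with a short sketch. It uses the standard per-community form Q = sum over communities of (L_c / m) - (d_c / 2m)^2, where L_c is the number of intra-community edges, d_c the total degree of the community and m the number of edges. The function name and the toy graph (two triangles joined by one edge) are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the nodes in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Toy graph (assumption): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {node: "a" for node in range(6)}

q_two = modularity(edges, two_clusters)  # 2 * (3/7 - (7/14)**2) = 5/14
q_one = modularity(edges, one_cluster)   # 7/7 - (14/14)**2 = 0
```

Putting everything in a single cluster always scores 0, so any partition with positive modularity is preferred over it.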
Genetic Algorithms (GAs)
▷ Randomized search technique.
▷ Typically designed to explore large
solution spaces.
▷ Each solution is encoded as a chromosome
(a string or a real-valued vector).
▷ Involves a population of solutions.
▷ Selection drives the process.
▷ Crossover and Mutation operators for
breeding solutions.
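A minimal sketch of a generational GA loop may help fix the vocabulary. It is a generic textbook loop, not the algorithm presented later in this deck; the function name, the parameter values and the toy fitness (count of 1 bits) are assumptions chosen for illustration.

```python
import random

def run_ga(fitness, length, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Evolve bit strings of the given length and return the best one seen."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        next_pop = []
        while len(next_pop) < pop_size:
            # Selection: the fitter of two random individuals becomes a parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: one-point, the textbook default.
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop  # generational: the whole population is replaced
        best = max(pop + [best], key=fitness)
    return best

best = run_ga(fitness=sum, length=16)
```

A steady-state GA would instead replace only a few individuals per iteration rather than the whole population.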
Typical GA Flow
Steady state vs Generational
An Early Clustering GA
▷ Tasgin and Bingol 2006.
▷ Each individual in the population labels
every node with its community.
▷ Labels were generally numbers.
▷ Standard one-point crossover and
mutation were used.
▷ Evaluating solutions was the most
computationally intensive task.
Recent Research
▷ Modularity based Improved GA (MIGA).
○ Developed by Shang, Bai, Jiao, & Jin, 2013
○ Uses the number of clusters (classes) in the
network as prior knowledge.
○ Not efficient on completely new datasets
where the classes are not known.
○ Fails to act as a generalized solution for
directed/undirected networks.
One big problem!
The representation of the solution space was
inefficient.
Problems - Existing methods
This representation is clearly inefficient:
it expands the search space by a factor
of K!, where K is the number of clusters,
because every renaming of the K labels
encodes the same clustering.
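The K! redundancy of label-based chromosomes can be checked on a tiny case. The sketch below enumerates every renaming of the labels of one 5-node, 3-cluster partition and counts how many distinct chromosomes and how many distinct clusterings result. The helper name `canonical` and the example partition are illustrative assumptions.

```python
from itertools import permutations

def canonical(labels):
    """Relabel clusters in order of first appearance so equal partitions match."""
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)

partition = (0, 0, 1, 1, 2)          # 5 nodes, K = 3 clusters
encodings = set()
for perm in permutations(range(3)):  # every renaming of the 3 labels
    encodings.add(tuple(perm[lab] for lab in partition))

distinct_encodings = len(encodings)                            # 3! = 6 chromosomes
distinct_partitions = len({canonical(e) for e in encodings})   # 1 clustering
```

Six different label strings describe one and the same clustering, which is the wasted search space the slide refers to.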
Problems - Existing methods
▷ Existing methods use one-point
crossover, which introduces an unfounded
bias towards maintaining the linearity of
clusters (genes that are adjacent in the
chromosome tend to be inherited together).
▷ Unwanted or redundant solutions are
introduced by the inefficient solution
space representation.
Thesis - Proposal
▷ A Genetic Algorithm for clustering.
▷ Efficient Data representation.
▷ Novel GA Operator.
▷ A method that accepts any community
detection objective as its fitness
function.
Summary
▷ Clustering is an important tool.
▷ Biological networks have unusual
properties.
▷ Recent work on clustering using GAs is
inefficient.
▷ Community detection objectives are
NP-hard problems.
▷ GAs for clustering can be made more efficient.
2.
Methods
Datasets and Efficient Reduced Bias Genetic Algorithm
(ERBGA)
Datasets
▷ Experiments were run on 4
benchmark datasets.
○ American College Football Network
○ Zachary’s Karate Club
○ Dolphin Network
○ Books on US Politics
▷ In addition, we used two real-world
biological datasets from research work
in collaboration with WashU.
○ Named 660k and Omni.
Nature of datasets
Dataset    Nodes   Edges
Karate        34      78
Dolphin       62     159
Polbooks     105     441
Football     115     613
660k         962    6672
Omni        2749   57488
ERBGA
▷ Generational GA.
▷ Breeding phases/operators
○ Selection
■ Tournament Selection
○ Uniform Crossover
○ Mutation
○ Gene Repair
■ Helps repair solutions based on node
degree.
○ Elitism
▷ The algorithm accepts any
objective.
Tuning Parameters
Parameter                                  Symbol       Value
Number of Generations                      Gensize      1000-5000
Random Population Rate                     Prate        0.85
Population Size                            Psize        250
Elitism Rate                               Erate        0.2
Number of individuals in Tournament Pool   TPool        7
Mutation rate                              Mrate        0.1
Gene Repair Rate                           GRrate       0.1
Gene Repair chance                         GRchance     0.05
Gene Repair Size                           GRSize       |E| * GRrate
Number of GA Islands                       IslandSize   5-25
Population Representation
▷ Example network: 9
nodes, 11 edges.
▷ Chromosome bit state
○ ‘0’ - Edge Present
○ ‘1’ - Edge Removed
▷ Edge list: each edge (u, v) gets the ID
eid = φ(u, v) = V * u + v, where V is the number of nodes.
▷ The inverse is φ’(eid) = (eid / V, eid % V), using integer division.
▷ Unique ‘phenotypic’
representation.
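The edge-ID mapping and the bit-state convention can be sketched as follows. The two formulas are the ones on the slide; the function names and the concrete 9-node, 11-edge edge list are assumptions, since the slide's example figure is not reproduced in this transcript.

```python
def edge_id(u, v, num_nodes):
    """phi(u, v) = V * u + v."""
    return num_nodes * u + v

def edge_endpoints(eid, num_nodes):
    """phi'(eid) = (eid / V, eid % V) with integer division."""
    return divmod(eid, num_nodes)

V = 9
# Hypothetical 9-node, 11-edge network standing in for the slide's figure.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5),
         (4, 5), (5, 6), (6, 7), (6, 8), (7, 8)]
eids = [edge_id(u, v, V) for u, v in edges]

# One bit per edge: 0 keeps the edge, 1 removes it.
chromosome = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # removes (2, 3) and (5, 6)
kept = [edge_endpoints(e, V) for e, bit in zip(eids, chromosome) if bit == 0]
```

The chromosome length equals the number of edges, so no cluster labels appear in the encoding at all.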
Initialization
▷ Random bit strings of length equal to
number of edges.
▷ Randomness is controlled by Random
Population Rate (Prate)
▷ Prate controls the minimum percentage
of 1’s in the chromosome.
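One way to honour a minimum share of 1 bits is sketched below. The slides do not say how the minimum is enforced, so the approach here (draw the number of 1s uniformly between the minimum and the chromosome length, then place them at random positions) is an assumption, as are the function name and the seed. The sizes are the Karate edge count and the Psize and Prate values from the parameter table.

```python
import math
import random

def random_chromosome(num_edges, prate, rng):
    """Random bit string with at least ceil(prate * num_edges) ones (assumed scheme)."""
    min_ones = math.ceil(prate * num_edges)
    ones = rng.randint(min_ones, num_edges)            # how many edges to remove
    removed = set(rng.sample(range(num_edges), ones))  # which edges to remove
    return [1 if i in removed else 0 for i in range(num_edges)]

rng = random.Random(42)
population = [random_chromosome(num_edges=78, prate=0.85, rng=rng) for _ in range(250)]
```

With Prate = 0.85 every initial individual removes at least 67 of the 78 Karate edges, so evolution starts from heavily fragmented networks.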
Elitism
▷ The fittest solutions of a generation are termed
‘elite’.
▷ These ‘elite’ individuals are cloned into the next
generation. They are also included in the selection
phase and may undergo mutation and
crossover.
▷ The solution quality of the fittest individuals does not
decrease from one generation to the next,
while the diversity of the remainder of the
population is preserved.
▷ The number of elite individuals is Erate *
Psize.
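A minimal sketch of elite cloning, assuming individuals are lists of bits and fitness is a plain function. The function name, the rounding of Erate * Psize and the toy population are illustrative assumptions.

```python
def clone_elites(population, fitness, erate):
    """Return copies of the best round(erate * len(population)) individuals."""
    count = round(erate * len(population))
    ranked = sorted(population, key=fitness, reverse=True)
    return [list(individual) for individual in ranked[:count]]

# Toy population of 10 individuals; fitness is the number of 1 bits.
population = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0],
              [0, 0, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0], [1, 0, 0]]
elite = clone_elites(population, fitness=sum, erate=0.2)
```

Because the elites are copies, the originals remain in the population for selection, crossover and mutation, while the copies carry the best fitness into the next generation unchanged.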
Tournament Selection
The example shows a tournament of size 3.
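Tournament selection can be sketched in a few lines: draw a random pool from the population and return its fittest member. The function name and the toy population are assumptions; a pool size of 3 matches the slide's example, while the parameter table lists TPool = 7.

```python
import random

def tournament_select(population, fitness, pool_size, rng):
    """Pick pool_size distinct individuals at random and return the fittest."""
    contenders = rng.sample(population, pool_size)
    return max(contenders, key=fitness)

rng = random.Random(1)
population = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
winner = tournament_select(population, fitness=sum, pool_size=3, rng=rng)
champion = tournament_select(population, fitness=sum, pool_size=4, rng=rng)
```

A larger pool increases selection pressure: when the pool covers the whole population, the fittest individual always wins.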
Uniform Crossover
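The slide for this operator was a figure, so here is a minimal sketch of uniform crossover on bit-string chromosomes: every gene is taken from one parent or the other with equal probability, independently of its position. The function name and the 50/50 split are assumptions.

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Choose each gene independently from either parent with probability 0.5."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

rng = random.Random(7)
child = uniform_crossover([0] * 8, [1] * 8, rng)
same = uniform_crossover([1, 0, 1], [1, 0, 1], rng)
```

Unlike one-point crossover, no contiguous block of genes is favoured, so the order in which edges happen to be listed does not influence which edges are inherited together.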
Gene Repair
▷ Edges are removed regardless of the
properties of their endpoints.
▷ A high-degree node most likely lies inside a
cluster.
▷ The Gene Repair operator adds edges back
to the high-degree nodes in the network.
▷ This reduces the chance of a high-degree
node ending up in the wrong cluster.
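A hedged sketch of what a degree-based repair step could look like. The slides only state that edges are added back to high-degree nodes, so the ranking rule used here (removed edges ordered by the larger degree of their two endpoints), the function name and the toy graph are all assumptions. In a full run the caller would presumably apply the step with probability GRchance and a repair size of |E| * GRrate.

```python
from collections import Counter

def gene_repair(chromosome, edges, repair_size):
    """Restore up to repair_size removed edges, favouring high-degree endpoints."""
    degree = Counter()
    for u, v in edges:  # degrees in the original network
        degree[u] += 1
        degree[v] += 1
    removed = [i for i, bit in enumerate(chromosome) if bit == 1]
    # Rank removed edges by the larger endpoint degree, highest first.
    removed.sort(key=lambda i: max(degree[edges[i][0]], degree[edges[i][1]]),
                 reverse=True)
    repaired = list(chromosome)
    for i in removed[:repair_size]:
        repaired[i] = 0  # 0 means the edge is present again
    return repaired

# Toy graph (assumption): hub node 0 plus a pendant edge (3, 4).
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
chromosome = [1, 1, 1, 1]  # every edge removed
repaired = gene_repair(chromosome, edges, repair_size=2)
```

Two edges of the hub are restored before the low-degree pendant edge is considered.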
Potential Problem - Gene Repair
Islands
▷ We run several independent island
populations for more diversity.
▷ This helps mitigate the problem of a bad
initial population.
▷ Number of islands: 5-25.
Efficiency
▷ Complexity for one population
○ O(|V| lg |V| + |E|)
▷ Implemented in C++. No external
libraries used.
▷ Memory optimization: chromosomes are
held in a 3D bit array.
▷ For Omni, the bit array reduces memory by
87.5%.
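The 87.5% figure is what one gets when each gene shrinks from one byte to one bit, since 1 - 1/8 = 0.875. The sketch below checks that arithmetic for the Omni edge count; the one-byte-per-gene baseline is an assumption, as the slides do not state what the bit array is compared against.

```python
def packed_size_bytes(num_genes):
    """Bytes needed when every gene occupies a single bit."""
    return (num_genes + 7) // 8

omni_edges = 57488                       # from the dataset table
unpacked = omni_edges * 1                # assumed baseline: one byte per gene
packed = packed_size_bytes(omni_edges)   # one bit per gene
saving = 1 - packed / unpacked
```

Under that assumption one Omni chromosome takes 7186 bytes instead of 57488, a saving of 0.875.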
3D Bit Array
▷ Yellow blocks hold ‘1’
and grey blocks hold
‘0’.
▷ The 1st and 2nd layers (Genold, Gennew)
swap roles after
each generation.
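The layer flip can be sketched as a double buffer of bit-packed chromosomes. The dimension order (layer, individual, gene), the class name and the method names are assumptions made for illustration; the slides only state that the two layers flip after each generation.

```python
class GenerationBuffer:
    """Two layers of bit-packed chromosomes: layer x individual x gene."""

    def __init__(self, pop_size, num_genes):
        row_bytes = (num_genes + 7) // 8
        self.layers = [[bytearray(row_bytes) for _ in range(pop_size)]
                       for _ in range(2)]
        self.old = 0  # which layer currently plays the role of Gen_old

    def get_old(self, individual, gene):
        row = self.layers[self.old][individual]
        return (row[gene >> 3] >> (gene & 7)) & 1

    def set_new(self, individual, gene, bit):
        row = self.layers[self.old ^ 1][individual]
        if bit:
            row[gene >> 3] |= 1 << (gene & 7)
        else:
            row[gene >> 3] &= ~(1 << (gene & 7)) & 0xFF

    def flip(self):
        """Swap the roles of the two layers; no chromosome is copied."""
        self.old ^= 1

buf = GenerationBuffer(pop_size=4, num_genes=11)
buf.set_new(individual=2, gene=9, bit=1)  # write an offspring gene
before_flip = buf.get_old(2, 9)           # the old layer is still all zeros
buf.flip()
after_flip = buf.get_old(2, 9)            # the offspring layer is now Gen_old
```

Flipping an index instead of copying the population keeps the per-generation overhead constant.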
Fitness Function - “Sieve”ing
Qs
▷ In-house fitness
function.
▷ No bias against
singletons and
doubletons.
▷ Novel two-step method:
BFS, then objective
optimization.
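The BFS step can be sketched as follows: keep the edges whose bit is 0, then label connected components with a breadth-first search, so each component becomes one cluster. This covers only the decoding step; the Qs objective itself is not defined in the slides and is not reproduced here. The function name and the toy graph are assumptions.

```python
from collections import deque

def clusters_from_chromosome(num_nodes, edges, chromosome):
    """Label each node with the connected component of the retained edges."""
    adjacency = [[] for _ in range(num_nodes)]
    for (u, v), bit in zip(edges, chromosome):
        if bit == 0:  # 0 means the edge is present
            adjacency[u].append(v)
            adjacency[v].append(u)
    label = [-1] * num_nodes
    next_label = 0
    for start in range(num_nodes):
        if label[start] != -1:
            continue
        label[start] = next_label
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if label[neighbour] == -1:
                    label[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1
    return label

# Toy graph (assumption): two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = clusters_from_chromosome(6, edges, [0, 0, 0, 0, 0, 0, 1])
```

The resulting label list can then be scored by any objective, for example modularity, which is what makes the fitness function pluggable.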
Summary
▷ Novel two-level chromosome
representation.
▷ 3D bit storage for chromosomes.
▷ Two objectives optimized: modularity
and Qs.
▷ Uniform crossover eliminates the
linearity bias.
▷ Unique Gene Repair operator limits the
damage done by random edge removal.
3.
Results
ERBGA - Compared to Best Results
Computation Setup
▷ Experiments were run on a Linux machine
with a 2.1 GHz i7 and 8 GB RAM.
▷ Additional Qs trials were performed on
the Lewis HPC Cluster at Mizzou.
▷ Experiments were run for 48 hours.
▷ Lewis Cluster node configuration:
exclusive node with 20 GB RAM.
Results
Network    BKR     MAGA-Net   ERBGA
Karate     0.420   0.419      0.420
Dolphin    0.529   0.529      0.445
Polbooks   0.527   0.527      0.256
Football   0.605   0.605      0.073

Dataset    Nodes    Edges    Memory Consumed (Megabytes)
Karate        34       78    0.49
Dolphin       62      159    0.69
Polbooks     105      441    1.4
Football     115      613    1.8
660k         962     6672    16.1
Enron      78849   286379    641
[Figure: Modularity - Fitness]
[Figure: Qs - Fitness]
[Figure: Modularity vs Qs - Karate (panels: Qs, Modularity)]
4.
Conclusion
Inferences
Key observations
▷ Meaningful and unique chromosomal
representation.
▷ Previous efforts expanded the search
space by a factor of K!.
▷ One-to-one mapping from
genotype to phenotype.
▷ Uniform crossover breaks the linearity bias.
Issues to Tackle - Future
▷ Dense networks can remain a single cluster
even after many edges are removed.
○ Football, Books on US Politics, and Omni showed
this behaviour.
○ Contextual removal of edges is needed.
○ Gene Repair partially helps.
▷ Mutations make slow progress near optimal
solutions.
Future Work
▷ Improve the method so that evolution starts early.
▷ Prioritized removal of edges instead
of random removal.
▷ Validate the randomized initialization.
▷ Biased selection that gives elite
solutions more chance.
▷ Possibly use a combination of selection
operators.
Thanks!
Any questions?
Aditya Karnam
agtk4@mail.umsl.edu
Thanks to my Advisor: Dr. Sharlee Climer
Friend: Michael Chan.
Efficient Reduced BIAS Genetic Algorithm for Generic Community Detection Objectives

  • 1. Efficient Reduced BIAS Genetic Algorithm for Generic Community Detection Objectives Aditya Karnam
  • 2. Outline ● Introduction to Clustering, Biological Networks ● Methods ● Results ● Conclusion 2
  • 4. “Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. -wiki 4 What is clustering?
  • 5. “Meaningful” Clustering ▷ Preserve context - clusters mean what they should. ▷ Technique involved affects the nature of clusters. ▷ Clusters - circular? ▷ Scalability 5
  • 6. Single Nucleotide Polymorphisms (SNPs) ▷ DNA is complex. ▷ DNA is made up of a chain of building blocks called nucleotides (A, C, G, T). 6 ▷ A SNP is a mutation of a single nucleotide (e.g., a change from a C to a T). ▷ SNPs are genomic markers revealing susceptibility to complex diseases like Alzheimer's Disease (AD).
  • 7. Constructing Biological Networks ▷ SNPs are sequenced for AD cases. ▷ There are millions of these markers (SNPs) in a dataset. ▷ Complex diseases are a result of combination of markers. ▷ These SNPs are tested for pairwise correlation for associations. ▷ Pairwise testing had limited success, testing trio or higher order is intractable. 7
  • 8. Constructing Biological Networks ▷ SNP markers are modeled into networks. ▷ Each node represents a marker. ▷ Edge between two nodes/markers shows correlation. ▷ Pearson Correlation Coefficient (PCC) or Custom Correlation Coefficient (CCC) are used to identify correlated markers. (CCC is used for AD networks in results) ▷ Varying the correlation threshold yields different networks. 8
  • 9. Properties of Bio Networks ▷ Highly clustered, small inter node distances. ▷ Some datasets are highly sparse. ▷ Common to see highly dense components. ▷ Snel (2002) devised way to obtain protein networks. ▷ Many singleton and doubletons observed. 9
  • 10. Why cluster Biological Networks? ▷ Clustering reveals underlying patterns. ▷ Helpful in understanding gene regulations. ▷ Mining useful information from noisy data. ▷ Being able to understand highly correlated markers to avoid noise. 10
  • 11. Motivation ▷ Various clustering techniques. ▷ Which method would yield meaningful information? ▷ Optimally solving Clustering objectives requires exponential computation time. (NP-Hard) ▷ Approximation approaches are necessary for optimization problems. 11
  • 12. Clustering - Related Works ▷ DBSCAN (Martin et. al 1996). ○ Algorithm based, unspecified objective ▷ K-Means Clustering(2010) ○ Objective based, assumes sphericity of clusters ○ Minimize distances from centroid to neighbouring nodes. ○ Random centroid assignment results into inconsistent results. ▷ Modularity (Newman 2006) ○ Objective based, no shape assumption, bias against singletons and doubletons. ○ Maximize intra cluster edges minus possible inter cluster edges from corresponding cluster. 12
  • 13. Genetic Algorithms (GAs) ▷ Randomized search technique. ▷ Typically designed to explore large solution spaces. ▷ Each solution is represented as string/real value, as chromosomes. ▷ Involves a population of solutions. ▷ Selection drives the process. ▷ Crossover and Mutation operators for breeding solutions. 13
  • 15. Steady state vs Generational 15
  • 16. An Early Clustering GA ▷ Tasgin and Bingol 2006. ▷ Population comprised of nodes being labelled to corresponding communities. ▷ Labels were generally numbers. ▷ Standard One point crossover and mutation used. ▷ Evaluating solutions was the most intensive task. 16
  • 17. Recent Research ▷ Modularity based Improved GA (MIGA). ○ Developed by Shang, Bai, Jiao, & Jin, 2013 ○ Uses prior information on number of clusters. ○ Number of classes in the network used as prior knowledge. ○ Not efficient with completely new datasets where classes are not known. ○ Fails to act as a generalized solution for directed/undirected networks. 17
  • 18. One big problem! Representation of Solution space was inefficient. 18
  • 19. Problems - Existing methods This representation clearly is not efficient since it expands the search space by an order of K!, where K is the number of clusters. 19
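The K! factor can be checked on a small case. The sketch below assumes the node-label encoding described earlier (each node carries a cluster number) and enumerates every relabeling of one fixed partition; the helper `canonical` and the example vector `base` are hypothetical names used only here.

```python
from itertools import permutations
from math import factorial

def canonical(labels):
    """Renumber clusters by order of first appearance so equal partitions compare equal."""
    seen = {}
    return tuple(seen.setdefault(lab, len(seen)) for lab in labels)

base = (0, 0, 1, 1, 2, 2)   # 6 nodes grouped into K = 3 clusters
k = len(set(base))

# Every permutation of the K label names gives a different chromosome...
relabelings = {tuple(p[lab] for lab in base) for p in permutations(range(k))}
# ...but all of them describe the same grouping of nodes.
partitions = {canonical(r) for r in relabelings}

print(len(relabelings), "label vectors,", len(partitions), "distinct partition")
```

With K = 3 there are 3! = 6 chromosomes for a single clustering, so a search over label vectors revisits the same solution many times.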
  • 20. Problems - Existing methods ▷ Existing methods use One point Crossover which introduces unfounded bias towards maintaining the linearity of clusters. ▷ Unwanted or redundant solutions are introduced due to inefficient solution space representation. 20
  • 21. Thesis - Proposal ▷ A Genetic Algorithm for clustering. ▷ Efficient data representation. ▷ Novel GA operator. ▷ A method that is flexible with respect to the community detection objective, that is, the fitness function. 21
  • 22. Summary ▷ Clustering is an important tool. ▷ Biological networks have unusual properties. ▷ Recent work on clustering using GAs is inefficient. ▷ Community detection objectives are NP-Hard problems. ▷ GAs could be more optimized. 22
  • 23. 2. Methods Datasets and Efficient Reduced Bias Genetic Algorithm (ERBGA) 23
  • 24. Datasets ▷ Experiments were run on 4 benchmark datasets. ○ Football Network ○ Zachary’s Karate Club ○ Dolphin Network ○ Books on US Politics ▷ In addition, we used two real-world biological datasets from research work in collaboration with WashU. ○ Named 660k and Omni. 24
  • 25. Nature of datasets. Dataset (Nodes, Edges): Karate (34, 78); Dolphin (62, 159); Polbooks (105, 441); Football (115, 613); 660k (962, 6672); Omni (2749, 57488). 25
  • 26. ERBGA ▷ Generational GA. ▷ Breeding phases/operators ○ Selection ■ Tournament Selection ○ Uniform Crossover ○ Mutation ○ Gene Repair ■ Help repair the solutions based on node degree. ○ Elitism ▷ Algorithm is flexible with accepting any objective. 26
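A minimal sketch of the crossover and mutation operators on bit-string chromosomes follows, with one-point crossover included for contrast. It assumes the mutation rate is applied per gene, which the slides do not state; the function names are illustrative and not taken from the author's C++ code.

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Copy each gene from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def one_point_crossover(parent_a, parent_b, rng):
    """Prefix from parent_a, suffix from parent_b: neighbouring genes stay together."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(chromosome, m_rate, rng):
    """Flip each bit independently with probability m_rate (assumed per-gene rate)."""
    return [1 - g if rng.random() < m_rate else g for g in chromosome]

rng = random.Random(0)
zeros, ones = [0] * 12, [1] * 12
child_uniform = uniform_crossover(zeros, ones, rng)
child_one_point = one_point_crossover(zeros, ones, rng)
```

One-point crossover always yields a block of genes from one parent followed by a block from the other, so the arbitrary ordering of edges in the chromosome influences which genes travel together. Uniform crossover removes that dependence.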
  • 27. Tuning Parameters. Parameter (Symbol) = Value: Number of Generations (Gensize) = 1000-5000; Random Population Rate (Prate) = 0.85; Population Size (Psize) = 250; Elitism Rate (Erate) = 0.2; Number of Individuals in Tournament Pool (TPool) = 7; Mutation Rate (Mrate) = 0.1; Gene Repair Rate (GRrate) = 0.1; Gene Repair Chance (GRchance) = 0.05; Gene Repair Size (GRSize) = |E| * GRrate; Number of GA Islands (IslandSize) = 5-25. 27
  • 28. Population Representation ▷ Example Network - 9 nodes, 11 edges. ▷ Chromosome bit state ○ ‘0’ - Edge Present ○ ‘1’ - Edge Removed ▷ Edge list encoding: eid = φ(u, v) = V * u + v, where V is the number of nodes. ▷ Inverse: φ’(eid) = (eid / V, eid % V), using integer division. ▷ Unique ‘phenotypic’ representation. 28
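The edge-id mapping and the bit-string chromosome can be sketched as follows. The `encode` and `decode` functions follow the formulas on the slide; reading clusters off as connected components of the remaining edges is an assumption made here for illustration, and the 6-node graph is a made-up example rather than the 9-node network from the talk.

```python
from collections import deque

def encode(u, v, num_nodes):
    """eid = V * u + v."""
    return num_nodes * u + v

def decode(eid, num_nodes):
    """Inverse mapping (eid / V, eid % V) with integer division."""
    return eid // num_nodes, eid % num_nodes

def clusters_from_chromosome(edge_ids, chromosome, num_nodes):
    """Keep edges whose bit is 0, then label connected components by BFS."""
    adj = {n: [] for n in range(num_nodes)}
    for eid, bit in zip(edge_ids, chromosome):
        if bit == 0:                      # '0' means the edge is present
            u, v = decode(eid, num_nodes)
            adj[u].append(v)
            adj[v].append(u)
    label, next_label = {}, 0
    for start in range(num_nodes):
        if start in label:
            continue
        label[start] = next_label
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in label:
                    label[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return [label[n] for n in range(num_nodes)]

V = 6
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
edge_ids = [encode(u, v, V) for u, v in toy_edges]
chromosome = [0, 0, 0, 1, 0, 0, 0]        # only the bridge (2, 3) is removed
clusters = clusters_from_chromosome(edge_ids, chromosome, V)
```

Because cluster labels are derived from the graph rather than stored in the chromosome, two chromosomes can no longer differ only by a renaming of clusters.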
  • 29. Initialization ▷ Random bit strings of length equal to number of edges. ▷ Randomness is controlled by Random Population Rate (Prate) ▷ Prate controls the minimum percentage of 1’s in the chromosome. 29
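One way to read the initialization rule is sketched below. It assumes the number of 1's in each chromosome is drawn uniformly between Prate * |E| and |E|; the slides only say that Prate sets the minimum share of 1's, so the sampling scheme and the name `random_chromosome` are guesses. The sizes used (78 edges, 250 individuals, Prate = 0.85) come from the Karate row and the tuning parameters.

```python
import math
import random

def random_chromosome(num_edges, p_rate, rng):
    """Random bit string with at least ceil(p_rate * num_edges) ones (removed edges)."""
    min_ones = math.ceil(p_rate * num_edges)
    n_ones = rng.randint(min_ones, num_edges)          # assumed: uniform above the minimum
    ones = set(rng.sample(range(num_edges), n_ones))   # positions of removed edges
    return [1 if i in ones else 0 for i in range(num_edges)]

rng = random.Random(0)
population = [random_chromosome(78, 0.85, rng) for _ in range(250)]
```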
  • 30. Elitism ▷ Fittest solutions of the generation are termed ‘elite’. ▷ These ‘elite’ individuals are cloned into the next generation; they are also included in the selection phase and may undergo mutations and crossovers. ▷ Solution quality of the fittest individuals doesn’t decrease from one generation to the next, while preserving the diversity of the remainder of the population. ▷ The number of such individuals is equal to Erate * Psize. 30
  • 31. Tournament Selection Example shows a Tournament Size 3. 31
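Tournament selection can be sketched in a few lines: draw a pool of individuals at random and keep the fittest. The version below samples the pool without replacement and assumes higher fitness is better; both choices, and the name `tournament_select`, are assumptions for illustration.

```python
import random

def tournament_select(population, fitness, pool_size, rng):
    """Pick pool_size distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), pool_size)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]

rng = random.Random(0)
population = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]   # stand-ins for chromosomes
fitness = [0.10, 0.42, 0.05, 0.31, 0.27, 0.38, 0.02, 0.19]
winner = tournament_select(population, fitness, 3, rng)         # pool of 3, as in the slide
```

A larger pool (the tuning table uses 7) raises selection pressure, since weak individuals are less likely to win a tournament.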
  • 33. Gene Repair ▷ Edges are removed regardless of node properties. ▷ A high-degree node potentially lies inside a cluster. ▷ The Gene Repair operator adds edges back to the high-degree nodes in the network. ▷ This overcomes the problem of a high-degree node ending up in the wrong cluster. 33
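The slides do not spell out how Gene Repair chooses which edges to restore, so the following is only one plausible sketch: rank the removed edges by the degree of their busier endpoint and restore the top `repair_size` of them (the tuning table gives GRSize = |E| * GRrate). The ranking rule, the function name, and the toy graph are all assumptions.

```python
def gene_repair(edges, chromosome, repair_size):
    """Restore up to repair_size removed edges, favouring high-degree endpoints."""
    degree = {}
    for u, v in edges:                      # degrees in the original network
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    removed = [i for i, bit in enumerate(chromosome) if bit == 1]
    # Assumed ranking: removed edges touching the highest-degree node come first.
    removed.sort(key=lambda i: max(degree[edges[i][0]], degree[edges[i][1]]),
                 reverse=True)
    repaired = list(chromosome)
    for i in removed[:repair_size]:
        repaired[i] = 0                     # '0' marks the edge as present again
    return repaired

# Hypothetical toy graph: node 0 is a hub of degree 3.
toy_edges = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
repaired = gene_repair(toy_edges, [1, 1, 1, 1, 1], repair_size=2)
```

In this toy case both restored edges are incident to the hub, so the hub is reattached to part of its neighbourhood instead of being left isolated.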
  • 34. Potential Problem - Gene Repair 34
  • 35. Islands ▷ We run different independent Island populations for more diversity. ▷ This helps mitigate the problem of a bad initial population. ▷ Number of islands: 5-25. 35
  • 36. Efficiency ▷ Complexity for one population ○ O(|V| lg |V| + |E|) ▷ Implemented in C++. No external libraries used. ▷ Memory optimization: chromosomes are held in a 3D bit array. ▷ For Omni, the bit array reduces memory by 87.5%. 36
  • 37. 3D Bit Array ▷ Yellow blocks hold ‘1’ and grey blocks hold ‘0’. ▷ 1st and 2nd layers flip (Genold, Gennew) after each generation. 37
  • 38. Fitness Function - “Sieve”ing Qs ▷ In-house fitness function. ▷ No bias against singletons and doubletons. ▷ Novel two-step method: BFS and objective optimization. 38
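The 87.5% figure matches what bit packing would give if the baseline stores each gene in one byte, since 1 - 1/8 = 0.875. The arithmetic below checks this for Omni, assuming that one-byte baseline, a population of 250, and two layers (old and new generation); the baseline and the layer count are assumptions, not details confirmed by the slides.

```python
def population_bytes(pop_size, num_edges, layers, bits_per_gene):
    """Bytes needed to hold every gene of every individual in every layer."""
    total_bits = pop_size * num_edges * layers * bits_per_gene
    return (total_bits + 7) // 8            # round up to whole bytes

OMNI_EDGES = 57488
byte_per_gene = population_bytes(250, OMNI_EDGES, 2, 8)   # one byte per gene
bit_per_gene = population_bytes(250, OMNI_EDGES, 2, 1)    # one bit per gene
saving = 1 - bit_per_gene / byte_per_gene

print(byte_per_gene, bit_per_gene, f"{saving:.1%}")
```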
  • 39. Summary ▷ Novel Two level chromosome representation. ▷ 3-D Bit Storage for Chromosome. ▷ Modularity and Qs - Two objectives optimized. ▷ Eliminate linearity using Uniform Crossover. ▷ Unique Gene Repair operator minimizes negative damage. 39
  • 40. 3. Results ERBGA - Compared to Best Results 40
  • 41. Computation Setup ▷ Experiments were run on an i7 2.1 GHz, 8GB RAM Linux machine. ▷ Additionally, Qs trials were performed on the Lewis HPC Cluster at Mizzou. ▷ Experiments were run for 48 hours. ▷ Lewis Cluster node configuration - exclusive node with 20GB RAM. 41
  • 42. Results. Network (BKR, MAGA-Net, ERBGA): Karate (0.420, 0.419, 0.420); Dolphin (0.529, 0.529, 0.445); Polbooks (0.527, 0.527, 0.256); Football (0.605, 0.605, 0.073). Dataset (Nodes, Edges, Memory Consumed in Megabytes): Karate (34, 78, 0.49); Dolphin (62, 159, 0.69); Polbooks (105, 441, 1.4); Football (115, 613, 1.8); 660k (962, 6672, 16.1); Enron (78849, 286379, 641). 42
  • 45. Modularity vs Qs - Karate 45 Qs Modularity
  • 47. Key observations ▷ Meaningful and unique chromosomal representation. ▷ Previous efforts expanded the search space by K! ▷ One to one representation from genotype to phenotype. ▷ Breaks linearity by uniform crossover. 47
  • 48. Issues to Tackle - Future ▷ Dense networks could remain a single cluster even after removing a lot of edges. ○ Football, Books on US Politics, and Omni showed this behaviour. ○ Contextual removal of edges needed. ○ Gene Repair partially helps. ▷ Slow mutations when near optimal solutions. 48
  • 49. Future Work ▷ Improve method to start evolution early. ▷ Prioritized removal of edges - in comparison to random removal. ▷ Validating randomized initialization. ▷ Biased selection with more chance to elite solutions. ▷ Possibly using combination of selection operators. 49
  • 50. Thanks! Any questions? Aditya Karnam agtk4@mail.umsl.edu Thanks to my Advisor: Dr. Sharlee Climer Friend: Michael Chan. 50