MINING DNA NETWORK EXPRESSIONS WITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMING
Manuel Belmadani, M. Computer Science (specialization in Bioinformatics) Candidate
mbelm006@uottawa.ca
Abstract
High-throughput sequencing experiments, such as ChIP-seq, identify protein interactions
within large samples of DNA. Such experiments may output thousands of
peak sequences where an interaction of interest is regulated by that region of DNA.
Computational biologists often require a de novo analysis of these peaks to mine short,
overrepresented patterns found frequently among the sequences. This task is referred
to as motif discovery. Existing tools struggle at motif discovery as the number of input
sequences increases. The time and complexity costs of calculating pattern occurrence
often require compromises on runtime, quality, or length of the output predictions.
The proposed solution (MotifGP) is a motif discovery tool driven by a Strongly Typed
Genetic Programming (STGP) algorithm equipped with multi-objective optimization.
STGP is a stochastic search process based on evolutionary algorithms that evolves
candidate solutions as type-checked programs. Multi-objective optimization allows the
algorithm to evolve according to multiple metrics, used to build predictions that are high in
both support and specificity. The software produces a collection of non-dominated
multi-objective solutions, which can then be aligned to reference DNA motif databases
(such as JASPAR, TRANSFAC, and Uniprobe) in order to validate or identify predictions.
References
[1] Chen, X. et al. Integration of external signaling pathways with the core transcriptional
network in embryonic stem cells. Cell 133, 1106–1117 (2008).
[2] Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data.
Bioinformatics 27, 1653–1659 (2011).
[3] Fortin, F.-A. & Parizeau, M. Revisiting the NSGA-II crowding-distance computation.
In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation,
623–630 (ACM, 2013).
[4] Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A. G., Parizeau, M. & Gagné, C.
DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 13, 2171–2175 (2012).
[5] Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying
similarity between motifs. Genome Biol. 8, R24 (2007).
Figure: CTCF (JASPAR MA0139.1). TOMTOM alignment logos of the target motif (top) with the
MotifGP prediction YDSYGSCVYSTDSTRDBCV (left, E-value: 6.13471E-09) and the DREME
prediction AGRKGGCR (right, E-value: 3.50248E-07).
MotifGP's motif fully overlaps the 19 bases of CTCF. Each tool's prediction targeted a
different strand (orientation), showing that the two methods complement each other.
Figure: Tcfcp2l1 (JASPAR MA0145.2). TOMTOM alignment logos of the target motif (top) with the
MotifGP prediction WCCVGHTHVDDCCDG (left, E-value: 1.33929E-07) and the DREME
prediction ADCCRG (right, E-value: 3.07791E-03).
The Tcfcp2l1 motif has two fragments joined by a weak gap (positions 7 to 8). Evolutionary
computing allows such fragments of a motif to evolve independently and eventually cross
over, allowing short gaps to appear naturally in the predictions.
Figure: Klf4 (JASPAR MA0039.2). TOMTOM alignment logos of the target motif (top) with the
MotifGP prediction GYGDGGGYGKGGC (left, E-value: 1.18496E-09) and the DREME
prediction GGGYGKGK (right, E-value: 3.28351E-08).
As shown by the Klf4 dataset prediction, MotifGP has the ability to overlap an entire
target motif. It even finds additional nucleotides attached to the left end of the
predicted motif, found in the input sequences.
Experimental Results
A sample of top-ranking motifs identified by MotifGP is compared to DREME's predictions
on the same datasets. Each top logo is the target motif from the database, aligned by
TOMTOM. The bottom logos are the predictions by MotifGP (left) and DREME (right).
Benchmarking
For each dataset, MotifGP was executed 10 times, each with a different random
initialization. The E-value of the top DREME motif alignment is used as the benchmark
value for MotifGP.
Over the 10 individual runs on each dataset, we counted the number of runs where a
prediction outperformed the benchmark. We also counted the runs within 3 orders of
magnitude of the benchmark E-value. This experiment shows that the algorithm performed
favorably in terms of stability in 11 out of 13 datasets, including 6 where MotifGP
succeeded in all 10 runs.
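The counting procedure above can be sketched as follows. This is a minimal illustration, not the poster's own script: the function name, the run E-values, and the choice to count runs that beat the benchmark as also being "within 3 orders of magnitude" are assumptions. Only the benchmark value is taken from the poster (the DREME E-value reported for CTCF).

```python
def summarize_runs(run_evalues, benchmark_evalue, tolerance_orders=3):
    """Count runs that beat the benchmark and runs that land near it.

    A lower E-value is better. A run is "near" the benchmark when its
    E-value is at most benchmark * 10**tolerance_orders (assumed reading
    of "within 3 orders of magnitude"; runs that beat the benchmark are
    included in this count).
    """
    threshold = benchmark_evalue * 10 ** tolerance_orders
    better = sum(1 for e in run_evalues if e < benchmark_evalue)
    near = sum(1 for e in run_evalues if e <= threshold)
    return better, near


# Benchmark: DREME E-value reported for CTCF on the poster.
benchmark = 3.50248e-07
# Hypothetical MotifGP run E-values, for illustration only.
runs = [6.1e-09, 2.0e-07, 4.0e-06, 1.0e-04, 5.0e-03]

print(summarize_runs(runs, benchmark))  # prints (2, 4)
```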
Terminology
Motif: A recurring feature in a set of elements, such as frequent patterns in
DNA strings.
Transcription Factor (TF): Proteins that bind to DNA at specific sites.
Network Expressions: Subset of the regular expression grammar. Defines patterns
in which each position has a set of possible characters.
DNA Motif Discovery: Mining DNA sequences for motifs. Can be used to identify TF
binding sites.
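Because network expressions are a subset of the regular expression grammar, they can be matched with an ordinary regex engine. The sketch below assumes the only allowed constructs are the letters A, C, G, T and flat bracketed character classes such as [CT]; the function names are hypothetical, and the sequences are the example strings shown in the poster's pipeline figure.

```python
import re

NUCLEOTIDES = set("ACGT")


def validate_network_expression(expr: str) -> None:
    """Raise ValueError unless expr uses only A/C/G/T and flat [...] classes."""
    inside = False
    class_size = 0
    for ch in expr:
        if ch == "[":
            if inside:
                raise ValueError("nested '[' is not allowed")
            inside, class_size = True, 0
        elif ch == "]":
            if not inside or class_size == 0:
                raise ValueError("unmatched or empty character class")
            inside = False
        elif ch in NUCLEOTIDES:
            class_size += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if inside:
        raise ValueError("unterminated '['")


def count_matching_sequences(expr: str, sequences: list[str]) -> int:
    """Number of sequences containing at least one occurrence of expr."""
    validate_network_expression(expr)
    pattern = re.compile(expr)
    return sum(1 for seq in sequences if pattern.search(seq))


sequences = ["GGTAGAAATA", "TTAAAACCT", "ATAAACCT", "AAAACCT"]
print(count_matching_sequences("AAC[CT]", sequences))  # prints 3
```

Validating before compiling keeps general regex features (wildcards, repetition) out of the pattern, so each position stays a plain set of possible characters.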
Methodology
To determine whether MotifGP can outperform DREME, we compared both tools on 13 datasets
from mouse embryonic stem cell ChIP-seq experiments [1]. The recorded runtimes for each
DREME run were used as time limits for MotifGP. Each dataset may contain one or more
factor motifs. Desirable predictions should report long overlaps and the lowest E-values
once aligned to known motifs.
Strongly Typed Genetic Programming
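Figure: Strongly Typed Network Expression grammar. Tree representation of the network
expression AAC[CT], with "+" nodes joining the leaves A, A, C and the character class [CT].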
Genetic Programming is a search heuristic in which a population of programs, created
under a template grammar, is evolved over generations of crossovers and mutations.
MotifGP uses a grammar to produce Network Expressions that can be matched against
sequences.
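To make the idea of grammar-constrained variation concrete, here is a deliberately simplified sketch. It is not MotifGP's STGP implementation: instead of typed expression trees, an individual is a flat list of positions, each a non-empty set of nucleotides, so every crossover or mutation result is a valid network expression by construction. All names and parameters are assumptions for illustration.

```python
import random

NUCLEOTIDES = "ACGT"


def random_position(rng):
    """One position: a non-empty set of allowed nucleotides."""
    size = rng.choice([1, 1, 2])  # arbitrary bias towards single letters
    return frozenset(rng.sample(NUCLEOTIDES, size))


def random_individual(rng, min_len=3, max_len=6):
    """A candidate pattern as a list of positions."""
    return [random_position(rng) for _ in range(rng.randint(min_len, max_len))]


def crossover(parent_a, parent_b, rng):
    """One-point crossover: a prefix of parent_a joined to a suffix of parent_b."""
    cut_a = rng.randint(1, len(parent_a))      # keep at least one position
    cut_b = rng.randint(0, len(parent_b) - 1)  # take at least one position
    return parent_a[:cut_a] + parent_b[cut_b:]


def mutate(individual, rng):
    """Replace one randomly chosen position with a fresh random position."""
    child = list(individual)
    child[rng.randrange(len(child))] = random_position(rng)
    return child


def to_network_expression(individual):
    """Render positions as text, e.g. [{A}, {C, T}] becomes 'A[CT]'."""
    parts = []
    for position in individual:
        letters = "".join(sorted(position))
        parts.append(letters if len(letters) == 1 else f"[{letters}]")
    return "".join(parts)


rng = random.Random(0)
population = [random_individual(rng) for _ in range(4)]
child = mutate(crossover(population[0], population[1], rng), rng)
print(to_network_expression(child))
```

In a typed tree representation the same guarantee comes from the type system: only operators whose argument and return types agree can be combined, so invalid patterns are never generated.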
Multi-objective optimization
In population-based evolutionary algorithms, new individuals are created over
generations and evaluated by a fitness function, which scores how well each individual
performs as a solution. In multi-objective optimization, each individual has more than
one fitness value. Fitnesses with multiple values are compared and ranked by a
multi-objective optimization algorithm [3].
MotifGP matches candidate solutions (network expressions) against a set of input
sequences and a set of control sequences. We use the match count on each set to compute
two metrics, discrimination and coverage. For a given pattern, let P be the set of
matched input sequences, C be the set of matched control sequences, and k be the total
number of input sequences. Discrimination favors solutions that are not highly
represented in the control sequences, while coverage maintains solutions that are found
in multiple input sequences.
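The two-metric fitness and the notion of non-dominated solutions can be sketched as below. The formulas for the two metrics are not preserved in this text, so the sketch assumes discrimination = |P| / (|P| + |C|) and coverage = |P| / k, which is consistent with the descriptions of P, C and k but should be read as an assumption. The ranking is a plain Pareto filter, not the NSGA-II procedure of [3]. The control sequences and function names are hypothetical; the input sequences and candidate patterns are the examples from the poster's pipeline figure.

```python
import re


def fitness(expr, input_seqs, control_seqs):
    """Return (discrimination, coverage) for one network expression.

    Assumed definitions: discrimination = |P| / (|P| + |C|),
    coverage = |P| / k.
    """
    pattern = re.compile(expr)
    p = sum(1 for s in input_seqs if pattern.search(s))    # |P|
    c = sum(1 for s in control_seqs if pattern.search(s))  # |C|
    k = len(input_seqs)
    discrimination = p / (p + c) if p + c else 0.0
    coverage = p / k if k else 0.0
    return discrimination, coverage


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(scored):
    """Keep the candidates whose fitness no other candidate dominates."""
    return {
        name: fit
        for name, fit in scored.items()
        if not any(dominates(other, fit) for other in scored.values())
    }


input_seqs = ["GGTAGAAATA", "TTAAAACCT", "ATAAACCT", "AAAACCT"]
control_seqs = ["GGGAAAGG", "CCGTACGT", "TTTTAAAT"]  # hypothetical controls
candidates = ["A[AT]A", "A[AC][CT]", "AAC[CT]"]

scored = {expr: fitness(expr, input_seqs, control_seqs) for expr in candidates}
front = non_dominated(scored)
print(sorted(front))  # prints ['AAC[CT]', 'A[AC][CT]']
```

Here A[AT]A is dropped because A[AC][CT] has the same coverage and higher discrimination, while A[AC][CT] and AAC[CT] trade coverage against discrimination and therefore both remain on the front.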
Conclusions and Future Work
Combining STGP and multi-objective
optimization directs the solution search
towards motifs that align with high
confidence and fully overlap known motifs
from databases.
Other existing ChIP-seq experiments can be
revisited using MotifGP as it may be able to
uncover subtle patterns that other tools could
not.
Future development of MotifGP will
investigate different multi-objective fitness
function variants and explore other
evolutionary methods.
Background and Motivation
A state-of-the-art motif discovery tool, DREME, predicts DNA motifs by mining and refining
k-mers (words) from an input set of sequences [2]. The predictions are compared to known
motif databases with TOMTOM [5]. Reported alignments are listed and scored by E-value.
While DREME predictions have significant E-values, the pattern length is limited to 8
characters. Some known motifs are more than twice the size DREME can predict. This
restriction keeps runtimes low on real ChIP-seq datasets. We propose our software,
MotifGP, as an alternative that addresses this limitation. Our goal is to explore the
potential of multi-objective evolutionary computing in finding significant motif
predictions on large datasets, under runtimes comparable to DREME.
Figure: De novo motif discovery pipeline. DNA sequences (e.g. GGTAGAAATA, TTAAAACCT,
ATAAACCT, AAAACCT) are passed to motif discovery (MotifGP or DREME), which outputs motif
predictions (e.g. A[AT]A, A[AC][CT], AAC[CT]). TOMTOM aligns the predictions to motif
databases (JASPAR, Uniprobe) and reports an alignment score (example: GYGDGGGYGKGGC
aligned to target Klf4, MA0039.2, E-value: 1.18E-08).

More Related Content

What's hot

GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGINGGENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGINGijnlc
 
Drug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDatabricks
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBhaskar Mitra
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Miningijsc
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithmsIkutwa
 
Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Anubhav Jain
 
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Kishor Datta Gupta
 
Meta Machine Learning: Hyperparameter Optimization
Meta Machine Learning: Hyperparameter OptimizationMeta Machine Learning: Hyperparameter Optimization
Meta Machine Learning: Hyperparameter OptimizationPriyatham Bollimpalli
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressBhaskar Mitra
 
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic RetrievalProbabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic RetrievalYI-JHEN LIN
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..butest
 
Gaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory NetworksGaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory NetworksMichael Stumpf
 

What's hot (20)

[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGINGGENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
GENETIC APPROACH FOR ARABIC PART OF SPEECH TAGGING
 
Drug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge GraphsDrug Repurposing using Deep Learning on Knowledge Graphs
Drug Repurposing using Deep Learning on Knowledge Graphs
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
 
A Review on Text Mining in Data Mining
A Review on Text Mining in Data MiningA Review on Text Mining in Data Mining
A Review on Text Mining in Data Mining
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
 
Nlp text classification
Nlp text classificationNlp text classification
Nlp text classification
 
Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm Particle Swarm Optimization based K-Prototype Clustering Algorithm
Particle Swarm Optimization based K-Prototype Clustering Algorithm
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Meta-Learning Presentation
Meta-Learning PresentationMeta-Learning Presentation
Meta-Learning Presentation
 
Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...Extracting and Making Use of Materials Data from Millions of Journal Articles...
Extracting and Making Use of Materials Data from Millions of Journal Articles...
 
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
 
Meta Machine Learning: Hyperparameter Optimization
Meta Machine Learning: Hyperparameter OptimizationMeta Machine Learning: Hyperparameter Optimization
Meta Machine Learning: Hyperparameter Optimization
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progress
 
P33077080
P33077080P33077080
P33077080
 
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic RetrievalProbabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..
 
Gaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory NetworksGaining Confidence in Signalling and Regulatory Networks
Gaining Confidence in Signalling and Regulatory Networks
 
E43022023
E43022023E43022023
E43022023
 

Viewers also liked

Desarrollo de un Inversor sinusoidal
Desarrollo de un Inversor sinusoidal Desarrollo de un Inversor sinusoidal
Desarrollo de un Inversor sinusoidal Edwin Marte
 
Sumador/Restador en Punto Flotante IEEE 745 en el MIPS
Sumador/Restador en Punto Flotante IEEE 745 en el MIPSSumador/Restador en Punto Flotante IEEE 745 en el MIPS
Sumador/Restador en Punto Flotante IEEE 745 en el MIPSSNPP
 
Planificacion de Electronica industrial
Planificacion de Electronica industrialPlanificacion de Electronica industrial
Planificacion de Electronica industrialindira
 
Circuitos reguladores de tension jhonatan
Circuitos reguladores de tension   jhonatanCircuitos reguladores de tension   jhonatan
Circuitos reguladores de tension jhonatanjhonatan1810
 
Dimensionamento de sistemas fotovoltaicos
Dimensionamento de sistemas fotovoltaicosDimensionamento de sistemas fotovoltaicos
Dimensionamento de sistemas fotovoltaicosLaurindo Carinhas
 
Ciencia de los materiales
Ciencia de los materialesCiencia de los materiales
Ciencia de los materialesDiego Gomez
 
Guia laboratorio 1 de electrónica nueva
Guia laboratorio 1 de electrónica nuevaGuia laboratorio 1 de electrónica nueva
Guia laboratorio 1 de electrónica nuevaWalter Cordoba
 
Trabajo modulación (AM)
Trabajo modulación (AM)Trabajo modulación (AM)
Trabajo modulación (AM)Cristopher_G
 
Comunicaciones modulacion am
Comunicaciones modulacion amComunicaciones modulacion am
Comunicaciones modulacion amVicente Bermudez
 
Electronica 2015 competencias-capacidades
Electronica   2015  competencias-capacidadesElectronica   2015  competencias-capacidades
Electronica 2015 competencias-capacidadesyoelaguila
 
Amplitud modulada am
Amplitud modulada amAmplitud modulada am
Amplitud modulada amralch1978
 
Taller de baterias
Taller de bateriasTaller de baterias
Taller de bateriassfenollar
 

Viewers also liked (20)

Desarrollo de un Inversor sinusoidal
Desarrollo de un Inversor sinusoidal Desarrollo de un Inversor sinusoidal
Desarrollo de un Inversor sinusoidal
 
inversor-dc-ac-con-cargador-de-bateria
inversor-dc-ac-con-cargador-de-bateriainversor-dc-ac-con-cargador-de-bateria
inversor-dc-ac-con-cargador-de-bateria
 
Sumador/Restador en Punto Flotante IEEE 745 en el MIPS
Sumador/Restador en Punto Flotante IEEE 745 en el MIPSSumador/Restador en Punto Flotante IEEE 745 en el MIPS
Sumador/Restador en Punto Flotante IEEE 745 en el MIPS
 
Las webquest jeicy
Las webquest jeicyLas webquest jeicy
Las webquest jeicy
 
Electrónica (1)
Electrónica (1)Electrónica (1)
Electrónica (1)
 
Planificacion de Electronica industrial
Planificacion de Electronica industrialPlanificacion de Electronica industrial
Planificacion de Electronica industrial
 
Proyecto electronica
Proyecto electronicaProyecto electronica
Proyecto electronica
 
Circuitos reguladores de tension jhonatan
Circuitos reguladores de tension   jhonatanCircuitos reguladores de tension   jhonatan
Circuitos reguladores de tension jhonatan
 
Dimensionamento de sistemas fotovoltaicos
Dimensionamento de sistemas fotovoltaicosDimensionamento de sistemas fotovoltaicos
Dimensionamento de sistemas fotovoltaicos
 
Ciencia de los materiales
Ciencia de los materialesCiencia de los materiales
Ciencia de los materiales
 
MOD-20160810
MOD-20160810MOD-20160810
MOD-20160810
 
Guia laboratorio 1 de electrónica nueva
Guia laboratorio 1 de electrónica nuevaGuia laboratorio 1 de electrónica nueva
Guia laboratorio 1 de electrónica nueva
 
Trabajo modulación (AM)
Trabajo modulación (AM)Trabajo modulación (AM)
Trabajo modulación (AM)
 
Oscilador de puente wien
Oscilador de puente wienOscilador de puente wien
Oscilador de puente wien
 
Comunicaciones modulacion am
Comunicaciones modulacion amComunicaciones modulacion am
Comunicaciones modulacion am
 
Practica6 0708
Practica6 0708Practica6 0708
Practica6 0708
 
Fuente rectificadora onda completa diagrama
Fuente rectificadora onda completa diagramaFuente rectificadora onda completa diagrama
Fuente rectificadora onda completa diagrama
 
Electronica 2015 competencias-capacidades
Electronica   2015  competencias-capacidadesElectronica   2015  competencias-capacidades
Electronica 2015 competencias-capacidades
 
Amplitud modulada am
Amplitud modulada amAmplitud modulada am
Amplitud modulada am
 
Taller de baterias
Taller de bateriasTaller de baterias
Taller de baterias
 

Similar to 2015-03-31_MotifGP

IRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining AlgorithmIRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining AlgorithmIRJET Journal
 
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String MatchingA Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String MatchingIJERA Editor
 
Performance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence DetectionalgoPerformance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence DetectionalgoRahul Shirude
 
An analogy of algorithms for tagging of single nucleotide polymorphism and ev
An analogy of algorithms for tagging of single nucleotide polymorphism and evAn analogy of algorithms for tagging of single nucleotide polymorphism and ev
An analogy of algorithms for tagging of single nucleotide polymorphism and evIAEME Publication
 
An interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patternsAn interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patternsRavi Kumar
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger Eli Kaminuma
 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classificationcsandit
 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...cscpconf
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...cscpconf
 
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHGPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHijdms
 
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...CSCJournals
 
20131019 生物物理若手 Journal Club
20131019 生物物理若手 Journal Club20131019 生物物理若手 Journal Club
20131019 生物物理若手 Journal ClubMed_KU
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...BaoTramDuong2
 
Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...eSAT Journals
 
Comparative analysis of dynamic programming
Comparative analysis of dynamic programmingComparative analysis of dynamic programming
Comparative analysis of dynamic programmingeSAT Publishing House
 

Similar to 2015-03-31_MotifGP (20)

IRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining AlgorithmIRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
 
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String MatchingA Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
A Novel Framework for Short Tandem Repeats (STRs) Using Parallel String Matching
 
Performance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence DetectionalgoPerformance Efficient DNA Sequence Detectionalgo
Performance Efficient DNA Sequence Detectionalgo
 
An analogy of algorithms for tagging of single nucleotide polymorphism and ev
An analogy of algorithms for tagging of single nucleotide polymorphism and evAn analogy of algorithms for tagging of single nucleotide polymorphism and ev
An analogy of algorithms for tagging of single nucleotide polymorphism and ev
 
2224d_final
2224d_final2224d_final
2224d_final
 
An interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patternsAn interactive approach to multiobjective clustering of gene expression patterns
An interactive approach to multiobjective clustering of gene expression patterns
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 
presentation
presentationpresentation
presentation
 
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
 
Medical diagnosis classification
Medical diagnosis classificationMedical diagnosis classification
Medical diagnosis classification
 
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
MEDICAL DIAGNOSIS CLASSIFICATION USING MIGRATION BASED DIFFERENTIAL EVOLUTION...
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACHGPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
GPCODON ALIGNMENT: A GLOBAL PAIRWISE CODON BASED SEQUENCE ALIGNMENT APPROACH
 
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
Biclustering using Parallel Fuzzy Approach for Analysis of Microarray Gene Ex...
 
20131019 生物物理若手 Journal Club
20131019 生物物理若手 Journal Club20131019 生物物理若手 Journal Club
20131019 生物物理若手 Journal Club
 
An26247254
An26247254An26247254
An26247254
 
Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...Discover How Scientific Data is Used for the Public Good with Natural Languag...
Discover How Scientific Data is Used for the Public Good with Natural Languag...
 
Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...
 
Comparative analysis of dynamic programming
Comparative analysis of dynamic programmingComparative analysis of dynamic programming
Comparative analysis of dynamic programming
 

2015-03-31_MotifGP

  • 1. MINING DNA NETWORK EXPRESSIONS WITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMINGWITH EVOLUTIONARY MULTI-OBJECTIVE PROGRAMMING MINING DNA NETWORK EXPRESSIONS Manuel Belmadani, M. Computer Science (specialization in Bioinformatics) Candidate mbelm006@uottawa.ca The proposed solution (MotifGP) is a motif discovery tool driven by a Strongly Typed Genetic Programming (STGP) algorithm equipped with multi-objective optimization. STGP is a stochastic search process based off evolutionary algorithms which evolves candidate solutions as typed checked programs. Multi-objective optimization allows the algorithm to evolve according to multiple metrics used to build predictions that are high in both support and specificificity. The software produces a collection of non-dominated multi-objective solutions, which can then be aligned to reference DNA motif databases (such as JASPAR, TRANSFAC, and Uniprobe) in order to validate/identify predictions. Abstract High-throughput sequencing experiments, such as ChIP-seq, identify protein interactions within large samples of DNA. Such experiments may identify and output thousands of peak sequences where an interaction of interest is regulated by that region of DNA. Computational biologists often require a de novo analysis of these peaks to mine short overrepresented patterns, frequently found amongst the sequences. This task is referred to as motif discovery. Existing tools struggle at motif discovery if the number of input sequences increases. The time and complexity costs of calculating pattern occurrence often requires compromises on runtime, quality, or length of the output predictions. [1] Chen, X. et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106–1117 (2008). [2] Bailey, T. L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics 27, 1653–1659 (2011). [3] Fortin, F.-A. & Parizeau, M. 
Revisiting the NSGA-II crowding-distance computation. in Proceedings of the 15th annual conference on Genetic and evolutionary computation 623–630 (ACM, 2013). [4] Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A. G., Parizeau, M. & Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 13, 2171–2175 (2012) [5] Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).. References MA0139.1 YDSYGSCVYSTDSTRDBCV Tomtom 13.03.15 17:31 0 1 2 bits 1 G C T 2 T G A 3 T C G 4 T C 5 A T G 6 T A C 7 C 8 T A C 9 T C 10 C 11 A C T 12 T A G 13 T C G 14 C A T 15 G 16 C A T G 17 A T C 18 T A C 19 C T G A 0 1 2 bits 1 C T 2 A G T 3 C G 4 C T 5 G 6 C G 7 C 8 A C G 9 C T 10 C G 11 T 12 A G T 13 C G 14 T 15 A G 16 A G T 17 C G T 18 C 19 A C G MA0139.1 AGRKGGCR Tomtom 13.03.15 13:33 0 1 2 bits 1 G A C T 2 A T G 3 T A G 4 G T A C 5 C 6 G T A 7 A G C 8 A T C 9 T G A10 G11A G 12 A T G 13 G 14 A T G 15 T A C 16 A G 17 A G C 18 A C T 19 C G A 0 1 2 bits 1 A 2 G 3 A G 4 T G 5 G 6 G7 C8 G A MOTIFGP DREME CTCF MotifGP's motif fully overlaps the 19 bases of CTCF. Each tool's prediction targeted a different strand (orientation), showing a complementation between each methods. E-value: 6.13471E-09 E-value: 3.50248E-07 MA0145.2 WCCVGHTHVDDCCDG Tomtom 13.03.15 17:30 0 1 2 bits 1 G C 2 A T C 3 T G A 4 C G 5 A C T 6 G A C T 7 T C 8 9 C T G A 10 T G A 11 G C 12 A T C 13 T G A 14 C G 0 1 2 bits 1 A T 2 C 3 C 4 A C G 5 G 6 A C T 7 T 8 A C T 9 A C G 10 A G T 11 A G T 12 C 13 C 14 A G T 15 G MA0145.2 ADCCRG Tomtom 13.03.15 13:33 0 1 2 bits 1G C2 A T C 3 T G A 4 C G 5 A C T 6 G A C T 7 T C 8 9 C T G A 10 T G A 11 G C 12 A T C 13 T G A 14 C G 0 1 2 bits 1 A 2 T A G 3 C 4 C 5 G A 6 G Tcfcp2l1 MOTIFGP DREME The Tcfcp2l1 motif has two fragments joined by a weak gap (position 7 to 8). 
Evolutionary computing allows such fragments of a motif to evolve independently and eventually cross over, allowing short gaps to appear naturally in the predictions. E-values for the Tcfcp2l1 alignments: 1.33929E-07, 3.07791E-03.

[Figure: TOMTOM alignments to Klf4 (MA0039.2). MotifGP prediction: GYGDGGGYGKGGC; DREME prediction: GGGYGKGK. E-values: 1.18496E-09, 3.28351E-08.]
As shown by the Klf4 dataset prediction, MotifGP can overlap an entire target motif. It even finds additional nucleotides, present in the input sequences, attached to the left end of the predicted motif.

Experimental Results
A sample of the top-ranking motifs identified by MotifGP is compared to DREME's predictions on the same datasets. In each panel, the top logo is the target motif from the database, aligned by TOMTOM. The bottom logos are the predictions by MotifGP (left) and DREME (right).

Benchmarking
For each dataset, MotifGP was executed 10 times, each with a different random initialization. The E-value of the top DREME motif alignment is used as the benchmark value for MotifGP. Over the 10 runs on each dataset, we counted the runs in which a prediction outperformed the benchmark. We also counted the runs that came within 3 orders of magnitude of the benchmark E-value. This experiment shows that the algorithm performed favorably in terms of stability in 11 out of 13 datasets, including 6 where MotifGP succeeded in all 10 runs.

Terminology
Motif: A recurring feature in a set of elements, such as frequent patterns in DNA strings.
Transcription Factor (TF): A protein that binds to DNA at specific sites.
Network Expressions: A subset of the regular expression grammar.
Defines patterns in which each position has a set of possible characters.
DNA Motif Discovery: Mining DNA sequences for motifs. Can be used to identify TF binding sites.

Methodology
To determine whether MotifGP can outperform DREME, we compared both tools on 13 datasets from mouse embryonic stem cell ChIP-seq experiments [1]. The runtime recorded for each DREME run was used as the time limit for MotifGP. Each dataset may contain one or more factor motifs. Desirable predictions should report long overlaps and the lowest E-values once aligned to known motifs.

Strongly Typed Genetic Programming
Genetic Programming is a search heuristic in which a population of programs, created under a template grammar, is evolved over generations of crossovers and mutations. MotifGP uses a grammar to produce Network Expressions that can be matched against sequences.

[Figure: Strongly Typed Network Expression grammar, illustrated with the expression AAC[CT].]

Multi-objective optimization
In population-based evolutionary algorithms, new individuals are created over generations and evaluated by a fitness function, which scores how well each individual performs as a solution. In multi-objective optimization, individuals have more than one fitness value. Fitnesses with multiple values are compared and ranked by a multi-objective optimization algorithm [3].

MotifGP matches candidate solutions (network expressions) against a set of input sequences and a set of control sequences. We use the match count on each set to compute two metrics, discrimination and coverage. For a given pattern, let P be the set of matched input sequences, C be the set of matched control sequences, and k be the total number of input sequences. Discrimination favors solutions that are not highly represented in the control sequences, while coverage maintains solutions that are found in multiple input sequences.
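The two metrics and the ranking step described above can be sketched in Python. The exact formulas do not appear in this transcript, so the ratios used here (|P| / (|P| + |C|) for discrimination and |P| / k for coverage) are assumptions chosen only to match the verbal description. The match counts and patterns are hypothetical, and the plain Pareto filter below stands in for the ranking algorithm cited as [3]; it is not that algorithm.

```python
def discrimination(p, c):
    """Assumed form: share of all matches that fall in the input set.
    p = |P| (matched input sequences), c = |C| (matched control sequences)."""
    total = p + c
    return p / total if total > 0 else 0.0


def coverage(p, k):
    """Assumed form: fraction of the k input sequences that are matched."""
    return p / k


def dominates(a, b):
    """True if fitness a is no worse than b on every objective
    and strictly better on at least one (both objectives are maximized)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    better = any(x > y for x, y in zip(a, b))
    return no_worse and better


def non_dominated(scored):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: fit
        for name, fit in scored.items()
        if not any(dominates(other, fit) for other in scored.values())
    }


# Hypothetical match counts (|P|, |C|) for k = 100 input sequences.
k = 100
counts = {
    "AAC[CT]": (80, 40),
    "A[AT]A": (95, 90),
    "AACC": (40, 5),
    "[AT]C": (50, 50),
}
scored = {
    name: (discrimination(p, c), coverage(p, k))
    for name, (p, c) in counts.items()
}
front = non_dominated(scored)
for name, (disc, cov) in sorted(front.items()):
    print(f"{name}: discrimination={disc:.3f}, coverage={cov:.3f}")
```

In this toy population, "[AT]C" is dropped because "AAC[CT]" scores higher on both objectives. The other three patterns each win on one objective and lose on the other, so all of them stay in the non-dominated set.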
Conclusions and Future Work
Combining STGP and multi-objective optimization directs the solution search towards motifs that align with high confidence and fully overlap known motifs from databases. Other existing ChIP-seq experiments can be revisited using MotifGP, as it may be able to uncover subtle patterns that other tools could not. Future development of MotifGP will investigate different multi-objective fitness function variants and explore other evolutionary methods.

Background and Motivation
A state-of-the-art motif discovery tool, DREME, predicts DNA motifs by mining and refining k-mers (words) from an input set of sequences [2]. The predictions are compared to known motif databases with TOMTOM [5]. Reported alignments are listed and scored by E-value.

While DREME predictions have significant E-values, the pattern length is limited to 8 characters. Some known motifs are more than twice the size DREME can predict. This restriction keeps runtimes low on real ChIP-seq datasets. We propose our software, MotifGP, as an alternative that addresses this limitation. Our goal is to explore the potential of multi-objective evolutionary computing in finding significant motif predictions on large datasets, under runtimes comparable to DREME.

[Figure: De novo motif discovery pipeline. DNA sequences (e.g. GGTAGAAATA, TTAAAACCT, ATAAACCT, AAAACCT) go through motif discovery (MotifGP or DREME), producing motif predictions (e.g. A[AT]A, A[AC][CT], AAC[CT]). TOMTOM aligns the predictions against motif databases (JASPAR, Uniprobe) and reports an alignment score. Example: GYGDGGGYGKGGC aligned to target Klf4 (MA0039.2), E-value: 1.18E-08.]
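Because network expressions such as AAC[CT] are a subset of regular expressions, matching them against sequences can be illustrated with Python's standard `re` module. This is a minimal sketch, not MotifGP's implementation: the function names are hypothetical, only the forward strand is searched, and the validation rule (single bases or bracketed sets of bases) is an assumption based on the terminology given on the poster.

```python
import re

# Assumed shape of a network expression: a run of positions, each either
# a single base or a bracketed set of allowed bases, e.g. AAC[CT].
NETWORK_EXPRESSION = re.compile(r"(?:[ACGT]|\[[ACGT]+\])+")


def is_network_expression(expression):
    """Check that the expression uses only bases and bracketed base sets."""
    return NETWORK_EXPRESSION.fullmatch(expression) is not None


def count_matched_sequences(expression, sequences):
    """Count the sequences containing at least one match of the expression."""
    if not is_network_expression(expression):
        raise ValueError(f"not a network expression: {expression!r}")
    pattern = re.compile(expression)
    return sum(1 for seq in sequences if pattern.search(seq))


# Example sequences and expressions taken from the pipeline figure.
sequences = ["GGTAGAAATA", "TTAAAACCT", "ATAAACCT", "AAAACCT"]
for expression in ["A[AT]A", "A[AC][CT]", "AAC[CT]"]:
    matched = count_matched_sequences(expression, sequences)
    print(f"{expression}: matches {matched} of {len(sequences)} sequences")
```

Counting matched sequences, rather than total occurrences, fits the poster's description of P and C as sets of matched sequences.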