P0126557 slides

An Evolutionary Algorithm Approach for Feature
Generation from Sequence Data and Its
Application to DNA Splice Site Prediction
Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu,
Jack Compton, Rezarta Islamaj-Dogan.
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics,
Vol. 9, No. 5, September/October 2012.
Presenter: Nguyen Dinh Chien (阮庭戰)

An EA approach for FG from sequence data and
its application to DNA splice site prediction
 A challenge for machine learning methods is associating functional
information with biological sequences. Their performance often depends
on deriving predictive features from the sequences to be classified.
 Feature generation is a difficult problem. Generating a few features is
often the task of domain experts or exhaustive feature enumeration
techniques. The predictive power of these features is tested in the context of classification.
 The authors therefore propose an evolutionary algorithm (EA) that effectively
explores a large feature space and generates predictive features from
sequence data.

Outline
 Introduction
 Methods
 Materials
 Discussion
 Conclusions

Introduction
 A variety of general-purpose search techniques are effective for NP-hard
problems.
 In this paper, the authors explore the use of evolutionary algorithms to search
a large and complex feature space, with the goal of obtaining features
from sequence data that significantly improve the classification
accuracy of a Support Vector Machine (SVM). The approach is
evaluated on the difficult problem of DNA splice site prediction.
 They use Genetic Programming (GP) techniques to evolve features represented
as tree structures. The approach is called FG-EA, for Feature Generation with an
Evolutionary Algorithm.

Introduction
 Using an efficient fitness function, they identify a set of candidate features (a
hall of fame) to be used as input to a standard SVM classification procedure.
 In comparison with state-of-the-art feature-based classification
methods, FG-EA features significantly improve classification
performance.

The DNA splice site prediction problem
 A eukaryotic DNA sequence is first transcribed into precursor mRNA (pre-mRNA). Mature messenger RNA (mRNA)
forms only after enzymes splice away noncoding regions (introns) from
the pre-mRNA to leave only coding regions (exons). Prediction of splice sites
is therefore a fundamental component of the gene-finding
problem.
 Splice site prediction is a difficult problem. The dinucleotides AG and GT (GU in RNA) that mark splice sites cannot be used
as features on their own because they are abundant in non-splice-site sequences.

Related EA work
 Many studies have demonstrated the advantages of EAs for feature generation in
different domains, for example fast genetic selection (F.Z. Brill et al. 1992), nearest
neighbor classifiers (L.I. Kuncheva and L.C. Jain 1999), dimensionality reduction
(M.L. Raymer et al. 2007), …
 Methods that obtain predictive features from sequence data have
shown success in diverse bioinformatics problems.
 Work on predicting enzymatic activity in proteins additionally shows the
power of EAs in feature generation.


Feature representation
 In FG-EA, each feature is a parse tree. The leaves, also referred to as terminals, are either
characters from the DNA alphabet {A,C,G,T} or integers corresponding to
positions or motif (k-mer) lengths; the internal nodes are the operators Length,
Position, Motif, Matches, MatchAsPosition, AND, OR, and NOT.
[Figure: example parse trees for positional features and compositional features]
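The parse-tree idea can be illustrated with a minimal sketch. It assumes a simplified encoding that is not taken from the paper: trees are nested tuples, `Matches` tests whether a motif occurs anywhere in the sequence (a compositional feature), and `MatchAsPosition` tests whether it starts at a given 0-based index (a positional feature). The Length, Position, and Motif nodes are folded into plain Python values here.

```python
def evaluate(tree, seq):
    """Return True if the feature encoded by `tree` is present in `seq`."""
    op = tree[0]
    if op == "Matches":            # compositional: motif anywhere in the sequence
        return tree[1] in seq
    if op == "MatchAsPosition":    # positional: motif starts at a 0-based index
        motif, pos = tree[1], tree[2]
        return seq.startswith(motif, pos)
    if op == "AND":
        return evaluate(tree[1], seq) and evaluate(tree[2], seq)
    if op == "OR":
        return evaluate(tree[1], seq) or evaluate(tree[2], seq)
    if op == "NOT":
        return not evaluate(tree[1], seq)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical feature: contains CAG and does not start with GT.
feature = ("AND", ("Matches", "CAG"), ("NOT", ("MatchAsPosition", "GT", 0)))
print(evaluate(feature, "ACAGT"))
```

Nested tuples keep the trees immutable and hashable, which is convenient when features are stored in a hall of fame.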

Generating features
 The initial population (generation 0) consists of N = 15,000 randomly
generated features.
 The trees representing features are generated using the well-known
ramped half-and-half method, which combines the Full
and Grow techniques. Together they produce a mixture of fully
balanced trees and bushy trees; each technique is employed
with equal probability of 0.5.
 Subsequent generations are evolved using standard GP (Genetic
Programming) selection, crossover, and mutation mechanisms.
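A rough sketch of ramped half-and-half initialization, under stated assumptions: trees are nested tuples over a reduced operator set (AND, OR, NOT, and a `Matches` leaf), the depth limit is ramped from 1 to `max_depth`, and the Grow method stops early with a hypothetical probability of 0.3. Only the 0.5 split between Full and Grow comes from the slides.

```python
import random

ALPHABET = "ACGT"


def random_tree(depth, full, rng):
    """Build a random Boolean feature tree of at most `depth` levels."""
    # Grow may stop early at a leaf; Full only stops at the depth limit.
    if depth == 0 or (not full and rng.random() < 0.3):
        motif = "".join(rng.choice(ALPHABET) for _ in range(rng.randint(1, 4)))
        return ("Matches", motif)
    op = rng.choice(["AND", "OR", "NOT"])
    if op == "NOT":
        return ("NOT", random_tree(depth - 1, full, rng))
    return (op, random_tree(depth - 1, full, rng), random_tree(depth - 1, full, rng))


def ramped_half_and_half(n, max_depth, rng):
    """Create n trees, ramping the depth limit and mixing Full and Grow."""
    population = []
    for i in range(n):
        depth = 1 + i % max_depth      # ramp depth limits 1..max_depth
        full = rng.random() < 0.5      # Full or Grow, each with probability 0.5
        population.append(random_tree(depth, full, rng))
    return population


def tree_depth(tree):
    """Depth of a tree; a single Matches leaf has depth 0."""
    if tree[0] == "Matches":
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


population = ramped_half_and_half(20, 4, random.Random(0))
print(len(population), max(tree_depth(t) for t in population))
```

The example builds 20 trees for readability; the slides state a population of N = 15,000.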

Genetic Operators
 Given a set of m features extracted from the hall of fame to serve as parents in a
generation, the remaining GenSize - m features are generated using the mutation
and crossover operators.
 Three breeding pipelines are employed:
 mutation
 mutation-ERC
 crossover
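The breeding step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: trees are nested tuples whose leaves are `("Matches", motif)` nodes, crossover replaces a random subtree of one parent with a random subtree of another, and `mutate_constant` changes one motif character, loosely in the spirit of mutating an ephemeral random constant (ERC). The 50/50 choice between operators is hypothetical.

```python
import random


def subtrees(tree, path=()):
    """Yield (path, subtree) for every operator node in a nested-tuple tree."""
    yield path, tree
    for i, child in enumerate(tree[1:], start=1):
        if isinstance(child, tuple):
            yield from subtrees(child, path + (i,))


def replace_at(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(a, b, rng):
    """Insert a random subtree of `b` at a random point of `a`."""
    path, _ = rng.choice(list(subtrees(a)))
    _, donor = rng.choice(list(subtrees(b)))
    return replace_at(a, path, donor)


def mutate_constant(tree, rng):
    """Change one character of one randomly chosen motif leaf."""
    leaves = [(p, t) for p, t in subtrees(tree) if t[0] == "Matches"]
    path, (_, motif) = rng.choice(leaves)
    k = rng.randrange(len(motif))
    new_motif = motif[:k] + rng.choice("ACGT") + motif[k + 1:]
    return replace_at(tree, path, ("Matches", new_motif))


def next_generation(hall_of_fame, m, gen_size, rng):
    """Keep m parents and breed the remaining gen_size - m features."""
    parents = hall_of_fame[:m]
    children = []
    while len(children) < gen_size - m:
        if rng.random() < 0.5:   # hypothetical operator probability
            children.append(crossover(rng.choice(parents), rng.choice(parents), rng))
        else:
            children.append(mutate_constant(rng.choice(parents), rng))
    return parents + children


hall = [("AND", ("Matches", "CAG"), ("Matches", "GT")), ("NOT", ("Matches", "TTA"))]
print(len(next_generation(hall, 2, 10, random.Random(0))))
```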

Fitness function
 The fitness function is key to achieving an efficient and effective
EA search heuristic.
 FG-EA uses a surrogate fitness function given by:
$$Fitness(f) = \frac{C_{+,f}}{C_{+}} \cdot IG(f)$$
 f is a feature; the ratio C_{+,f}/C_{+} is weighted by the information gain (IG).
 IG is often employed as a criterion of a feature's goodness in machine
learning. Given m class attributes:
$$IG(f) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + P(f) \sum_{i=1}^{m} P(c_i \mid f) \log P(c_i \mid f) + P(\bar{f}) \sum_{i=1}^{m} P(c_i \mid \bar{f}) \log P(c_i \mid \bar{f})$$
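A small sketch of how this fitness could be computed for a binary feature. Two points are assumptions rather than statements from the slides: C_{+,f} is read as the number of positive training sequences that contain feature f (with C_{+} the total number of positives), and the logarithm is taken base 2. The function and argument names are illustrative.

```python
import math


def entropy_term(probs):
    """Sum of p * log2(p), skipping zero probabilities."""
    return sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(labels, present):
    """IG of a binary feature; present[i] says whether sequence i has the feature."""
    n = len(labels)
    classes = sorted(set(labels))

    def class_probs(indices):
        if not indices:
            return []
        return [sum(1 for i in indices if labels[i] == c) / len(indices) for c in classes]

    all_idx = list(range(n))
    with_f = [i for i in all_idx if present[i]]
    without_f = [i for i in all_idx if not present[i]]
    ig = -entropy_term(class_probs(all_idx))
    ig += len(with_f) / n * entropy_term(class_probs(with_f))
    ig += len(without_f) / n * entropy_term(class_probs(without_f))
    return ig


def fitness(labels, present, positive=1):
    """Fraction of positives containing the feature, weighted by IG."""
    c_pos = sum(1 for y in labels if y == positive)   # assumes at least one positive
    c_pos_f = sum(1 for y, p in zip(labels, present) if y == positive and p)
    return (c_pos_f / c_pos) * information_gain(labels, present)


labels = [1, 1, 0, 0]
print(fitness(labels, [True, True, False, False]))   # feature separates the classes
print(fitness(labels, [True, False, True, False]))   # feature is uninformative
```

A feature that occurs in every positive sequence and in no negative one receives the highest score in this toy setting, while a feature that is independent of the class receives zero.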

Post FG-EA feature selection
 The set of features in the hall of fame can be further narrowed
through Recursive Feature Elimination (RFE). RFE starts with a
large feature set and gradually reduces it by removing the
least successful features until a stopping criterion is met.
 They employ RFE to estimate the impact of feature set size on
the precision and accuracy of the classification, and to compare
directly with existing work.
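The elimination loop can be sketched generically. The scoring callback and the stopping criterion (a target feature count) are assumptions for illustration; the slides do not say how features are ranked or when elimination stops.

```python
def recursive_feature_elimination(features, score_features, target_size, step=1):
    """Repeatedly drop the lowest-scoring features until `target_size` remain."""
    selected = list(features)
    while len(selected) > target_size:
        scores = score_features(selected)            # re-score the surviving set each round
        n_drop = min(step, len(selected) - target_size)
        ranked = sorted(selected, key=lambda f: scores[f])
        dropped = set(ranked[:n_drop])
        selected = [f for f in selected if f not in dropped]
    return selected


# Toy scores standing in for a real ranking, e.g. weights from a trained classifier.
toy_scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
kept = recursive_feature_elimination(
    list("abcd"), lambda sel: {f: toy_scores[f] for f in sel}, target_size=2
)
print(kept)
```

Re-scoring inside the loop matters in practice because a feature's usefulness can change once correlated features have been removed.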

Support vector machines as classifier
 SVMs are popular and successful in a wide variety of binary
classification problems.
 In this paper, they are applied in three steps:
 Map sequence data into a Euclidean vector space;
 Select a kernel function to map the vector space into a higher-dimensional and more
effective Euclidean space;
 Tune the kernel parameters and other SVM parameters to improve performance.
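The first two steps can be illustrated with a minimal sketch. It assumes that each sequence is mapped to a binary vector with one coordinate per hall-of-fame feature, and it uses the RBF kernel purely as a familiar example; the slides do not name the kernel. The predicates and the value of `gamma` are hypothetical.

```python
import math


def to_vector(seq, features):
    """Map a sequence to a binary vector: one coordinate per feature."""
    return [1.0 if feature(seq) else 0.0 for feature in features]


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel value for two equal-length vectors; gamma is a tunable parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical hall-of-fame features written as plain predicates.
features = [
    lambda s: "CAG" in s,
    lambda s: s.startswith("GT", 3),
    lambda s: "TTT" not in s,
]

x = to_vector("ACAGTA", features)
y = to_vector("TTTGGC", features)
print(x, y, rbf_kernel(x, y))
```

The third step, parameter tuning, would then amount to searching over values such as `gamma` and the SVM's regularization constant.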


Data sets
 Classification performance is compared with two groups of state-of-the-art
methods in splice site prediction: feature-based and kernel-based.
 Feature-based (FGA and GeneSplicer): data extracted from the 2005 NCBI RefSeq
collection of 5,057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov)
 The collection yields 51,008 positive sequences (containing splice sites) and 200,000 negative sequences
 The 25,504 acceptor and 25,504 donor sequences consist of 162 nucleotides each (80 upstream + AG|GT +
80 downstream)
 Kernel-based (WD and WDS): data extracted from the worm data set with EST
sequences (http://www.wormbase.org).
 The set contains 64,844 donor and 64,838 acceptor splice site sequences. Each sequence is 142
nucleotides long (60 + AG|GT + 80)
 1,777,912 sequences are centered around non-splice-site AG dinucleotides, and
2,846,598 sequences are centered around non-splice-site GT dinucleotides.
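The window layout described above (flank + dinucleotide + flank) can be made concrete with a short sketch. The function name and the scanning approach are illustrative assumptions; only the window sizes come from the slides.

```python
def candidate_windows(seq, dinucleotide="AG", upstream=80, downstream=80):
    """Return (index, window) for every occurrence with enough flanking sequence."""
    windows = []
    start = seq.find(dinucleotide, upstream)   # skip hits without a full upstream flank
    while start != -1:
        end = start + len(dinucleotide) + downstream
        if end <= len(seq):
            windows.append((start, seq[start - upstream:end]))
        start = seq.find(dinucleotide, start + 1)
    return windows


human_like = "C" * 80 + "AG" + "C" * 80
index, window = candidate_windows(human_like)[0]
print(index, len(window))    # 80 + 2 + 80 nucleotides

worm_like = "C" * 60 + "GT" + "C" * 80
print(len(candidate_windows(worm_like, "GT", upstream=60)[0][1]))   # 60 + 2 + 80 nucleotides
```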

Overview of conducted experiments
 Two sets of classification experiments are conducted
 Comparison of FG-EA with FGA and GeneSplicer
 The SVM is trained on 2/3 of the data and tested on the remaining 1/3.
 This process is repeated three times to obtain an average performance, with 30 different
sets of hall of fame features.
 The trained SVM is applied to classify the B2Hum testing data set.
 Comparison with the WD and WDS methods
 30 independent runs of FG-EA are employed, with SVM evaluation of the resulting features.
 40,000 sequences are sampled from the worm data set.
 Ten different subsets of 360,000 sequences are randomly sampled from the worm data set.
 The parameter values used can be found on the authors' website
(http://www.cs.gmu.edu/~ashehu/?q=OurTools).

Overview of conducted experiments
Performance measurements
 Performance is measured in terms of 11ptAVG, FPR, auROC (area under the ROC curve), and auPRC (area under the precision-recall curve).
 The 11ptAVG is the average of the precisions calculated at 11 recall values
{0%, 10%, …, 100%}
 Precision-recall curves (PRCs) are employed to show the ability of FG-EA to discriminate true splice sites from
other sequences.
 The false positive rate (FPR) is also computed at the recall values by varying the confidence threshold; the resulting
FPR-recall curves show that FG-EA makes very few mistakes.
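A sketch of the 11ptAVG computation. It assumes the common interpolated definition, where the precision at recall level r is the highest precision observed at any recall of at least r; the slides only state that precisions are averaged over the 11 recall values. Labels are 0/1 and higher scores mean more confident positive predictions.

```python
def eleven_point_average(labels, scores):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)            # assumes at least one positive example
    points = []                        # (recall, precision) after each ranked prediction
    tp = 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        points.append((tp / total_pos, tp / k))
    precisions = []
    for i in range(11):
        r = i / 10
        candidates = [p for rec, p in points if rec >= r - 1e-12]
        precisions.append(max(candidates, default=0.0))
    return sum(precisions) / 11


print(eleven_point_average([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))   # positives ranked first
print(eleven_point_average([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))   # positives ranked last
```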

Evaluation of fitness quality and convergence
Mean and maximum fitness values per generation (left: acceptor, right: donor), averaged over
30 independent GP runs. Error bars are standard deviations.

Performance on Human training data set
Precision versus recall on training data set
Precision values are plotted over recall points (left: acceptor, right: donor). Values are averages
over 30 FG-EA runs. Error bars are standard deviations.

Performance on B2Hum testing data set
Precision versus recall on testing data set
Precision over recall (left: acceptor, right: donor) is plotted for the B2Hum testing data set.

Performance on worm training data set
Precision over recall (left: acceptor, right: donor) is plotted for the 40K subset sampled from the
worm training data set.

Performance on worm training data set


Discussion
 They divide the hall of fame features into three subsets:
 All compositional features
 All region-specific compositional, positional, and correlational features
 All remaining features, including conjunctive and disjunctive features


Conclusions
 FG-EA outperforms state-of-the-art feature generation methods in splice site
classification.
 FG-EA reveals the significant role of novel complex conjunctive and
disjunctive features.
 The proposed FG-EA algorithm can easily be employed in other prediction
problems on biological sequences.
 Further extensions of the FG-EA can combine the evolution of features with
evolution of SVM kernels for greater classification accuracy.
 The authors plan to employ regular expressions to further combine features and reduce the bloat
in the expressions, and so improve readability and performance.

 Evolutionary algorithms and genetic programming are widely used in
bioinformatics, computational sciences, economics, chemistry, and other fields.
However, they also have some limitations:
 Operating on dynamic data sets is difficult.
 They cannot effectively solve problems in which the only fitness measure is a single
right/wrong measure.
 In some cases, other optimization algorithms can be more effective than an evolutionary
algorithm.
 Therefore, I think that FG-EA could be combined with other optimization
algorithms to get better results on the DNA splice site prediction problem.
