An Evolutionary Algorithm Approach for Feature
Generation from Sequence Data and Its
Application to DNA Splice Site Prediction
Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu,
Jack Compton, Rezarta Islamaj-Dogan.
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics,
Vol. 9, No. 5, September/October 2012.
Presenter: Nguyen Dinh Chien (阮庭戰)
An EA approach for FG from sequence data and
its application to DNA splice site prediction
• A challenge for machine learning methods is associating functional
information with biological sequences. Their performance often depends
on deriving predictive features from the sequences to be classified.
• Feature generation is a difficult problem. It is often the task of domain
experts or exhaustive feature enumeration techniques to generate a few
features, whose predictive power is then tested in the context of classification.
• Therefore, the authors propose an evolutionary algorithm (EA) to effectively
explore a large feature space and generate predictive features from
sequence data.
Outline
• Introduction
• Methods
• Materials
• Discussion
• Conclusions
Introduction
• A variety of general-purpose search techniques are effective for NP-hard
problems.
• In this paper, the authors explore the use of evolutionary algorithms to search
a large and complex feature space, with the goal of obtaining features
from sequence data that can significantly improve the classification
accuracy of a Support Vector Machine (SVM). This approach is
evaluated on the difficult problem of DNA splice site prediction.
• They use Genetic Programming (GP) techniques to evolve the structures
that represent features. The approach is called FG-EA, which stands for
Feature Generation with an Evolutionary Algorithm.
Introduction
• Using an efficient fitness function, they identify a set of candidate features (a
hall of fame) to be used as input to a standard SVM classification procedure.
• In comparison with state-of-the-art feature-based classification
methods, the FG-EA features significantly improve classification
performance.
The DNA splice site prediction problem
• Transcription of a eukaryotic DNA sequence yields mature messenger RNA (mRNA)
only after enzymes splice away noncoding regions (introns) from the
precursor (pre-mRNA) sequence, leaving only coding regions (exons). Prediction
of splice sites is therefore a fundamental component of the gene-finding
problem.
• Splice site prediction is a difficult problem. The dinucleotides AG (acceptor) and
GT (donor; GU in RNA) cannot by themselves be used as features, because they are
also abundant in non-splice-site sequences.
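To make the abundance argument concrete, the following minimal sketch counts how often AG and GT occur in a short sequence. The sequence is a made-up toy string, not real genomic data, and the helper name `count_dinucleotide` is hypothetical.

```python
def count_dinucleotide(seq: str, dinuc: str) -> int:
    """Count occurrences of a dinucleotide at every position of seq."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)


# Toy sequence invented for illustration only.
toy = "CCAGGTAAGTCTAGCAGGTTAGACGT"
ag_hits = count_dinucleotide(toy, "AG")
gt_hits = count_dinucleotide(toy, "GT")
print(f"AG occurs {ag_hits} times, GT occurs {gt_hits} times in {len(toy)} nt")
```

Even in 26 nucleotides both dinucleotides appear several times, so their mere presence cannot distinguish a true splice site from a decoy.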
Related EA work
• Many studies have demonstrated the advantages of EAs for feature generation in
different domains, such as fast genetic selection (F.A. Brill et al. 1992), nearest
neighbor classifiers (L.I. Kuncheva and L.C. Jain 1999), dimensionality reduction
(M.L. Raymer et al. 2007), …
• Methods that obtain predictive
features from sequence data have
shown success in diverse
bioinformatics problems.
• Work on predicting enzymatic activity
in proteins additionally shows the
power of EAs in feature generation.
Outline
• Introduction
• Methods
• Materials
• Discussion
• Conclusions
Methods
The FG-EA algorithm
Feature representation
• In FG-EA, each feature is a parse tree. The leaves, also referred to as terminals, are either
characters from the DNA alphabet {A,C,G,T} or integers corresponding to
positions or motif (k-mer) lengths; the internal nodes are the operators Length,
Position, Motif, Matches, MatchAsPosition, AND, OR, and NOT.
[Figure: example parse trees for positional features and compositional features]
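A minimal sketch of how such a parse tree could be evaluated on a sequence is shown here. It is not the authors' implementation: trees are encoded as nested tuples, and the operator names `Matches` (motif anywhere, a compositional feature) and `MatchesAt` (motif at a fixed 0-based offset, a positional feature) are simplified stand-ins chosen for illustration.

```python
def evaluate(node, seq: str) -> bool:
    """Evaluate a feature tree (nested tuples) on a DNA sequence."""
    op = node[0]
    if op == "Matches":        # compositional: motif occurs anywhere
        return node[1] in seq
    if op == "MatchesAt":      # positional: motif starts at a 0-based offset
        motif, pos = node[1], node[2]
        return seq.startswith(motif, pos)
    if op == "AND":            # conjunctive feature
        return evaluate(node[1], seq) and evaluate(node[2], seq)
    if op == "OR":             # disjunctive feature
        return evaluate(node[1], seq) or evaluate(node[2], seq)
    if op == "NOT":
        return not evaluate(node[1], seq)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical feature: contains GTAAG and does not start with CC.
feature = ("AND", ("Matches", "GTAAG"), ("NOT", ("MatchesAt", "CC", 0)))
print(evaluate(feature, "ACGTAAGT"))
```

Each evolved feature then acts as a boolean predicate over sequences, which is what later allows a sequence to be mapped to a vector of feature values.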
Generating features
• The initial population (generation 0) consists of N = 15,000 randomly
generated features.
• The trees representing features are generated using the well-known
ramped half-and-half generative method, which includes both the Full
and Grow techniques. Together they yield a mixture of fully
balanced trees and bushy trees; each technique is employed
with equal probability 0.5.
• Subsequent generations are evolved using standard GP (Genetic
Programming) selection, crossover, and mutation mechanisms.
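The ramped half-and-half idea can be sketched as follows. This is a toy version under stated assumptions: the function set (AND, OR, NOT) and terminal set are untyped placeholders rather than the FG-EA grammar, the depth range and the 0.3 early-stop probability in Grow are arbitrary, and names such as `gen_tree` are hypothetical.

```python
import random

TERMINALS = ["A", "C", "G", "T"]
FUNCTIONS = {"AND": 2, "OR": 2, "NOT": 1}  # operator name -> arity (toy set)


def gen_tree(depth, max_depth, full, rng):
    """Full: grow every branch to max_depth. Grow: may stop a branch early."""
    stop_early = not full and depth > 0 and rng.random() < 0.3
    if depth == max_depth or stop_early:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    children = tuple(gen_tree(depth + 1, max_depth, full, rng)
                     for _ in range(FUNCTIONS[name]))
    return (name,) + children


def ramped_half_and_half(n, min_depth, max_depth, seed=0):
    """Ramp the depth limit across the population; pick Full or Grow with p = 0.5."""
    rng = random.Random(seed)
    population = []
    for i in range(n):
        depth = min_depth + i % (max_depth - min_depth + 1)
        full = rng.random() < 0.5
        population.append(gen_tree(0, depth, full, rng))
    return population


def tree_depth(tree):
    """Depth of a tree: 0 for a leaf, 1 + deepest child otherwise."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])


population = ramped_half_and_half(n=20, min_depth=2, max_depth=4)
print(sorted({tree_depth(t) for t in population}))
```

Ramping the depth limit gives trees of several sizes, and mixing Full with Grow gives several shapes at each size, so the initial population covers the feature space more evenly than a single method would.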
Genetic Operators
• Given a set of m features
extracted from the hall of
fame to serve as parents in a
generation, the remaining
GenSize - m features are
generated using the mutation
and crossover operators.
• Three breeding pipelines
are employed:
  - mutation
  - mutation-ERC (mutation of ephemeral random constants)
  - crossover
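The breeding step can be sketched as below, assuming trees are nested tuples with string leaves. It is a simplification, not the authors' pipeline: mutation replaces a random subtree with a random terminal, ERC mutation is omitted, and the 0.5 crossover probability and all function names are hypothetical.

```python
import random


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))


def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree


def put(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (put(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(a, b, rng):
    """Subtree crossover: graft a random subtree of b into a random spot of a."""
    return put(a, rng.choice(list(paths(a))), get(b, rng.choice(list(paths(b)))))


def mutate(tree, rng, terminals=("A", "C", "G", "T")):
    """Simplified mutation: replace a random subtree with a random terminal."""
    return put(tree, rng.choice(list(paths(tree))), rng.choice(terminals))


def next_generation(hall_of_fame, gen_size, rng, p_crossover=0.5):
    """Keep the m parents, then fill the remaining gen_size - m slots by breeding."""
    children = list(hall_of_fame)
    while len(children) < gen_size:
        if rng.random() < p_crossover:
            a, b = rng.sample(hall_of_fame, 2)
            children.append(crossover(a, b, rng))
        else:
            children.append(mutate(rng.choice(hall_of_fame), rng))
    return children


parents = [("AND", "A", "C"), ("OR", ("NOT", "G"), "T")]
generation = next_generation(parents, gen_size=10, rng=random.Random(1))
print(len(generation))
```

Copying the hall of fame into the new generation first is what keeps the best features found so far from being lost to a disruptive crossover or mutation.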
Fitness function
• The fitness function is key to achieving an efficient and effective
EA search heuristic.
• FG-EA uses a surrogate fitness function given by:
$$\mathrm{Fitness}(f) = \frac{C_{+,f}}{C_{+}} \cdot IG(f)$$
• f: feature; the ratio $C_{+,f}/C_{+}$ is weighted by the information gain (IG)
• IG is often employed as a criterion of a feature's goodness in machine
learning. Given m classes $c_1, \dots, c_m$:
$$IG(f) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(f)\sum_{i=1}^{m} P(c_i \mid f)\log P(c_i \mid f) + P(\bar{f})\sum_{i=1}^{m} P(c_i \mid \bar{f})\log P(c_i \mid \bar{f})$$
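A small sketch of this surrogate fitness, the ratio C+,f/C+ multiplied by the information gain of f, is given below. Several points are assumptions rather than statements from the slides: C+,f is taken to be the number of positive sequences in which feature f is present, C+ the total number of positive sequences, the logarithm is base 2, and the data set contains at least one positive example.

```python
import math


def plogp_sum(probs):
    """Sum of p * log2(p) over nonzero probabilities (a negative entropy)."""
    return sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(has_feature, labels):
    """IG(f) for a boolean feature, following the three-term formula."""
    classes = sorted(set(labels))

    def class_probs(rows):
        return [rows.count(c) / len(rows) for c in classes] if rows else []

    with_f = [y for h, y in zip(has_feature, labels) if h]
    without_f = [y for h, y in zip(has_feature, labels) if not h]
    n = len(labels)
    ig = -plogp_sum(class_probs(list(labels)))
    ig += len(with_f) / n * plogp_sum(class_probs(with_f))
    ig += len(without_f) / n * plogp_sum(class_probs(without_f))
    return ig


def fitness(has_feature, labels, positive=1):
    """Fraction of positives covered by the feature, weighted by IG."""
    c_pos = sum(1 for y in labels if y == positive)
    c_pos_f = sum(1 for h, y in zip(has_feature, labels) if h and y == positive)
    return (c_pos_f / c_pos) * information_gain(has_feature, labels)


labels = [1, 1, 0, 0]
print(fitness([True, True, False, False], labels))   # perfectly separating feature
print(fitness([True, False, True, False], labels))   # uninformative feature
```

Because the score needs only counts over the training data, it can be computed for every candidate feature without training a classifier, which is what makes it a cheap surrogate for classification performance.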
Post FG-EA feature selection
• The set of features in the hall of fame can be further narrowed
through Recursive Feature Elimination (RFE). RFE starts with a
large feature set and gradually reduces it by removing the
least successful features until a stopping criterion is met.
• The authors employ RFE to estimate the impact of feature set size on
the precision and accuracy of the classification, and to compare
directly with existing work.
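The RFE loop itself is simple and can be sketched generically. In this sketch the scoring function is a hypothetical placeholder supplied by the caller; in SVM-based RFE it would typically retrain a classifier on the remaining features and rank them by their weights, which is not shown here. The stopping criterion is assumed to be a target feature count.

```python
def recursive_feature_elimination(features, score_fn, target_size, drop_per_round=1):
    """Repeatedly drop the lowest-scoring features until target_size remain.

    score_fn receives the list of remaining features and returns one score
    per feature, recomputed each round (a stand-in for retraining a model).
    """
    remaining = list(features)
    while len(remaining) > target_size:
        scores = score_fn(remaining)
        ranked = sorted(zip(scores, remaining), key=lambda pair: pair[0])
        n_drop = min(drop_per_round, len(remaining) - target_size)
        dropped = {name for _, name in ranked[:n_drop]}
        remaining = [f for f in remaining if f not in dropped]
    return remaining


# Toy, fixed importance values invented for illustration.
toy_weights = {"f1": 0.9, "f2": 0.1, "f3": 0.5, "f4": 0.05}
selected = recursive_feature_elimination(
    list(toy_weights), lambda feats: [toy_weights[f] for f in feats], target_size=2
)
print(selected)
```

Running the loop for several target sizes gives the kind of size-versus-performance comparison described above.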
Support vector machines as classifier
• SVMs are popular and successful in a wide variety of binary
classification problems.
• In this paper, the SVM is used in three steps:
  - Map the sequence data into a Euclidean vector space;
  - Select a kernel function to map the vector space into a higher-dimensional and more
effective Euclidean space;
  - Tune the parameters of the kernel and the other SVM parameters to improve performance.
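The first two steps can be illustrated with a small sketch. The three features are hypothetical predicates invented for this example, and the RBF kernel with gamma = 0.5 is only an assumed, commonly used choice; the slides do not state which kernel or parameter values the authors selected.

```python
import math


def to_vector(seq, features):
    """Step 1: map a sequence to a binary vector, one entry per feature."""
    return [1.0 if feature(seq) else 0.0 for feature in features]


def rbf_kernel(x, y, gamma=0.5):
    """Step 2: an RBF kernel, exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical features standing in for hall-of-fame features.
features = [
    lambda s: "GTAAG" in s,           # motif present anywhere
    lambda s: s.startswith("AG", 3),  # motif at a fixed 0-based offset
    lambda s: "TTT" not in s,         # negated motif
]

x = to_vector("CCCAGGTAAGT", features)
y = to_vector("TTTTTTTTTTT", features)
print(x, y, rbf_kernel(x, y))
```

Step 3, tuning gamma and the other SVM parameters, would then be carried out with a search over candidate values on held-out data.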
Outline
• Introduction
• Methods
• Materials
• Discussion
• Conclusions
Data sets
• Classification performance is compared against two groups of state-of-the-art
methods in splice site prediction: feature-based and kernel-based.
• Feature-based (FGA and GeneSplicer): data are extracted from the 2005 NCBI RefSeq
collection of 5,057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov).
  - The collection yields 51,008 positive sequences (containing splice sites) and 200,000 negative sequences.
  - The 25,504 acceptor and 25,504 donor sequences consist of 162 nucleotides each (80 upstream + AG|GT +
80 downstream).
• Kernel-based (WD and WDS): data are extracted from the worm data set with EST
sequences (http://www.wormbase.org).
  - There are 64,844 donor and 64,838 acceptor splice site sequences. Each sequence is 142
nucleotides long (60 + AG|GT + 80).
  - 1,777,912 sequences are centered around non-splice-site AG dinucleotides, and
2,846,598 sequences are centered around non-splice-site GT dinucleotides.
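The window layout (upstream flank + dinucleotide + downstream flank) can be sketched as follows. This is an illustration of the arithmetic only: the function name is hypothetical, the input is a synthetic repeat rather than RefSeq or worm data, and deciding whether a window is a true splice site requires annotation that is not modelled here.

```python
def candidate_windows(seq, dinuc, upstream, downstream):
    """Return every window of `upstream` nt + dinuc + `downstream` nt in seq."""
    windows = []
    # i is the 0-based start of the dinucleotide; both flanks must fit in seq.
    for i in range(upstream, len(seq) - downstream - 1):
        if seq[i:i + 2] == dinuc:
            windows.append(seq[i - upstream:i + 2 + downstream])
    return windows


# Synthetic 200-nt sequence used purely to check the window arithmetic.
synthetic = "ACGT" * 50
human_style = candidate_windows(synthetic, "GT", upstream=80, downstream=80)
worm_style = candidate_windows(synthetic, "GT", upstream=60, downstream=80)
print(len(human_style[0]), len(worm_style[0]))
```

With 80 + 2 + 80 the windows are 162 nucleotides long, and with 60 + 2 + 80 they are 142 nucleotides long, matching the two data sets.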
Overview of conducted experiments
• Two sets of classification experiments are conducted:
  - Comparison of FG-EA with FGA and GeneSplicer
    - The SVM is trained on 2/3 of the data and tested on the remaining 1/3.
    - This process is repeated three times to obtain an average performance, with 30 different
sets of hall of fame features.
    - The trained SVM is applied to classify the B2Hum testing data set.
  - Comparison with the WD and WDS methods
    - 30 independent runs of FG-EA are employed, followed by SVM evaluation of the resulting features.
    - 40,000 sequences are sampled from the worm data set.
    - Ten different subsets of 360,000 sequences are randomly sampled from the worm data set.
• The parameter values used can be found on the authors' website
(http://www.cs.gmu.edu/~ashehu/?q=OurTools).
Overview of conducted experiments
Performance measurements
• Performance is measured in terms of 11ptAVG, FPR (false positive rate), auROC (area under
the ROC curve), and auPRC (area under the precision-recall curve).
  - The 11ptAVG is the average of the precisions calculated at 11 recall values
{0%,10%,…,100%}.
  - Precision-recall curves (PRCs) are employed to show the ability of FG-EA to discriminate true splice sites from
other sequences.
  - FPR is also computed at each recall value by varying the confidence threshold; the resulting
FPR-recall curves show that FG-EA makes very few mistakes.
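The 11ptAVG measure can be sketched as below. One detail is an assumption: precision at each recall level is taken as the highest precision reached at that recall or beyond, which is the usual interpolated convention; the slides do not say whether the authors interpolate. Labels are 1 for a true splice site and 0 otherwise, and the scores are toy values.

```python
def eleven_point_average(scores, labels):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(labels)
    points = []  # (recall, precision) after each item in the ranking
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        points.append((true_positives / total_positives, true_positives / rank))
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0) for level in levels
    ]
    return sum(interpolated) / len(levels)


# Toy classifier scores invented for illustration.
print(eleven_point_average([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A ranking that places every true site above every decoy scores 1.0, and each decoy ranked above a true site lowers the average.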
Evaluation of fitness quality and convergence
Mean and maximum fitness values per generation (left: acceptor, right: donor), averaged over
30 independent GP runs. Error bars are standard deviations.
Performance on Human training data set
Precision versus recall on training data set
Precision values are plotted over recall points (left: acceptor, right: donor). Values are averages
over 30 FG-EA runs. Error bars are standard deviations.
Performance on B2Hum testing data set
Precision versus recall on testing data set
Precision over recall (left: acceptor, right: donor) is plotted for the B2Hum testing data set.
Performance on worm training data set
Precision over recall (left: acceptor, right: donor) is plotted for the 40K subset sampled from the
worm training data set.
Performance on worm training data set
Outline
• Introduction
• Methods
• Materials
• Discussion
• Conclusions
Discussion
• The authors divide the hall of fame
features into three types of subsets:
  - All compositional features
  - All region-specific compositional,
positional, and correlational features
  - All remaining features, including
conjunctive and disjunctive
features
Outline
• Introduction
• Methods
• Materials
• Discussion
• Conclusions
Conclusions
• FG-EA outperforms state-of-the-art feature generation methods in splice site
classification.
• FG-EA reveals the significant role of novel, complex conjunctive and
disjunctive features.
• The proposed FG-EA algorithm can easily be employed in other prediction
problems on biological sequences.
• Further extensions of FG-EA can combine the evolution of features with the
evolution of SVM kernels for greater classification accuracy.
• The authors plan to employ regular expressions to further combine features and reduce bloat
in the expressions, and so improve readability and performance.
• Evolutionary algorithms and genetic programming are widely used in
bioinformatics, computational sciences, economics, chemistry, and other fields.
However, they also have some limitations:
  - Operating on dynamic data sets is difficult.
  - They cannot effectively solve problems in which the only fitness measure is a single
right/wrong measure.
  - In some cases, other optimization algorithms can be more effective than an evolutionary
algorithm.
• Therefore, I think that we can combine the FG-EA algorithm with other optimization
algorithms to get better results on the DNA splice site prediction problem.
Thank you for listening!
