Title: An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction.
Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu, Jack Compton, and Rezarta Islamaj-Dogan.
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5, September/October 2012.
Speaker: Nguyen Dinh Chien (阮庭戰), Student ID: P0126557.

Sequence-based classification aims to discover signals, or features, hidden in sequence data that carry predictive information. Such features correlate with the sought property and discriminate between sequences that contain the property and those that do not. Reduction techniques, such as Information Gain, Chi-Square, Mutual Information, and KL-distance, are additionally employed to further reduce the size of the feature set. It is important to propose feature generation methods that are not limited by biological insight, the considered type of feature, or the ability to enumerate features.

Transcription of a eukaryotic DNA sequence into mRNA occurs only after enzymes splice away noncoding regions (introns) from the pre-mRNA sequence to leave only coding regions (exons), so prediction of splice sites is a fundamental component of the gene-finding problem.

Splice site prediction is a difficult problem. The dinucleotides AG and GT (GU) cannot be used as features on their own because they are abundant in non-splice-site sequences.

Many studies have demonstrated the advantages of evolutionary algorithms (EAs) for feature generation in different domains, such as Fast Genetic Selection, Nearest Neighbor Classifier… Methods that obtain predictive features from sequence data have shown success in diverse bioinformatics problems. The common structure of an evolutionary algorithm (EA) is shown in the following figure.

In this study, the authors explored the use of evolutionary algorithms to search a large and complex feature space. They obtained features from sequence data that can significantly improve the classification accuracy of a Support Vector Machine (SVM).
This approach is called FG-EA (Feature Generation with an Evolutionary Algorithm), and it uses Genetic Programming (GP) techniques to evolve the kind of structures illustrated in the following figure.
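
As a rough illustration of such tree-structured features, the sketch below encodes motif leaves combined by AND/OR nodes and checks them against a sequence. The class names `Motif`, `And`, `Or`, the helper `to_vector`, and the 0-based position convention are assumptions made for illustration only; they are not the authors' GP representation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Motif:
    """Leaf node: a k-mer expected at a fixed position (0-based, an assumption)."""
    kmer: str
    position: int

    def matches(self, seq: str) -> bool:
        # str.startswith with a start offset tests the k-mer at that position.
        return seq.startswith(self.kmer, self.position)


@dataclass(frozen=True)
class And:
    """Inner node: conjunctive feature, true only if both children match."""
    left: "Feature"
    right: "Feature"

    def matches(self, seq: str) -> bool:
        return self.left.matches(seq) and self.right.matches(seq)


@dataclass(frozen=True)
class Or:
    """Inner node: disjunctive feature, true if either child matches."""
    left: "Feature"
    right: "Feature"

    def matches(self, seq: str) -> bool:
        return self.left.matches(seq) or self.right.matches(seq)


Feature = Union[Motif, And, Or]


def to_vector(seq: str, features: List[Feature]) -> List[int]:
    """Turn one sequence into a binary feature vector (1 = feature present)."""
    return [1 if f.matches(seq) else 0 for f in features]


# Hypothetical example: GTT at position 30 AND GTT at position 36.
example_seq = "A" * 30 + "GTT" + "CCC" + "GTT" + "A" * 10
conjunctive = And(Motif("GTT", 30), Motif("GTT", 36))
example_vector = to_vector(
    example_seq,
    [conjunctive, Motif("GTT", 0), Or(Motif("GTT", 0), Motif("CCC", 33))],
)
```

Vectors produced this way are the kind of input an SVM could then be trained on; the binary encoding itself is a simplification chosen for the sketch.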
Compared with state-of-the-art feature-based classification methods, FG-EA features significantly improve classification performance. The FG-EA algorithm generates complex features represented internally as GP trees and evaluates them on splice site training data using a surrogate fitness function.
- The features in the hall of fame transform input sequence data into feature vectors.
- An SVM operating over the feature vectors finally allows the accuracy of the resulting classifier to be evaluated.

The authors employ the scheme in this diagram to predict DNA splice sites. The top features obtained after the exploration of the feature space with FG-EA allow input sequences to be transformed into feature vectors on which an SVM classifier can then operate.

The above diagram shows the main steps of the FG-EA algorithm. Features (individuals) are evolved until a maximum number of generations, Gen_Max, has been reached. The mutation and crossover operators detailed below are employed to obtain new features in a generation. The top features of a generation are added to a growing hall of fame, which in turn contributes randomly selected features to seed the next generation.

The tree in the figure represents the feature "GTT with length=3 in position 30 AND GTT with length=3 in position 36." Using an efficient fitness function, the authors identified a set of candidate features (a hall of fame) to be used as input to a standard SVM classification procedure.
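
The generational loop with a hall of fame described above can be sketched roughly as follows. This is a toy under stated assumptions, not the authors' implementation: the function `evolve`, its parameter names, the 50/50 choice between crossover and mutation, and the string-based toy individuals are all hypothetical, and real FG-EA individuals are GP trees scored by a surrogate fitness.

```python
import random


def evolve(init_population, fitness, mutate, crossover,
           gen_max=5, top_l=3, seeds_m=2, rng=None):
    """Evolve individuals for gen_max generations and return the hall of fame."""
    if rng is None:
        rng = random.Random(0)
    pop_size = len(init_population)
    population = list(init_population)
    hall_of_fame = []
    for _ in range(gen_max):
        # The top_l fittest individuals of this generation join the hall of fame.
        ranked = sorted(population, key=fitness, reverse=True)
        for individual in ranked[:top_l]:
            if individual not in hall_of_fame:
                hall_of_fame.append(individual)
        # Seed the next generation with randomly chosen hall-of-fame members.
        parents = rng.sample(hall_of_fame, min(seeds_m, len(hall_of_fame)))
        # Fill the rest of the generation using mutation and crossover.
        children = []
        while len(parents) + len(children) < pop_size:
            if len(parents) >= 2 and rng.random() < 0.5:
                a, b = rng.sample(parents, 2)
                children.append(crossover(a, b, rng))
            else:
                children.append(mutate(rng.choice(parents), rng))
        population = parents + children
    return hall_of_fame


# Toy individuals: 3-mers scored by similarity to "GTT" (illustration only).
ALPHABET = "ACGT"


def toy_fitness(kmer):
    return sum(x == y for x, y in zip(kmer, "GTT"))


def toy_mutate(kmer, rng):
    i = rng.randrange(len(kmer))
    return kmer[:i] + rng.choice(ALPHABET) + kmer[i + 1:]


def toy_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


toy_hall = evolve(["AAA", "CCC", "ACG", "TTT"], toy_fitness,
                  toy_mutate, toy_crossover,
                  gen_max=10, top_l=2, seeds_m=2, rng=random.Random(1))
```

Because members are only ever added to the hall of fame, a fit individual found early cannot be lost to later mutation or crossover.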
In this paper, the authors proposed several types of features: compositional features, positional features, correlational features, and conjunctive and disjunctive features.

The initial population (generation 0) consists of N = 15,000 randomly generated features. The trees representing features are generated using the well-known ramped half-and-half generative method, which includes both the Full and Grow techniques. These techniques yield a mixture of fully balanced trees and bushy trees, with each technique employed with equal probability of 0.5. Subsequent generations are evolved using standard GP (Genetic Programming) selection, crossover, and mutation mechanisms. Given a set of m features extracted from the hall of fame to serve as parents in a generation, the remaining GenSize - m features are generated using the mutation and crossover operators. The authors employed three breeding pipelines: mutation-ERC, mutation, and crossover.

The fitness function is key to achieving an efficient and effective EA search heuristic. FG-EA uses a surrogate fitness function given by:

$$\mathrm{Fitness}(f) = \frac{C_{+,f}}{C_{+}} \cdot IG(f)$$

where f is a feature and the ratio C_{+,f}/C_{+} is weighted by the information gain IG(f). Given m class attributes c_1, ..., c_m:

$$IG(f) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(f)\sum_{i=1}^{m} P(c_i \mid f)\log P(c_i \mid f) + P(\bar{f})\sum_{i=1}^{m} P(c_i \mid \bar{f})\log P(c_i \mid \bar{f})$$

The ℓ fittest individuals of a generation are added to a hall of fame, which keeps the fittest individuals of each generation. Maintaining a hall of fame guarantees that fit individuals will not be lost or changed. The hall of fame serves two purposes: maintaining diversity in the solution space and guaranteeing optimal performance.

In this study, the authors used the ℓ = 250 fittest individuals of a generation, and each generation seeds its population with m = 100 randomly chosen individuals from the current set of features in the hall of fame.

The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination (RFE).
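
The surrogate fitness and information gain formulas given above can be sketched numerically as follows. Several points are assumptions made for this sketch: C_{+,f} is taken to be the number of positive training sequences containing the feature and C_{+} the total number of positive training sequences, the logarithm is taken base 2, and the function names `information_gain` and `surrogate_fitness` are hypothetical.

```python
import math


def plogp_sum(probs):
    """Return sum of p * log2(p), skipping zero probabilities."""
    return sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(labels, present):
    """IG of a binary feature.

    labels:  list of class ids, one per training sequence.
    present: list of booleans, True if the feature occurs in that sequence.
    """
    n = len(labels)
    classes = sorted(set(labels))
    # First term: -sum_i P(c_i) log P(c_i), the class entropy.
    ig = -plogp_sum([labels.count(c) / n for c in classes])
    # Remaining terms: P(f) and P(not f) times the conditional sums.
    for value in (True, False):
        subset = [c for c, p in zip(labels, present) if p == value]
        if subset:
            weight = len(subset) / n
            ig += weight * plogp_sum(
                [subset.count(c) / len(subset) for c in classes])
    return ig


def surrogate_fitness(labels, present, positive=1):
    """(C_{+,f} / C_{+}) * IG(f), under the assumed meaning of the counts."""
    c_pos = sum(1 for c in labels if c == positive)
    c_pos_f = sum(1 for c, p in zip(labels, present) if c == positive and p)
    if c_pos == 0:
        return 0.0
    return (c_pos_f / c_pos) * information_gain(labels, present)


# Toy data: a feature that separates the classes perfectly, and one that does not.
toy_labels = [1, 1, 0, 0]
perfect = surrogate_fitness(toy_labels, [True, True, False, False])
useless = surrogate_fitness(toy_labels, [True, False, True, False])
```

A feature that occurs in every positive sequence and in no negative one reaches the class entropy as its information gain, while a feature distributed identically across classes scores zero.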
RFE starts with a large feature set and gradually reduces it by removing the least successful features until a stopping criterion is met. The authors employed RFE to estimate the impact of feature set size on the precision and accuracy of the classification, and to compare directly with existing work.

SVMs are popular and successful in a wide variety of binary classification problems. First, the authors mapped the sequence data into a Euclidean vector space. Second, they selected a kernel function to map the vector space into a higher-dimensional and more effective Euclidean space. Finally, they tuned the kernel parameters and other SVM parameters to improve performance.

The classification performance is compared with two different groups of state-of-the-art methods in splice site prediction: feature-based and kernel-based.
- Feature-based (FGA and GeneSplicer): the data are extracted from the 2005 NCBI RefSeq collection of 5057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov). This collection is used to extract 51008 positive sequences (containing splice sites; 25504 acceptor and 25504 donor) and 200000 negative sequences. The acceptor and donor sequences consist of 162 nucleotides each (80 upstream + AG|GT + 80 downstream).
- Kernel-based (WD and WDS): the data are extracted from the worm data set with EST sequences (http://www.wormbase.org). In this group, the authors used 64844 donor and 64838 acceptor splice site sequences. Each sequence is 142 nucleotides long (60 + AG|GT + 80).
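
The window layout quoted above (upstream flank + AG|GT + downstream flank) can be illustrated with a short sketch. The helper names `extract_window` and `candidate_windows` and the 0-based indexing are assumptions for illustration; the sketch only shows how the 162- and 142-nucleotide lengths arise, not how the original data sets were built.

```python
def extract_window(sequence, site_index, dinucleotide="AG",
                   upstream=80, downstream=80):
    """Return upstream + dinucleotide + downstream around site_index, or None.

    site_index is the 0-based position of the first base of the dinucleotide.
    """
    start = site_index - upstream
    end = site_index + len(dinucleotide) + downstream
    if start < 0 or end > len(sequence):
        return None  # not enough flanking sequence
    if sequence[site_index:site_index + len(dinucleotide)] != dinucleotide:
        return None  # no AG/GT at the requested position
    return sequence[start:end]


def candidate_windows(sequence, dinucleotide="AG", upstream=80, downstream=80):
    """Collect (index, window) for every occurrence with full flanks."""
    windows = []
    index = sequence.find(dinucleotide)
    while index != -1:
        window = extract_window(sequence, index, dinucleotide,
                                upstream, downstream)
        if window is not None:
            windows.append((index, window))
        index = sequence.find(dinucleotide, index + 1)
    return windows


# Human-style acceptor window: 80 + 2 + 80 = 162 nucleotides.
human_like = "C" * 80 + "AG" + "T" * 80
human_window = extract_window(human_like, 80, "AG")

# Worm-style donor window: 60 + 2 + 80 = 142 nucleotides.
worm_like = "C" * 60 + "GT" + "T" * 80
worm_window = extract_window(worm_like, 60, "GT", upstream=60, downstream=80)
```

Every occurrence of the dinucleotide yields a candidate window, which is one way to see why AG and GT alone are poor features: most such candidates are not real splice sites.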
Two sets of classification experiments are conducted: one compares the performance of FG-EA with FGA and GeneSplicer, and the other compares it with the WD and WDS methods. The parameter values used in the experiments can be found on the website http://www.cs.gmu.edu/~ashehu/?q=OurTools. The authors measure performance in terms of 11ptAVG, FPR, auROC, and auPRC. For example, the following figure plots FPR over recall (left: acceptor, right: donor) for the B2Hum testing data set.

In the two following tables, the authors compare auROC and auPRC values on 40K sequences sampled (left) and on 10 different sets of 360K sequences sampled (right) from the worm data sets.

They divide the hall of fame features into three types of subsets. The first subset contains all compositional features; the second contains all region-specific compositional, positional, and correlational features; and the third contains all remaining features, including conjunctive and disjunctive features. The right-hand table shows the IG (information gain) sums of the subsets of features evaluated over acceptor and donor data.

FG-EA outperforms state-of-the-art feature generation methods in splice site classification. FG-EA reveals the significant role of novel complex conjunctive and disjunctive features. The proposed FG-EA algorithm can easily be employed in other prediction problems on biological sequences. Further extensions of FG-EA can combine the evolution of features with the evolution of SVM kernels for greater classification accuracy. The authors also plan to employ regular expressions to further combine the expressions and reduce their bloat, and so improve readability and performance.