P0126557 slides


Published in: Technology, Education

  1. An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction. Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu, Jack Compton, Rezarta Islamaj-Dogan. Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5, September/October 2012. Presenter: Nguyen Dinh Chien (阮庭戰)
  2. An EA approach for FG from sequence data and its application to DNA splice site prediction. A challenge for machine learning methods is associating functional information with biological sequences. Their performance often depends on deriving predictive features from the sequences to be classified. Feature generation is a difficult problem. It is often the task of domain experts or exhaustive feature enumeration techniques to generate a few features, whose predictive power is then tested in the context of classification. The authors therefore propose an evolutionary algorithm (EA) to explore a large feature space effectively and generate predictive features from sequence data.
  3. Outline: Introduction, Methods, Materials, Discussion, Conclusions
  4. Introduction. A variety of general-purpose search techniques are effective for NP-hard problems. In this paper, the authors explore the use of evolutionary algorithms to search a large and complex feature space, with the goal of obtaining features from sequence data that significantly improve the classification accuracy of a Support Vector Machine (SVM). The approach is evaluated on the difficult problem of DNA splice site prediction. They use Genetic Programming (GP) techniques to evolve the tree structures that represent features. The approach is called FG-EA, which stands for Feature Generation with an Evolutionary Algorithm.
  5. Introduction (continued). Using an efficient fitness function, the authors identify a set of candidate features (a hall of fame) to be used as input to a standard SVM classification procedure. In comparison with state-of-the-art feature-based classification methods, FG-EA features significantly improve classification performance.
  6. The DNA splice site prediction problem. A eukaryotic DNA sequence yields mature messenger RNA (mRNA) only after enzymes splice away the noncoding regions (introns) from the precursor (pre-mRNA) sequence, leaving only the coding regions (exons). Prediction of splice sites is therefore a fundamental component of the gene-finding problem. Splice site prediction is difficult: the dinucleotides AG and GT (GU in RNA) cannot be used as features on their own because they are also abundant in non-splice-site sequences.
  7. Related EA work. Many studies have demonstrated the advantages of EAs for feature generation in different domains, such as fast genetic selection (F.A Brill et al. 1992), nearest neighbor classifiers (L.I Kuncheva and L.C Jain. 1999), and dimensionality reduction (M.L Raymer et al. 2007). Methods that obtain predictive features from sequence data have also shown success in diverse bioinformatics problems. Work on predicting enzymatic activity in proteins additionally shows the power of EAs in feature generation.
  8. Outline: Introduction, Methods, Materials, Discussion, Conclusions
  9. Methods
  10. The FG-EA algorithm
  11. Feature representation. In FG-EA, each feature is represented as a parse tree. The leaves of the tree, also referred to as terminals, are either characters from the DNA alphabet {A, C, G, T} or integers corresponding to positions or motif (k-mer) lengths; the internal nodes are the operators Length, Position, Motif, Matches, MatchAsPosition, AND, OR, and NOT. [Figure: example parse trees for positional features and compositional features.]
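To make the tree representation concrete, here is a minimal sketch of how such a feature could be evaluated on a sequence. The nested-tuple encoding, the function name `evaluate`, and the exact semantics given to the `matches` and `matches_at` nodes are assumptions made for illustration; the slides only name the operators and do not define them.

```python
# Hypothetical tuple encoding of FG-EA-style feature trees (not the authors' code).
# ("matches", motif)              -> compositional: motif occurs anywhere
# ("matches_at", motif, position) -> positional: motif occurs at a 0-based position
# ("and", a, b), ("or", a, b), ("not", a) -> Boolean combinations of sub-features

def evaluate(feature, sequence):
    """Return True if the feature tree holds for the given DNA sequence."""
    op = feature[0]
    if op == "matches":
        return feature[1] in sequence
    if op == "matches_at":
        motif, position = feature[1], feature[2]
        return sequence.startswith(motif, position)
    if op == "and":
        return evaluate(feature[1], sequence) and evaluate(feature[2], sequence)
    if op == "or":
        return evaluate(feature[1], sequence) or evaluate(feature[2], sequence)
    if op == "not":
        return not evaluate(feature[1], sequence)
    raise ValueError(f"unknown operator: {op!r}")

compositional = ("matches", "CAG")
positional = ("matches_at", "GT", 4)
conjunctive = ("and", compositional, ("not", positional))
```

Each feature evaluates to a Boolean per sequence, so a set of such trees can be turned into a binary feature vector for a classifier.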
  12. Generating features. The initial population (generation 0) consists of N = 15,000 randomly generated features. The trees representing features are generated using the well-known ramped half-and-half method, which combines the Full and Grow techniques. Each technique is employed with equal probability 0.5, which yields a mixture of fully balanced trees and bushy trees. Subsequent generations are evolved using the standard GP (Genetic Programming) selection, crossover, and mutation mechanisms.
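The ramped half-and-half initialization can be sketched as follows. This is an illustration only: the primitive sets `FUNCTIONS` and `TERMINALS`, the depth range, and the early-stopping probability used by Grow are assumptions, and the sketch ignores any type constraints that the real FG-EA operators would impose.

```python
import random

# Assumed toy primitive sets; the real FG-EA sets are richer.
FUNCTIONS = {"and": 2, "or": 2, "not": 1}   # operator name -> arity
TERMINALS = ["A", "C", "G", "T"]

def random_tree(max_depth, method, rng):
    """Build a random tree as nested tuples with the Full or Grow method."""
    if max_depth == 0:
        return rng.choice(TERMINALS)
    # Grow may stop early by picking a terminal; Full never does.
    p_terminal = len(TERMINALS) / (len(TERMINALS) + len(FUNCTIONS))
    if method == "grow" and rng.random() < p_terminal:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    children = tuple(random_tree(max_depth - 1, method, rng) for _ in range(FUNCTIONS[op]))
    return (op,) + children

def ramped_half_and_half(pop_size, min_depth, max_depth, rng):
    """Ramp the depth limit over a range and pick Full or Grow with probability 0.5."""
    depths = range(min_depth, max_depth + 1)
    population = []
    for i in range(pop_size):
        depth_limit = depths[i % len(depths)]
        method = "full" if rng.random() < 0.5 else "grow"
        population.append(random_tree(depth_limit, method, rng))
    return population

def depth(tree):
    """Depth of a nested-tuple tree; a bare terminal has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(depth(child) for child in tree[1:])

population = ramped_half_and_half(pop_size=20, min_depth=1, max_depth=4, rng=random.Random(0))
```

Full always extends every branch to the depth limit, while Grow may terminate branches early, so mixing the two over a range of depth limits gives an initial population with varied shapes and sizes.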
  13. Genetic operators. Given a set of m features extracted from the hall of fame to serve as parents in a generation, the remaining GenSize - m features are generated using the mutation and crossover operators. Three breeding pipelines are employed: mutation, mutation-ERC (ephemeral random constant), and crossover.
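A rough sketch of how a generation might be filled from m hall-of-fame parents is given below. The nested-tuple tree encoding, the helper names, the leaf-only mutation, and the crossover probability `p_crossover` are all assumptions for illustration; the mutation-ERC pipeline and any type checks on the resulting trees are omitted.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node of a nested-tuple tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace_at(tree, path, new_subtree):
    """Return a copy of tree with the node at path replaced by new_subtree."""
    if not path:
        return new_subtree
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new_subtree),) + tree[i + 1:]

def crossover(parent_a, parent_b, rng):
    """Subtree crossover: graft a random subtree of parent_b into parent_a."""
    path, _ = rng.choice(list(subtrees(parent_a)))
    _, donor = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path, donor)

def mutate(parent, rng, terminals=("A", "C", "G", "T")):
    """Leaf mutation: swap one terminal for a randomly chosen one."""
    leaves = [p for p, t in subtrees(parent) if isinstance(t, str)]
    return replace_at(parent, rng.choice(leaves), rng.choice(terminals))

def next_generation(parents, gen_size, rng, p_crossover=0.5):
    """Keep the m parents and breed the remaining gen_size - m features."""
    children = []
    while len(parents) + len(children) < gen_size:
        if rng.random() < p_crossover:
            children.append(crossover(*rng.sample(parents, 2), rng))
        else:
            children.append(mutate(rng.choice(parents), rng))
    return list(parents) + children
```

Trees are immutable tuples here, so every operator returns a new tree and the hall-of-fame parents are carried over unchanged.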
  14. Fitness function. The fitness function is key to achieving an efficient and effective EA search heuristic. FG-EA uses a surrogate fitness function given by
      $$\mathrm{Fitness}(f) = \frac{C_{+,f}}{C_{+}} \cdot IG(f)$$
      where f is a feature and the ratio C_{+,f}/C_{+} is weighted by the information gain IG(f). IG is often employed as a criterion of a feature's goodness in machine learning. Given m class attributes c_1, ..., c_m:
      $$IG(f) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(f)\sum_{i=1}^{m} P(c_i \mid f)\log P(c_i \mid f) + P(\bar{f})\sum_{i=1}^{m} P(c_i \mid \bar{f})\log P(c_i \mid \bar{f})$$
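The surrogate fitness can be computed directly from counts, as in the sketch below. It assumes that C_{+,f} is the number of positive training sequences containing feature f and that C_{+} is the total number of positive training sequences (the slide does not define these symbols), and it uses base-2 logarithms, which the slide does not specify.

```python
import math

def entropy_term(probabilities):
    """Sum of p * log2(p), treating 0 * log(0) as 0."""
    return sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(has_feature, labels):
    """IG of a Boolean feature over class labels, following the slide's formula."""
    n = len(labels)
    classes = sorted(set(labels))
    ig = -entropy_term([labels.count(c) / n for c in classes])
    for present in (True, False):
        subset = [y for x, y in zip(has_feature, labels) if x == present]
        if subset:
            weight = len(subset) / n   # P(f) or P(not f)
            ig += weight * entropy_term([subset.count(c) / len(subset) for c in classes])
    return ig

def fitness(has_feature, labels, positive=1):
    """Fitness(f) = (C_{+,f} / C_+) * IG(f), with the counts assumed as described above."""
    c_pos = sum(1 for y in labels if y == positive)
    c_pos_f = sum(1 for x, y in zip(has_feature, labels) if x and y == positive)
    return (c_pos_f / c_pos) * information_gain(has_feature, labels)
```

Here `has_feature` is a list of Booleans (one per training sequence) and `labels` the matching class labels. Under these assumptions, a feature present in exactly the positive sequences of a balanced two-class set scores 1, and a feature independent of the class scores 0.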
  15. Post FG-EA feature selection. The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination (RFE). RFE starts with a large feature set and gradually reduces it by removing the least successful features until a stopping criterion is met. The authors employ RFE to estimate the impact of feature set size on the precision and accuracy of the classification, and to compare directly with existing work.
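The elimination loop itself is simple and can be sketched generically. In this illustration the stopping criterion is a target feature count, and `rank_features` is a hypothetical caller-supplied function standing in for refitting a model on the current subset; neither detail is taken from the paper.

```python
def recursive_feature_elimination(features, rank_features, target_size, step=1):
    """Repeatedly drop the lowest-ranked features until target_size remain.

    rank_features is a caller-supplied (hypothetical) function that refits a
    model on the current feature subset and returns one importance score per
    feature, for example the absolute weights of a linear SVM.
    """
    remaining = list(features)
    while len(remaining) > target_size:
        scores = rank_features(remaining)
        n_drop = min(step, len(remaining) - target_size)
        # Indices of the n_drop least important features in the current subset.
        worst = set(sorted(range(len(remaining)), key=lambda i: scores[i])[:n_drop])
        remaining = [f for i, f in enumerate(remaining) if i not in worst]
    return remaining

# Toy stand-in for a model refit: a fixed importance per feature name.
toy_importance = {"f1": 0.9, "f2": 0.1, "f3": 0.5, "f4": 0.3}
selected = recursive_feature_elimination(
    ["f1", "f2", "f3", "f4"],
    rank_features=lambda subset: [toy_importance[f] for f in subset],
    target_size=2,
)
```

Re-ranking after every removal is what distinguishes RFE from a single-pass ranking: the importance of a feature can change once other features are gone.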
  16. Support vector machines as classifier. SVMs are popular and successful in a wide variety of binary classification problems. In this paper, they are used in three steps: (1) map the sequence data into a Euclidean vector space; (2) select a kernel function to map the vector space into a higher-dimensional and more effective Euclidean space; (3) tune the parameters of the kernel and the other SVM parameters to improve performance.
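The first of these steps, mapping sequences into a vector space, might look like the sketch below when each generated feature is Boolean. The example predicates and the 0/1 encoding are assumptions for illustration; kernel selection and parameter tuning would be delegated to an SVM library and are not shown.

```python
def to_vector(sequence, features):
    """Map a sequence to a 0/1 vector with one coordinate per Boolean feature."""
    return [1.0 if feature(sequence) else 0.0 for feature in features]

# Hypothetical example features written as plain predicates on the sequence.
features = [
    lambda s: "CAG" in s,                # compositional: motif occurs anywhere
    lambda s: s.startswith("GT", 5),     # positional: motif at 0-based position 5
    lambda s: "TTT" in s or "AAA" in s,  # disjunctive combination
]

vectors = [to_vector(s, features) for s in ["TTCAGGTAA", "ACGTACGTA"]]
```

Each sequence becomes a fixed-length numeric vector regardless of its own length, which is the form an SVM expects as input.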
  17. Outline: Introduction, Methods, Materials, Discussion, Conclusions
  18. Data sets. Classification performance is compared with two different groups of state-of-the-art methods in splice site prediction: feature-based and kernel-based. Feature-based (FGA and GeneSplicer): the data are extracted from the 2005 NCBI RefSeq collection of 5,057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov), which is used to extract 51,008 positive sequences (containing splice sites) and 200,000 negative sequences. The 25,504 acceptor and 25,504 donor sequences consist of 162 nucleotides each (80 upstream + AG|GT + 80 downstream). Kernel-based (WD and WDS): the data are extracted from the worm data set with EST sequences (http://www.wormbase.org), using 64,844 donor and 64,838 acceptor splice site sequences. Each sequence is 142 nucleotides long (60 + AG|GT + 80). In addition, 1,777,912 sequences are centered around non-splice-site AG dinucleotides, and 2,846,598 sequences are centered around non-splice-site GT dinucleotides.
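The window arithmetic (80 + 2 + 80 = 162 and 60 + 2 + 80 = 142) can be illustrated with a small helper that cuts fixed-size windows around every candidate dinucleotide. The function name and the toy sequences are hypothetical, and deciding whether a candidate is a true splice site would come from the data set annotation, which this sketch does not model.

```python
def candidate_windows(sequence, dinucleotide, upstream, downstream):
    """Yield (position, window) for each occurrence of the dinucleotide that
    has enough flanking sequence on both sides."""
    start = sequence.find(dinucleotide, upstream)
    while start != -1 and start + len(dinucleotide) + downstream <= len(sequence):
        window = sequence[start - upstream : start + len(dinucleotide) + downstream]
        yield start, window
        start = sequence.find(dinucleotide, start + 1)

# Toy sequences with a single candidate each (not real data).
human_like = "A" * 80 + "AG" + "C" * 80
worm_like = "T" * 60 + "GT" + "C" * 80
acceptor_windows = list(candidate_windows(human_like, "AG", upstream=80, downstream=80))
donor_windows = list(candidate_windows(worm_like, "GT", upstream=60, downstream=80))
```

Because every window has the same length and the dinucleotide sits at a fixed offset, positional features can refer to absolute positions within the window.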
  19. Overview of conducted experiments. Two sets of classification experiments are conducted. (1) Comparison of FG-EA with FGA and GeneSplicer: the SVM is trained on 2/3 of the data and tested on the remaining 1/3. This process is repeated three times to obtain an average performance, with 30 different sets of hall of fame features. The trained SVM is then applied to classify the B2Hum testing data set. (2) Comparison with the WD and WDS methods: 30 independent runs of FG-EA are employed, with SVM evaluation of the resulting features. 40,000 sequences are sampled from the worm data set, and ten different subsets of 360,000 sequences are randomly sampled from the worm data set. The values of the parameters used can be found on the authors' website (http://www.cs.gmu.edu/~ashehu/?q=OurTools).
  20. Overview of conducted experiments: performance measurements. Performance is measured in terms of 11ptAVG, FPR (false positive rate), auROC (area under the ROC curve), and auPRC (area under the precision-recall curve). The 11ptAVG is the average of the precisions calculated at 11 recall values {0%, 10%, ..., 100%}. Precision-recall curves (PRCs) are employed to show the ability of FG-EA to discriminate true splice sites from other sequences. FPR is also computed at each recall value by varying the confidence threshold, and the resulting FPR-recall curves show that FG-EA makes very few mistakes.
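The 11ptAVG measure can be computed from ranked classifier scores as sketched below. The sketch assumes the common interpolation convention, in which the precision at a recall level is the highest precision observed at any recall greater than or equal to that level; the slides do not say which convention the authors use. The function name is hypothetical, and labels are taken to be 1 for true splice sites and 0 otherwise.

```python
def eleven_point_average(scores, labels):
    """11-point average precision with interpolated precision (an assumption)."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    total_pos = sum(ranked)
    points = []            # (recall, precision) after each ranked prediction
    true_pos = 0
    for k, y in enumerate(ranked, start=1):
        true_pos += y
        points.append((true_pos / total_pos, true_pos / k))
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0) for level in levels
    ]
    return sum(interpolated) / len(levels)
```

Sweeping down the ranked list corresponds to lowering the confidence threshold, which is the same mechanism used to trace the precision-recall and FPR-recall curves.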
  21. Evaluation of fitness quality and convergence. [Figure] Mean and maximum fitness values per generation (left: acceptor, right: donor), averaged over 30 independent GP runs. Error bars are standard deviations.
  22. Performance on human training data set. [Figure] Precision versus recall on the training data set: precision values are plotted over recall points (left: acceptor, right: donor). Values are averages over 30 FG-EA runs. Error bars are standard deviations.
  23. Performance on B2Hum testing data set. [Figure] Precision versus recall on the testing data set: precision over recall (left: acceptor, right: donor) is plotted for the B2Hum testing data set.
  24. Performance on worm training data set. [Figure] Precision over recall (left: acceptor, right: donor) is plotted for the 40K subset sampled from the worm training data set.
  25. Performance on worm training data set (continued)
  26. Outline: Introduction, Methods, Materials, Discussion, Conclusions
  27. Discussion. The hall of fame features are divided into three types of subsets: (1) all compositional features; (2) all region-specific compositional, positional, and correlational features; (3) all remaining features, which include conjunctive and disjunctive features.
  28. Outline: Introduction, Methods, Materials, Discussion, Conclusions
  29. Conclusions. FG-EA outperforms state-of-the-art feature generation methods in splice site classification. FG-EA reveals the significant role of novel, complex conjunctive and disjunctive features. The proposed FG-EA algorithm can easily be employed in other prediction problems on biological sequences. Further extensions of FG-EA can combine the evolution of features with the evolution of SVM kernels for greater classification accuracy. The authors plan to employ regular expressions to further combine features and reduce the bloat in the expressions, and so improve readability and performance.
  30. Evolutionary algorithms and genetic programming are widely used in bioinformatics, computational science, economics, chemistry, and other fields. However, they also have some limitations: operating on dynamic data sets is difficult; they cannot effectively solve problems in which the only fitness measure is a single right/wrong measure; and in some cases other optimization algorithms can be more effective than an evolutionary algorithm. Therefore, I think that we can combine the FG-EA algorithm with other optimization algorithms to get better results on the DNA splice site prediction problem.
  31. Thank you for listening!