The Challenge of Predicting Gene Function
Transcript

  • 1. The Challenge of Predicting Gene Function
    • Ross D. King
    • Department of Computer Science
    • University of Wales, Aberystwyth
  • 2. Gene Function Prediction
    • The most important revelation from the sequenced genomes is that the functions of typically only 60-70% of the predicted genes are known with any confidence.
    • The new science of functional genomics is dedicated to determining the function of the genes of unassigned function, and to further detailing the function of genes with purported function.
  • 3. Data Mining Prediction
    • We have developed a method for predicting the functional class of gene products based on ILP/Relational data mining.
    • The idea is to learn a reliable predictive function on the examples of genes with products of known function.
    • Then apply this function to genes where the functional class is unknown.
    • We call this approach: Data Mining Prediction (DMP).
  • 4. Predicting Gene Function in Yeast
    • We will demonstrate our approach using ORFs in yeast (Saccharomyces cerevisiae).
    • Using the MIPS functional classification scheme
    • For those ORFs whose function is currently unknown
    • Using 5 types of data:
      • Sequence statistics
      • Homology (sequence similarity)
      • Predicted Secondary Structure
      • Expression (microarray)
      • Phenotype
  • 5. We want to map from sequence to function class
  • 6. Classification Schemes 1
    • MIPS/GeneOntology
    1,0,0,0 "METABOLISM"
    2,0,0,0 "ENERGY"
    3,0,0,0 "CELL CYCLE AND DNA PROCESSING"
    4,0,0,0 "TRANSCRIPTION"
    5,0,0,0 "PROTEIN SYNTHESIS"
    6,0,0,0 "PROTEIN FATE (folding, modification, destination)"
    8,0,0,0 "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
    10,0,0,0 "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
    11,0,0,0 "CELL RESCUE, DEFENSE AND VIRULENCE"
    13,0,0,0 "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
    14,0,0,0 "CELL FATE"
    29,0,0,0 "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
    30,0,0,0 "CONTROL OF CELLULAR ORGANIZATION"
    40,0,0,0 "SUBCELLULAR LOCALISATION"
    62,0,0,0 "PROTEIN ACTIVITY REGULATION"
    63,0,0,0 "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT"
    67,0,0,0 "TRANSPORT FACILITATION"
    98,0,0,0 "CLASSIFICATION NOT YET CLEAR-CUT"
    99,0,0,0 "UNCLASSIFIED PROTEINS"
  • 7. Classification Schemes 2
    • Hierarchy of classes
    1,0,0,0 "METABOLISM"
      1,1,0,0 "amino acid metabolism"
      1,2,0,0 "nitrogen and sulfur metabolism"
      1,3,0,0 "nucleotide metabolism"
      1,4,0,0 "phosphate metabolism"
      1,5,0,0 "C-compound and carbohydrate metabolism"
      1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
      1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
      1,20,0,0 "secondary metabolism"
  • 8. Classification Schemes 3
    • Hierarchy of classes
    1,0,0,0 "METABOLISM"
      1,1,0,0 "amino acid metabolism"
        1,1,1,0 "amino acid biosynthesis"
        1,1,4,0 "regulation of amino acid metabolism"
        1,1,7,0 "amino acid transport"
        1,1,10,0 "amino acid degradation (catabolism)"
        1,1,99,0 "other amino acid metabolism activities"
      1,2,0,0 "nitrogen and sulfur metabolism"
      1,3,0,0 "nucleotide metabolism"
      1,4,0,0 "phosphate metabolism"
      1,5,0,0 "C-compound and carbohydrate metabolism"
      1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
      1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
      1,20,0,0 "secondary metabolism"
    • ... and ORFs may have multiple functions too!
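The comma-separated class codes encode the hierarchy directly: trailing zeros mark unused levels, so the ancestors of a class can be read off its code. The sketch below illustrates that reading of the codes; the helper names (`parse_code`, `level`, `ancestors`) are hypothetical and are not part of the MIPS scheme or of DMP.

```python
def parse_code(code):
    """Turn '1,1,4,0' into (1, 1, 4): trailing zeros mark unused levels."""
    parts = [int(p) for p in code.split(",")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def level(code):
    """Depth of a class in the hierarchy (1 = top level)."""
    return len(parse_code(code))


def ancestors(code, depth=4):
    """All classes above `code` in the hierarchy, most general first."""
    parts = parse_code(code)
    result = []
    for n in range(1, len(parts)):
        padded = list(parts[:n]) + [0] * (depth - n)
        result.append(",".join(str(p) for p in padded))
    return result


# An ORF may carry several labels; expanding each label to its ancestors
# gives the ORF's label set at every level of the hierarchy.
orf_labels = ["1,1,4,0", "1,3,0,0"]
expanded = sorted({c for lab in orf_labels for c in ancestors(lab) + [lab]})
print(expanded)
```

Expanding labels this way is one simple route to having a learning target at each level of the hierarchy.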
  • 9. Sequence Data
    • 478 attributes in total
    field             description                                     type
    aa_rat_X          % of amino acid X in the protein                real
    seq_len           length of the protein sequence                  int
    aa_rat_pair_X_Y   % of the amino acids X and Y consecutively      real
    mol_wt            molecular weight of the protein                 int
    theo_pI           theoretical pI (isoelectric point)              real
    atomic_comp_X     atomic composition of X (C,H,N,O,S)             real
    aliphatic_index   aliphatic index                                 real
    hydro             grand average of hydropathy                     real
    strand            the DNA strand                                  'w' or 'c'
    position          the number of exons (no. of start positions)    int
    cai               codon adaptation index                          real
    motifs            number of PROSITE motifs                        int
    tmSpans           number of transmembrane spans                   int
    chromosome        chromosome number                               1..16,mit
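As an illustration of the composition attributes in this table, the following sketch computes `seq_len`, `aa_rat_X` and `aa_rat_pair_X_Y` from a protein sequence. The function name and the choice of denominator for the pair percentages (the number of adjacent pairs) are assumptions; the slides do not say how the original attributes were normalised.

```python
from collections import Counter

AMINO_ACIDS = "acdefghiklmnpqrstvwy"  # the 20 standard one-letter codes


def sequence_attributes(seq):
    """Compute length and single/pair amino acid percentages for one protein."""
    seq = seq.lower()
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two residues to form a pair")
    attrs = {"seq_len": n}
    single = Counter(seq)
    for aa in AMINO_ACIDS:
        attrs[f"aa_rat_{aa}"] = 100.0 * single[aa] / n
    # Count overlapping adjacent pairs: positions (0,1), (1,2), ...
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    for x in AMINO_ACIDS:
        for y in AMINO_ACIDS:
            attrs[f"aa_rat_pair_{x}_{y}"] = 100.0 * pairs[x + y] / (n - 1)
    return attrs


# Start of the YAL001C sequence as shown on the homology slide (truncated there).
fragment = "mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk"
attrs = sequence_attributes(fragment)
print(attrs["seq_len"], round(attrs["aa_rat_k"], 1))
```

The 1 + 20 + 400 attributes produced here account for most of the 478 sequence attributes; the remainder (molecular weight, pI, and so on) need external tools or annotation.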
  • 10. Homology data
    • Query ORF YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
    • PSI-BLAST search against the sequence database NRDB:
    gene      organism           score
    tfc       baker's yeast      0.0
    sfc3      fission yeast      1.0e-18
    wsv442    white spot virus   2.1
    cg9463    fruit fly          2.9
    f1l3      Arabidopsis        3.0
    • We look up the associated information from SwissProt, e.g. sfc3: keyword(membrane) length(358) dbref(prosite) dbref(embl)
  • 11. Predicted Secondary Structure Data
    mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
    cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
    • We record length and relative positions of the secondary structure elements.
    • This is relational data.
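Recording the "length and relative positions" of structure elements amounts to run-length encoding the prediction string. A minimal sketch, assuming the letters a, b and c stand for alpha, beta and coil; the dictionary layout and function names are illustrative, not the representation used in the original work.

```python
from itertools import groupby


def structure_elements(pred):
    """Run-length encode a predicted structure string (a=alpha, b=beta, c=coil)."""
    elements, start = [], 0
    for kind, run in groupby(pred):
        length = len(list(run))
        elements.append({"type": kind, "start": start, "length": length})
        start += length
    return elements


def neighbours(elements):
    """Adjacent element pairs, i.e. the 'followed by' relation between elements."""
    return list(zip(elements, elements[1:]))


pred = "cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb"
elems = structure_elements(pred)
for first, second in neighbours(elems)[:3]:
    print(first["type"], first["length"], "->", second["type"], second["length"])
```

Because the number of elements differs from protein to protein, this data does not fit a fixed-width table, which is why it is treated as relational.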
  • 12. Expression Data
    • Sources: Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)
    • Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift.
    • Short time series data, numerical-valued
               0       7       14      21
    YBR166C    0.33   -0.17    0.04   -0.07
    YOR357C   -0.64   -0.38   -0.32   -0.29
    YLR292C   -0.23    0.19   -0.36    0.14
    YGL112C   -0.69   -0.89   -0.74   -0.56
    ...
  • 13. Phenotype Data
    • Data from knockout gene growth experiments
    • Many missing data
    • 69 attributes x 1461 ORFs of known function
    • 991 genes of unknown function
    • Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
    • Key: s = sensitive (less growth), w = wild-type (no observable effect), r = resistant (more growth), n = no data
    growth medium      deleted ORF
                       YAL001C   YAL019W   YAL021C   YAL029C
    calcofluor white   w         n         n         n
    sorbitol           n         s         n         w
    benomyl            n         w         n         w
    ...
    H2O2               w         w         n         r
  • 14. What are the Machine Learning Issues?
    • Large volume of data
    • Missing data
    • Accurate results required
    • Intelligible results required
    • Class hierarchy
    • Multiple labels
    • Relational data
  • 15. Relational vs Propositional
    • Propositional: single table, fixed number of columns/attributes
    orf       time0   time7   time14
    yal001c   0.34    0.52    0.48
    yal002w   0.76    0.82    0.89
    yal003w   0.77    0.46    0.78
    yal004c   0.38    0.50    0.49
    • Relational: multiple tables, multiple values
    orf       SwissProtID   e-val
    yal001c   p03415        2e-4
    yal001c   p08640        8e-58
    yal002w   p32583        6e-52
    yal002w   p08775        3e-42

    SwissProtID   keyword
    p03415        apoptosis
    p03415        repeat
    p03415        zinc
    p08640        membrane
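  • 16. Data Mining Prediction (DMP)
    • [Workflow diagram: the entire database is split 2/3 : 1/3 into data for rule creation and test data; the data for rule creation is split 2/3 : 1/3 into training data and validation data; rule generation (PolyFARM, C4.5) on the training data produces all rules; the best rules are selected using the validation data; rule accuracy is measured on the test data to give the results.]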
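The nested 2/3 : 1/3 split in the workflow can be sketched as follows. This is only an illustration under stated assumptions: the ORF identifiers are made up, and a plain random shuffle is assumed, since the slides do not say how examples were assigned to each part.

```python
import random


def split(items, fraction, rng):
    """Shuffle a copy of `items` and cut it into (first part, remainder)."""
    items = list(items)
    rng.shuffle(items)
    cut = round(len(items) * fraction)
    return items[:cut], items[cut:]


rng = random.Random(0)                       # fixed seed for repeatability
orfs = [f"ORF{i:04d}" for i in range(900)]   # hypothetical identifiers

# Outer split: rule creation vs held-out test data.
rule_creation, test = split(orfs, 2 / 3, rng)
# Inner split: training data (rule generation) vs validation data (rule selection).
training, validation = split(rule_creation, 2 / 3, rng)

print(len(training), len(validation), len(test))
```

Keeping the test data out of both rule generation and rule selection is what makes the final accuracy figures an estimate of performance on unseen ORFs.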
  • 17. Warmr
    • Warmr is an ILP algorithm developed by Dehaspe et al.
    • It is an ILP version of the well-known Apriori data mining algorithm.
    • Designed to find frequent patterns in a datalog database.
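For orientation, here is a minimal sketch of the propositional Apriori algorithm that Warmr generalises. It is not Warmr: it mines frequent sets of plain items rather than first-order patterns over a datalog database, and the example keyword sets are invented for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


# Hypothetical keyword sets, one per ORF.
orf_keywords = [
    {"membrane", "repeat"},
    {"membrane", "zinc"},
    {"membrane", "repeat", "zinc"},
    {"apoptosis"},
]
frequent = apriori(orf_keywords, min_support=0.5)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The level-wise structure (count, keep the frequent ones, extend by one) is the part that carries over to the first-order setting, where candidates are extended with additional literals instead of additional items.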
  • 18. PolyFARM
    • First-order association rule mining
    • Finding all frequent first order patterns in the data
    • Distributed on a Beowulf cluster
    • 47,034 homology patterns, f > 5%
    • 19,628 structure patterns, f > 2%
    • Example structure pattern: struc(Pos1, a) ^ neighbour(Pos1, Pos2, c) ^ neighbour(Pos2, Pos3, a) ^ coil_dist(high)
      • Contains alpha-coil-alpha with a high overall coil distribution
    • Example homology pattern: hom(SPID, close) ^ sq_len(SPID, short) ^ classification(SPID, ecoli)
      • A close homology to a short protein in E. coli
    • [Clare & King PADL 2003]
  • 19. Propositionalisation
    • Transforming relational data into boolean attributes
              patt1   patt2   patt3   patt4   ...   patt47034
    YAL001C   0       1       0       0       ...   1
    YAL002W   0       1       1       0       ...   1
    YAL003W   1       0       0       1       ...   0
    YAL004W   1       1       0       0       ...   1
    YAL005C   0       0       0       0       ...   1
    ...
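Propositionalisation can be sketched as evaluating each mined pattern against each ORF's relational facts and recording a 0 or 1. The facts below are copied from the example tables on the "Relational vs Propositional" slide; the three patterns and the 1e-50 threshold are hypothetical stand-ins for mined first-order patterns.

```python
# Relational facts: (orf, SwissProtID, e-val) and (SwissProtID, keyword).
hits = [
    ("yal001c", "p03415", 2e-4),
    ("yal001c", "p08640", 8e-58),
    ("yal002w", "p32583", 6e-52),
    ("yal002w", "p08775", 3e-42),
]
keywords = [
    ("p03415", "apoptosis"),
    ("p03415", "repeat"),
    ("p03415", "zinc"),
    ("p08640", "membrane"),
]


def hit_proteins(orf, max_eval=float("inf")):
    """SwissProt IDs that the ORF hits with an e-value of at most max_eval."""
    return {sp for o, sp, e in hits if o == orf and e <= max_eval}


def has_keyword(proteins, word):
    """True if any of the given proteins is annotated with the keyword."""
    return any(sp in proteins and kw == word for sp, kw in keywords)


# Hypothetical patterns; each is an existential query over the ORF's facts.
patterns = {
    "patt1": lambda orf: bool(hit_proteins(orf, max_eval=1e-50)),
    "patt2": lambda orf: has_keyword(hit_proteins(orf), "membrane"),
    "patt3": lambda orf: has_keyword(hit_proteins(orf), "zinc"),
}


def propositionalise(orfs):
    """One row of 0/1 attributes per ORF, one column per pattern."""
    return {orf: [int(test(orf)) for test in patterns.values()] for orf in orfs}


table = propositionalise(["yal001c", "yal002w"])
for orf, row in table.items():
    print(orf, *row)
```

The resulting fixed-width boolean table is what allows a propositional learner such as C4.5 to be applied to data that started out relational.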
  • 20. Dichotomic Search 1
    • As an alternative to the WARMR data-mining approach, we developed a frequent pattern finding method based on dichotomic search.
    • This approach uses domain-specific logics as intermediates between propositional logic and predicate logic.
  • 21. Dichotomic Search 2
    • Most existing algorithms traverse the search space in either a top-down or a bottom-up fashion. We propose a new approach based on dichotomic search, which explores the search space in both directions, allowing larger steps.
    • Dichotomic search combines completeness (w.r.t. concepts), non-redundancy, and flexibility.
    • Ferre, S. & King, R.D. (2005). Fundamenta Informaticae
  • 22. Data Mining Prediction (DMP)
    • [Workflow diagram, as on slide 16]
  • 23. C4.5
    • Open source decision tree algorithm
    • propositional learning
    • commonly used
    • produces interpretable rules
    • reliable
    • fast
    • accurate
    • Made modifications for:
    • multiple labels
    • hierarchical labels
    • [Clare & King Bioinformatics 2002]
    • [Example decision tree: internal nodes test aa_ratio_pair_p_y (> 6.4 / <= 6.4), strand (w / c) and aa_rat_a (> 0.232 / <= 0.232); leaves are the classes metabolism, transport, cell fate and transcription.]
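One way an entropy-based splitting criterion can be adapted to multiple labels is to sum, over classes, the uncertainty about whether an example has each class. The sketch below assumes that form, entropy = -Σ (p log2 p + q log2 q) with q = 1 - p; it is an illustration of the idea and not necessarily the exact modification made to C4.5 here. The example data and function names are invented.

```python
import math


def multilabel_entropy(label_sets, classes):
    """Sum over classes of the binary entropy of 'example has this class'."""
    n = len(label_sets)
    total = 0.0
    for c in classes:
        p = sum(1 for labels in label_sets if c in labels) / n
        for prob in (p, 1 - p):
            if prob > 0:
                total -= prob * math.log2(prob)
    return total


def information_gain(examples, attribute, classes):
    """Entropy reduction from splitting on a discrete attribute."""
    before = multilabel_entropy([labels for _, labels in examples], classes)
    after = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [labels for attrs, labels in examples if attrs[attribute] == value]
        after += len(subset) / len(examples) * multilabel_entropy(subset, classes)
    return before - after


classes = ["metabolism", "transport", "transcription"]
examples = [  # hypothetical ORFs: (attributes, set of functional classes)
    ({"strand": "w"}, {"metabolism"}),
    ({"strand": "w"}, {"metabolism", "transport"}),
    ({"strand": "c"}, {"transcription"}),
    ({"strand": "c"}, {"transcription"}),
]
gain = information_gain(examples, "strand", classes)
print(round(gain, 3))
```

With a criterion of this kind a leaf can predict a set of classes rather than a single one, which matches the fact that ORFs may have several functions.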
  • 24. Data Mining Prediction (DMP)
    • [Workflow diagram, as on slide 16]
  • 25. Results
    • Many rules from each data type
    • Rules at each level of hierarchy
    • Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%)
    • Good levels of accuracy on held out test data
    • Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function)
    • Some rules explainable by biology -> scientific knowledge discovery
    • Clare & King (2003) Bioinformatics suppl. 2., 42-49
  • 26. Accuracy Table
  • 27. Expression Data Rule
    If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
    and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
    and in the micro-array experiment (YPD stationary phase) the ORF expression is > -1.06
    then the function of this ORF is "pheromone response, mating type determination, sex-specific proteins"
    • Accuracy on training data: 11/12 (92%)
    • Accuracy on the test data: 3/4 (75%)
    • 21 predictions made
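A rule of this form is just a conjunction of threshold tests, so it can be written as a small predicate and its accuracy computed as correct / covered. The thresholds below are taken from the rule above; the dictionary keys, the function names and the example ORF measurements are assumptions made for illustration.

```python
PHEROMONE_CLASS = "pheromone response, mating type determination, sex-specific proteins"


def pheromone_rule(expr):
    """True if the ORF's expression values satisfy all three conditions of the rule."""
    return (expr["sorbitol incubation"] > -0.25
            and expr["nitrogen depletion"] <= -1.29
            and expr["YPD stationary phase"] > -1.06)


def rule_accuracy(correct, covered):
    """Fraction of the ORFs matched by the rule that really have the predicted class."""
    return correct / covered


# Hypothetical measurements for two ORFs.
matching = {"sorbitol incubation": 0.10, "nitrogen depletion": -1.50, "YPD stationary phase": -0.40}
not_matching = {"sorbitol incubation": 0.10, "nitrogen depletion": -0.20, "YPD stationary phase": -0.40}

print(pheromone_rule(matching), pheromone_rule(not_matching))
print(f"training {rule_accuracy(11, 12):.0%}, test {rule_accuracy(3, 4):.0%}")
```

Because the rule is readable as plain threshold tests, a biologist can check it against what is known about the conditions involved.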
  • 28. Structure Rule
    • 80% accurate on test data
    • Most matching ORFs belong to the Mitochondrial Carrier Family
    • These have 6 long transmembrane alpha-helices of about 20-30 amino acids
    • Why do we notice alpha-helices of length 10-14?
    If true: coil (of length 3) followed by alpha (10 <= length < 14)
    and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
    and true: coil (of length 3) followed by alpha (3 <= length < 6)
    and false: coil followed by beta followed by coil (c-b-c)
    and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
    then the function of this ORF is "mitochondrial transport"
  • 29. Alignment
    YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
    YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
    YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
    YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
    YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
    YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
    YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
    YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
    YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
    YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
    YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
    YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255

    YJL133W -SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
    YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
    YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
    YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
    YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
    YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
    YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
    YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
    YMR166C HPHVTNGRPAALSNSISLSLRTVYQSEGVLGFFSGVGPRFVWTSVQSSIMLLLYQMTLRG 360
    YDL198C ---FDNPESG------LRIVKNTLKNEGVTAFFKGLTPKLLTTGPKLVFSFALAQSLIPR 293
    YGR257C ---NSDPKGGNRSRNMFKFLETIWRTEGLAALYTGLAARVIKIRPSCAIMISSYEISKKV 359
    YDL119C ----SKFTNS------FNTFTSIVKNENVLKLFSGLSMRLARKAFSAGIAWGIYEELVKR 305
  • 30. Alignment
    YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251
    YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241
    YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310
    YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271
    YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250
    YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246
    YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261
    YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239
    YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300
    YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242
    YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302
    YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255

    YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310
    YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300
    YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364
    YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325
    YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310
    YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303
    YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312
    YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289
    YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360
    YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293
    YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359
    YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305
  • 31. Homology rule
    • This rule is 100% accurate on test data
    • Almost all matching ORFs are from the 20S proteasome subunit for degradation of proteins
    • These subunits exist in archaea and eukaryotes, but only in one specific branch of bacteria (actinomycetes).
    If the ORF is not weakly homologous to a protein in klebsiella
    and is strongly homologous to a protein in desulfurococcales
    and is strongly homologous to a short protein in cyprinidae
    then the function of this ORF is "Protein fate (folding, modification, destination)"
  • 32. Homology rule
    • This rule is 100% accurate on test data
    • Almost all matching ORFs are from the 20S proteasome subunit for degradation of proteins
    • These subunits exist in archaea and eukaryotes, but only in one specific branch of bacteria (actinomycetes).
    If the ORF is not weakly homologous to a protein in klebsiella
    and is strongly homologous to a protein in desulfurococcales
    and is strongly homologous to a short protein in cyprinidae
    then the function of this ORF is "Protein fate (folding, modification, destination)"
  • 33. Application of DMP to Bacterial Genomes
    • Successful for both M. tuberculosis and E. coli.
    • Of the ORFs with no assigned function, >40% were predicted to have a function at one or more levels of the class hierarchy.
    • It was found that many of the predictive rules were more general than possible using sequence homology.
    • References
    • King et al. (2000) KDD 2000
    • King et al. (2000) Yeast (Comparative and Functional Genomics)
    • King et al. (2001) Bioinformatics
  • 34. Example Rule (level 2 E. coli)
    If the ORF is not predicted to have a β-strand of length [?] 3
    and a homologous protein from class Chytridiomycetes was found
    then its functional class is “Cell processes, Transport/binding proteins”
    • 12/13 (86%) correct on Test Set - probability of this result occurring by chance is estimated at 4x10^-7.
    • 24 ORFs of unknown function are predicted by the rule.
    • 16 ORFs now with putative or confirmed function - 93.8% accurate predictions
  • 35. Experimental Confirmation
    • The original bacterial ORF predictions were made over three years ago.
    • In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the functions of some ORFs have been determined by wet biology.
    • The E. coli genome has been re-annotated by Monica Riley’s group.
  • 36. “Wet” Biology Confirmation
    • A number of predictions have been confirmed or falsified by new “wet” experimental data.
    • This new data is biased towards hard classes. Despite this the results are still good:
      • Level 2: 23 predictions - 47.8% accuracy
      • Level 3: 23 predictions - 43.4% accuracy
    This is very much better than random as there are many classes.
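To give a feel for "very much better than random", the sketch below computes a binomial tail probability: the chance of getting at least the observed number of correct predictions by guessing. It rests on loud assumptions that are not in the slides: that random guessing picks uniformly among the classes, that each ORF has one correct class, and that there are 20 classes at this level (a made-up figure used only for illustration).

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_predictions = 23
correct = round(0.478 * n_predictions)   # 47.8% of 23 predictions
n_classes = 20                           # hypothetical number of classes
p_random = 1 / n_classes                 # assumed uniform random guessing

tail = binomial_tail(n_predictions, correct, p_random)
print(correct, f"{tail:.2e}")
```

Under these assumptions the observed accuracy would be extremely unlikely to arise by chance; with fewer classes or a skewed class distribution the probability would be larger, so the figure should be read as indicative only.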
  • 37. Confirmation of “Wet” Predictions
  • 38. Extension to Arabidopsis Genome
    • Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham.
    • Large increase in data: 6,000 (yeast) -> 25,000 ORFs.
    • Large amount of micro-array data from the Nottingham Arabidopsis stock centre.
    • The increase in data (100s of MB) is a challenge to our machine learning algorithms.
    • Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.
  • 39. Results
    • Accuracy comparable to yeast and bacteria
    • Large fraction of genes of currently unknown function are predicted.
    • Some rules could be interpreted in terms of known biology
    • Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136.
  • 40. Gibberellin Biosynthesis Prediction
    • Gibberellin is an important plant hormone.
    • Chosen because of interesting phenotypes – often extreme size.
    • Insertion of a promoter to overproduce gene product.
    • Result
      • 2 days earlier flowering
      • Average leaf number and weight increased at 21 days.
    • This phenotype is consistent with prediction.
  • 41.  
  • 42. Leaf number increases more rapidly in the mutant (yellow bars) than in wildtype Landsberg erecta (blue bars)
  • 43. Paclobutrazol (P) (inhibitor of gibberellin) abolishes the difference between mutant (M) and wildtype (L); C = control
  • 44. Availability All rules and data available at http://www.aber.ac.uk/compsci/Research/bio/dss/ All predictions available at http://www.genepredictions.org
  • 45. ILP 2005 Challenge 1
    • Yeast function prediction data used as a community challenge: http://www.protein-logic.com/
    • The intention of the challenge was to provide a real-world data set to test how far we have progressed in the field of ILP and multi-relational data mining. The questions we wanted to answer were: Are the tools up to the job? Do they scale? Do they handle noisy, sparse and complex data?
  • 46. ILP 2005 Challenge 2
    • A. J. Knobbe, E. K. Y. Ho, R. Malik: ILP Challenge 2005: The Safarii MRDM environment.
    • C. Perlich: Approaching the ILP 2005 challenge: Class-Conditional Bayesian Propositionalization for Genetic Classification.
    • J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski, H. Blockeel: Applying Predictive Clustering Trees to the Inductive Logic Programming 2005 Challenge Data.
    • F. Riguzzi: A Simple Approach to a Multi-Label Classification Problem.
  • 47. Propositional Approach
    • Zafer Barutcuoglu, Robert E. Schapire and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics (in press)
    • Hierarchy of SVMs.
    • Uses a Bayesian net to combine predictions.
  • 48. Conclusions
    • Data mining and machine learning are powerful tools for functional genomics.
    • The DMP method can be successfully applied to different genomes (bacterial, yeast, Arabidopsis) to predict gene functional class.
    • Micro-array data is a useful component in DMP.
    • Biological insight can be extracted from DMP rules.
    • The structure of gene prediction problems makes them an exciting test bed for machine learning methods.
  • 49. Acknowledgements
    • Amanda Clare Aberystwyth
    • Andreas Karwath Freiburg (Aberystwyth)
    • Luc Dehaspe PharmaDM
    • Helen Ougham IGER
    • BBSRC
  • 50. The Need for Logic to Represent Scientific Knowledge
    • Logic is the best understood way to represent knowledge.
    • Traditional statistics, machine learning, and data mining are based on propositional logic.
    • For some problems we require a richer description language, i.e. first-order predicate calculus.
    • Using logic programming (predicate calculus) we can incorporate deduction, abduction, and induction.
  • 51. Inductive Logic Programming
    • Inductive Logic Programming (ILP) learns with logic programs (first-order predicate calculus), which are used to describe examples, theories, and background knowledge.
    • For certain types of problem ILP is a powerful data analysis technique - more accurate, and more comprehensible results than conventional methods.
    • Has been successfully applied to a number of biological/chemical problems.
  • 52. ILP for Science
    • The key advantage of ILP for scientific applications is that it allows the application of compact relational representations that are natural for scientists to use. This allows domain understandable rules to be automatically formed.
    • This advantage comes at a computational cost. However, non-technical reasons are probably the greatest barrier to adoption of ILP. For example, it is very difficult to explain the benefits of ILP to domain experts.
  • 53. Prediction of Lethality
    • Instead of using microarray data to predict the functional class of a gene, we have been using the same approach to predict whether a gene knock-out will be lethal (grown in a rich medium).
    • Example Rule: Test accuracy 82% (Default 21%).
    If false: the function of the ORF is cell cycle
    and true: the function of the ORF is rRNA transcription
    and in the micro-array experiment (cell cycle) the ORF expression is > -0.79
    then the knockout is lethal.
  • 54. Summary Results
    • Using voting (2 or more rules agree on a prediction)
      • Level 2: 128 ORFs predicted - 87.5% accuracy
      • Level 3: 23 ORFs predicted - 91.3% accuracy
    • All predictions
      • Level 2: 335 ORFs predicted - 64.5% accuracy
      • Level 3: 204 ORFs predicted - 44.6% accuracy
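The voting scheme ("2 or more rules agree on a prediction") can be sketched as counting how many rules predict the same class for the same ORF and keeping only predictions that reach a threshold. The input format, the function name and the example predictions are assumptions made for illustration; each pair is assumed to come from a different rule.

```python
from collections import Counter


def vote(rule_predictions, min_votes=2):
    """Keep (orf, class) predictions made by at least `min_votes` rules.

    rule_predictions: list of (orf, predicted_class) pairs, one per firing rule.
    """
    counts = Counter(rule_predictions)
    return {pred for pred, n in counts.items() if n >= min_votes}


# Hypothetical rule firings.
predictions = [
    ("YAL001C", "transport"),
    ("YAL001C", "transport"),
    ("YAL001C", "metabolism"),
    ("YAL019W", "cell fate"),
]
print(vote(predictions))
print(len(vote(predictions, min_votes=1)))
```

As the figures above show, requiring agreement trades coverage for accuracy: fewer ORFs receive a prediction, but a larger share of those predictions is correct.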