Prediction of protein function from sequence derived protein features
 

Technical University of Denmark, Lyngby, October 23, 2002



    • Prediction of Protein Function from Sequence Derived Protein Features
      • Lars Juhl Jensen
    • Function unknown for 40% of human proteins
    • Pairwise alignment
      • > carp Cyprinus carpio growth hormone 210 aa vs.
      • > chicken Gallus gallus growth hormone 216 aa
      • scoring matrix: BLOSUM50, gap penalties: -12/-2
      • 40.6% identity; Global alignment score: 487
      • 10 20 30 40 50 60 70
      • carp MA--RVLVLLSVVLVSLLVNQGRASDN-----QRLFNNAVIRVQHLHQLAAKMINDFEDSLLPEERRQLSKIFPLSFCNSD
      • :: . : ...:.: . : :. . :: :::.:.:::: :::. ..:: . .::..: .: .:: :.
      • chicken MAPGSWFSPLLIAVVTLGLPQEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFERTYIPEDQRYTNKNSQAAFCYSE
      • 10 20 30 40 50 60 70 80
      • 80 90 100 110 120 130 140 150
      • carp YIEAPAGKDETQKSSMLKLLRISFHLIESWEFPSQSLSGTVSNSLTVGNPNQLTEKLADLKMGISVLIQACLDGQPNMDDN
      • : ::.:::..:..: ..:::.:. ::.:: : : ::. .:.:. :. ... ::: ::. ::..:.. : .: .
      • chicken TIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQYLSKVFTNNLVFGTSDRVFEKLKDLEEGIQALMRELEDRSPR---G
      • 90 100 110 120 130 140 150 160
      • 170 180 190 200 210
      • carp DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL
      • .: : .. : . . .:. : ... ::.:::::.:::::::.: .::: .::::.
      • chicken PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
      • 170 180 190 200 210
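The identity figure reported for an alignment like the one above can be recomputed from the aligned rows. The following is a minimal sketch, assuming identity is counted over all alignment columns with gap columns treated as mismatches; alignment programs differ in which denominator they use, so the result need not match the 40.6% on the slide. The function name `percent_identity` is hypothetical, and the two strings are the last alignment block copied from the slide.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over all alignment columns (gaps count as mismatches).

    Assumption: the denominator is the full alignment length. Other
    conventions (shorter sequence, ungapped columns only) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Last alignment block from the slide (carp vs. chicken growth hormone).
carp = "DSLPLP-FEDFYLTM-GENNLRESFRLLACFKKDMHKVETYLRVANCRRSLDSNCTL"
chicken = "PQLLRPTYDKFDIHLRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI"
print(f"{percent_identity(carp, chicken):.1f}% identity in this block")
```

This only scores an existing alignment; producing the alignment itself (BLOSUM50, gap penalties -12/-2) requires a dynamic-programming aligner that is not shown here.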
    • Functional assignment: alignment versus prediction
      • Alignment is good for transferring knowledge about the function of homologous proteins
      • For orphan proteins there is no knowledge to transfer
      • Orphan sequences must thus be handled by true prediction tools rather than alignment
      • Develop a prediction method that works for orphans but only requires sequence input
      • Assign a possible function to as many of the orphans as possible
      • Screen the human genome for novel pharmaceutical targets such as transcription factors and receptors
    • The paradigm: sequence to structure to function
      • Structure does play a very important role for the function of proteins
      • Structure is not very useful for prediction of protein function
        • For proteins of unknown function, the structure is rarely known
        • Prediction of 3D structure from sequence is a very difficult unsolved problem
        • Prediction of protein function from structure is considered by many to be an even harder problem
      • Predicted secondary structure/fold can be used
    • 1AOZ (129 aa) vs. 1PLC (99 aa)
      • scoring matrix: BLOSUM50, gap penalties: -12/-2
      • 15.5% identity; Global alignment score: -23
      • 10 20 30 40 50 60
      • 1AOZ SQIRHYKWEVEYMFWAPNCNENIVMGINGQFPGPTIRANAGDSVVVELTNKLHTEGVVIH
      • .. .. : ... . . ..: . :...: . .: ...:.
      • 1PLC ---------IDVLLGA---DDGSLAFVPSEFS-----ISPGEKIVFK-NNAGFPHNIVFD
      • 10 20 30 40
      • 70 80 90 100 110 120
      • 1AOZ WHGILQRGTPWADGTASISQCAINPGETFFYNFTVDNPGTFFYHGHLGMQRSAGLYGSLI
      • .: :. . . : . :::: .. . .:. : : ::. :..
      • 1PLC EDSI-PSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQG----AGMVGKVT
      • 50 60 70 80 90
      • 1AOZ VDPPQGKKE
      • :.
      • 1PLC VN-------
    • An enzyme and a non-enzyme from the Cupredoxin superfamily
    • Function prediction from post-translational modifications
      • Proteins with similar function may not be related in sequence
      • Still they must perform their function in the context of the same cellular machinery
      • Similarities in features such as PTMs and physical/chemical properties could be expected for proteins with similar function
    • Functional classes predicted
      • Functional role (Monica Riley categories)
        • The original scheme had 14 categories
        • We reduce it to 12 categories by skipping the category “other” and combining replication and transcription
      • Enzyme prediction
        • Enzyme vs. non-enzyme
        • Major enzyme class in the EC system
      • Gene Ontology
        • A subset of classes can be predicted
      • Systems related categories
        • For example “cell cycle regulated”
    • The concept of ProtFun
      • Predict as many biologically relevant features as we can from the sequence
      • Train artificial neural networks for each category, also optimizing the feature combinations
      • Assign a probability for each category from the NN outputs
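As an illustration of the first step, the sketch below computes a few simple physical/chemical descriptors directly from a sequence. It is not the ProtFun feature set: the slides only say that features such as PTMs, subcellular localization, secondary structure, transmembrane helices, and physical/chemical properties are derived from the sequence. The function `simple_features` and its feature names are hypothetical, chosen only to show what a fixed-length feature vector for a neural network input might look like.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simple_features(sequence: str) -> dict[str, float]:
    """Return a small fixed-length feature dictionary for one protein.

    Assumption: these descriptors are stand-ins for illustration only;
    the real predictors use features produced by other prediction tools.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    n = len(seq)
    features = {
        "length": float(n),
        # Fraction of positively charged residues (Lys, Arg).
        "frac_positive": (counts["K"] + counts["R"]) / n,
        # Fraction of negatively charged residues (Asp, Glu).
        "frac_negative": (counts["D"] + counts["E"]) / n,
    }
    # Amino acid composition, one entry per standard residue.
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = counts[aa] / n
    return features


# N-terminal fragment of the carp growth hormone from the alignment slide.
example = simple_features("MARVLVLLSVVLVSLLVNQGRASDN")
print(len(example), "features; length =", example["length"])
```

A dictionary with stable keys makes it easy to train one network per feature first and then combine only the features that perform well.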
    • Training of neural networks
      • Human protein sequences from SWISS-PROT were assigned to functional classes based on their keywords by using the EUCLID dictionary
      • The set of sequences was divided into a test and a training set with no significant sequence similarity between the two sets
      • Neural networks were first trained for single features and subsequently for combinations of the best performing features
    • Prediction performance on cellular role categories
    • An enzyme and a non-enzyme from the Cupredoxin superfamily
    • Similar structure, different functions
      • Many examples exist of structurally similar proteins which have different functions
      • Two PDB structures from the Cupredoxin superfamily were shown
        • 1AOZ is an enzyme
        • 1PLC is not an enzyme
      • Despite their structural similarity, our method predicts both correctly
      # Functional category           1AOZ   1PLC
      Amino_acid_biosynthesis         0.126  0.070
      Biosynthesis_of_cofactors       0.100  0.075
      Cell_envelope                   0.429  0.032
      Cellular_processes              0.057  0.059
      Central_intermediary_metabolism 0.063  0.041
      Energy_metabolism               0.126  0.268
      Fatty_acid_metabolism           0.027  0.072
      Purines_and_pyrimidines         0.439  0.088
      Regulatory_functions            0.102  0.019
      Replication_and_transcription   0.052  0.089
      Translation                     0.079  0.150
      Transport_and_binding           0.032  0.052
      # Enzyme/nonenzyme
      Enzyme                          0.773  0.310
      Nonenzyme                       0.227  0.690
      # Enzyme class
      Oxidoreductase (EC 1.-.-.-)     0.077  0.077
      Transferase (EC 2.-.-.-)        0.260  0.099
      Hydrolase (EC 3.-.-.-)          0.114  0.071
      Lyase (EC 4.-.-.-)              0.025  0.020
      Isomerase (EC 5.-.-.-)          0.010  0.068
      Ligase (EC 6.-.-.-)             0.017  0.017
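The enzyme/non-enzyme call for the two proteins can be read off the probabilities above. The snippet below is a small sketch that does this by picking the more probable class; the numbers are copied from the slide, while the helper `most_probable` and the simple "largest probability wins" rule are assumptions made for illustration.

```python
# Enzyme/nonenzyme probabilities copied from the slide's table.
predictions = {
    "1AOZ": {"Enzyme": 0.773, "Nonenzyme": 0.227},
    "1PLC": {"Enzyme": 0.310, "Nonenzyme": 0.690},
}


def most_probable(class_probs: dict[str, float]) -> str:
    """Return the class label with the highest probability."""
    return max(class_probs, key=class_probs.get)


calls = {protein: most_probable(p) for protein, p in predictions.items()}
for protein, call in calls.items():
    print(protein, call)
```

Under this rule 1AOZ is called an enzyme and 1PLC a non-enzyme, in line with the slide's statement that both are predicted correctly.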
    • Evolution conserves protein features and function
      • Protein features are more conserved between orthologs than paralogs
      • This leads to ProtFun predicting orthologs to be more likely to share function than paralogs
      • That prediction is fully consistent with the notion that it is best to infer function from orthologous proteins
    • ProtFun performance for other organisms
      • Our predictors work in general for eukaryotes
      • Some categories work quite well for prokaryotes
        • Most metabolism categories
        • Transport and binding
      • While other categories fail
        • Energy metabolism
        • Regulatory functions
    • Mapping category performances onto input features
    • Performance contribution of sequence derived features
      • The correlations between features and function are conserved for eukaryotes
      • Some correlations extend to archaea and bacteria
        • Physical/chemical properties
        • Secondary structure and transmembrane helices
      • Other correlations only hold for eukaryotes
        • PTMs and Subcellular localization features
    • Are our classes meaningful?
    • A better classification system: the Gene Ontology
      • Standardized by the Gene Ontology Consortium
      • Proteins can belong to multiple classes
      • Different kinds of function can be annotated:
        • Molecular function
        • Biological process
        • Cellular component
      • GO assigns the “function” at several levels of detail
    • Training of the Gene Ontology predictor
      • GO numbers were assigned to all human SWISS-PROT and TREMBL entries based on matches to InterPro
      • Classes annotated to fewer than 20 different InterPro families were discarded
      • The sequences were split into five sets of equal size where significant similarity exists only within sets, not between sets
      • Using this data set, neural networks were trained in sets of five, constituting a five-fold cross-validation
      • Single feature neural nets were first trained on each remaining category
      • Neural networks using combinations of features were trained on promising categories resulting in 14 good predictors
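The similarity-aware split described above can be sketched as follows: sequences connected by significant similarity are grouped into clusters, and whole clusters are then distributed over the folds. This is one plausible way to do it, not necessarily the procedure used for the talk. The function `split_into_folds` is hypothetical, and the list of similar pairs is assumed to come from some external similarity search with a significance threshold that the slides do not specify.

```python
def split_into_folds(ids, similar_pairs, n_folds=5):
    """Assign sequences to folds so that similar sequences share a fold.

    ids: iterable of sequence identifiers.
    similar_pairs: iterable of (id_a, id_b) judged significantly similar
        (assumed input; how similarity is decided is outside this sketch).
    """
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of similar sequences into one cluster.
    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)

    # Greedy balancing: largest clusters first, each into the smallest fold.
    folds = [[] for _ in range(n_folds)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


toy_ids = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
toy_pairs = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
toy_folds = split_into_folds(toy_ids, toy_pairs, n_folds=3)
print(toy_folds)
```

Because whole clusters are kept together, the folds are only approximately equal in size; each fold can then serve once as the test set in the cross-validation.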
    • Prediction performance on Gene Ontology categories
      • Predicts many pharmaceutically interesting classes
      • 70% of hormones and receptors can be predicted at a false positive rate of only 5%
      • All categories can be predicted with a sensitivity of 50% and 10% rate of false positives
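Statements such as "70% at a false positive rate of 5%" pair a sensitivity with a false positive rate at some score threshold. The sketch below shows how such a number can be computed from predictor scores and true labels. The function `sensitivity_at_fpr` is hypothetical and the toy scores and labels are made up; they are not results from the talk.

```python
def sensitivity_at_fpr(scores, labels, max_fpr=0.05):
    """Highest sensitivity reachable with false positive rate <= max_fpr.

    scores: predictor outputs, higher means more likely positive.
    labels: truthy for true members of the class, falsy otherwise.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    best = 0.0
    # Try every observed score as a threshold (predict positive if >= it).
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Made-up toy data for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
print(sensitivity_at_fpr(toy_scores, toy_labels, max_fpr=0.2))
```

Sweeping `max_fpr` over a range of values traces out the trade-off between sensitivity and false positives for a given category.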
    • Feature usage
      • Transmembrane helices important for prediction of
        • Receptors
        • Transporters
        • Ion channels
      • Subcellular localization good for predicting
        • Receptors
        • Transcription (regulation)
    • Possible novel receptor
      • No BLAST matches against SWISS-PROT with an E-value below 1
      • A Pfam search yielded a questionable match to TGF-beta type III receptors (E-value 0.28)
      • While this match is not significant on its own, it supports the predictions
      ############## ProtFun 2.0 predictions ##############
      >ENSP00000257015
      # Functional category              Prob     Odds
      Amino_acid_biosynthesis            0.021   0.955
      Biosynthesis_of_cofactors          0.032   0.444
      Cell_envelope                   => 0.661  10.836
      Cellular_processes                 0.039   0.534
      Central_intermediary_metabolism    0.042   0.667
      Energy_metabolism                  0.043   0.478
      Fatty_acid_metabolism              0.043   3.308
      Purines_and_pyrimidines            0.164   0.675
      Regulatory_functions               0.014   0.087
      Replication_and_transcription      0.020   0.075
      Translation                        0.033   0.750
      Transport_and_binding              0.834   2.034
      # Enzyme/nonenzyme                 Prob     Odds
      Enzyme                             0.202   0.705
      Nonenzyme                       => 0.798   1.118
      # Enzyme class                     Prob     Odds
      Oxidoreductase (EC 1.-.-.-)        0.055   0.264
      Transferase (EC 2.-.-.-)           0.032   0.093
      Hydrolase (EC 3.-.-.-)             0.077   0.243
      Isomerase (EC 4.-.-.-)             0.020   0.426
      Ligase (EC 5.-.-.-)                0.010   0.313
      Lyase (EC 6.-.-.-)                 0.017   0.334
      # Gene Ontology category           Prob     Odds
      Signal_transducer                  0.493   2.304
      Receptor                        => 0.734   4.318
      Hormone                            0.001   0.154
      Structural_protein                 0.001   0.036
      Transporter                        0.050   0.459
      Ion_channel                        0.035   0.614
      Voltage-gated_ion_channel          0.002   0.091
      Cation_channel                     0.010   0.217
      Transcription                      0.050   0.391
      Transcription_regulation           0.021   0.168
      Stress_response                    0.364   4.136
      Immune_response                    0.477   5.612
      Growth_factor                      0.117   8.357
      Metabolism                         0.142   0.307
      Metal_ion_transport                0.013   0.394
    • Summary
      • A method for prediction of “protein function” has been developed for human proteins
      • This method has been successfully applied to a number of different categorization systems
      • The feature usage of the neural networks is in agreement with current biological knowledge
      • Cross-species tests show that the prediction methods developed on human proteins work for most eukaryotes
      • The evolutionary aspects of “feature space” have been discussed
    • Acknowledgements
      • Other people at CBS
        • David Ussery
        • Marie Skovgaard
        • Ulrik de Lichtenberg
        • Thomas Skøt Jensen
        • Anne Mølgaard
      • The EUCLID team at CNB/CSIC, Madrid
        • Alfonso Valencia
        • Damien Devos
        • Javier Tamames
      • The ProtFun team at CBS
        • Søren Brunak
        • Ramneek Gupta
        • Can Kesmir
        • Kristoffer Rapacki
        • Hans-Henrik Stærfeldt
        • Henrik Nielsen
        • Nikolaj Blom
        • Claus A.F. Andersen
        • Anders Krogh
        • Steen Knudsen
        • Chris Workman
    • Thank you!