Combining large-scale evolutionary analyses with multiple biological data sources to predict human protein function
David Jones, UCL Depts. of Computer Science and Structural and Molecular Biology
Background
In UniProt, 30% of human proteins still have no functional annotations at all … and only 0.5% have completely specific ones for all aspects.
[Figure: diagrams over the three GO aspects MF, BP and CC, with the 30% figure marked]
Main approaches for function annotation
• Annotation transfers by homology, e.g. BLAST, HMMER
  – Only applicable to a subset of the data
  – Has reached a plateau in terms of novel function annotation but provides the highest quality information
• Model-classifier based, using sequence features
  – Limited to common and broad functions for which there are many examples
FFPRED - Function Prediction Pipeline
[Figure: pipeline diagram. Novel amino acid sequence → sequence characteristics (structure, disorder, aa, transmem, motifs, localisation) → classification (GO term SVM) → posterior probability estimate]
Going further – computing gene function from multiple data sources
• FFPRED is a currently available server for human (and vertebrate) proteins
• It works well but is limited to predicting only the functional classes that it was trained to recognize
• Extending the library requires time-consuming training of new SVM models
• It also cannot be applied to rare functional classes due to limited training sets
Desirable features of a new approach
• Able to annotate all sequences
• Able to predict rare functions
• Able to offer something more than simple homology-based approaches
• Amenable to easy and quick updating
FunctionSpace Data Sources for H. sapiens
• Sequence similarity
• Signal peptides and other local features
• Predicted secondary structure
• Transmembrane segments
• Predicted disordered regions
• Domain architecture patterns
• Gene fusion information
• Gene co-expression
• Protein-protein interactions
For each sequence, 49,231 features were derived.
Aim
To estimate the functional similarity (a.k.a. semantic distance) between two human proteins from their sequence features plus available high-throughput data.
[Figure: Protein A and Protein B → Functional Similarity Score]
Large-scale (domain-based) evolutionary features
• Patterns of domain occurrence can provide valuable functional clues
• “Deeper” homology detection allows greater coverage
• We make use of our in-house fold/domain recognition method and several public domain libraries
pDomTHREADER Domain Coverage
[Figure: bar charts comparing public domain annotations (Gene3D, CATH) with threading-based assignments, in terms of residues, sequences (axis 0 to 7,000,000) and domain annotations; percentages shown: 35.7%, 81.6%, 64.8%, 59.4%]
• 37.56% increase in domain annotations across 5.5M sequences
• ~1.7 million novel domain assignments over public domain data
Computational Practicalities
[Figure: workflow on Legion nodes. 5.5M query sequences → PSIBLAST against a 2Gb sequence database (5.5M seqs) → find matches & generate alignments (1 min – 3 hours) → store & post-process]
• “Embarrassingly parallel” application: one sequence = one job.
• Ideal capacity-filling task for a modern supercomputer like Legion.
Gene Fusion Events can Predict Protein-Protein Interactions from Sequence Data
[Figure: gene fusion example. Labels: H1, H2; fumarylacetoacetase; beta-lactamase; bi-functional enzyme; CATH superfamily 3.90.850.10; species Mycobacterium tuberculosis, Mycobacterium paratuberculosis, Mycobacterium avium; functions: hydrolase activity, hydrolysis of C-N bonds, hydrolysis of C-C bonds]
A Novel Gene Fusion Discovered using CATH domain fusion analysis
[Figure: gene fusion example. Labels: Phosphoglyceromutase; DNA repair (RAD50); Alpha-D-Glucose-1,6-Bisphosphate; P-loop nucleotide triphosphate hydrolases; Transcription coupling repair factor; species Saccharopolyspora erythraea, Syntrophomonas wolfei; functions: oxidative stress, D-glucose metabolism, DNA repair]
Novel Gene Fusion Discovery
[Figure: fused domain architectures in Saccharopolyspora erythraea and Syntrophomonas wolfei]
Novel annotations
• Rice PGM1 gene annotated as GO:0006950 response to stress
• PGM3 has a relationship with a DNA repair sequence
Kanazawa K, Ashida H (1991) Relationship between oxidative stress and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
Domain-based features
• Score architectures: 7960 features
• Score complexes: 11210 features
Fusion scoring
Each domain is a feature; the score has 2 components:
1. Prediction quality (logistic transform of feature)
2. Promiscuity weight, related to the number of times the sequence occurs as part of a fused product: w_i = log(fus_i)
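The two components above can be sketched in a few lines. This is a minimal illustration and not the FunctionSpace code: the standard logistic function, the natural logarithm, and the choice to multiply the two components together are all assumptions, since the slide gives only w_i = log(fus_i). The function names are hypothetical.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic transform, mapping a raw feature value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def promiscuity_weight(fusion_count: int) -> float:
    """w_i = log(fus_i), where fus_i is how often the sequence is seen in a fused product.

    The natural log is an assumption; the slide does not state the base.
    """
    if fusion_count < 1:
        raise ValueError("fusion_count must be at least 1")
    return math.log(fusion_count)


def fusion_score(raw_feature: float, fusion_count: int) -> float:
    """Combine the two components. Taking their product is an assumption."""
    return logistic(raw_feature) * promiscuity_weight(fusion_count)
```

Note that with this form a domain seen in only one fused product gets weight log(1) = 0, so how the weight enters the final score matters a great deal and should be checked against the real method.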
Integration of “External” Features: Microarray Expression Data
[Figure: normalised microarray datasets. Probe signal (log2) for Gene A and Gene B across 16 experiments (conditions), compared by Pearson correlation (R)]
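As an illustration of the co-expression measure named on this slide, the sketch below computes the Pearson correlation between two genes' probe signals across the same set of experiments. It is a plain-Python version for clarity; the function name and the input layout (one list of log2 signals per gene) are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the two genes rise and fall together across conditions; a value near -1 means they are anti-correlated.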
Biclustering Microarray Expression Data
[Figure: two example biclusters, zinc binding sequences and a set of transcription factors, with global correlations of 0.42 and 0.48]
23912 features generated from biclustering of 2346 publicly available microarrays (81 experiments) using the BIMAX algorithm
FunctionSpace: Two-stage Integration of Data
[Figure: feature vectors for Protein A and Protein B feed a first stage of feature-group SVMs (SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMfsc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi), whose outputs are combined into a Functional Similarity Score]
A 3-D Projection of Annotated Human Proteins
• 49,231 dimensions first reduced to 11 dimensions by SVM regression with 11 different groups of features
• Each protein is here represented as a point in this derived 11-D feature space projected into 3-D
• Colouring is according to functional similarity, which shows that proteins with similar functions (warmer colours) cluster strongly in this space
• 75% of nearest neighbour pairs share common GO terms
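A statistic like the last bullet can be computed as sketched below: for every protein, find its nearest neighbour in the reduced feature space and check whether the pair shares at least one GO term. Euclidean distance, the brute-force search and the function names are assumptions made for illustration; the slide does not say how neighbours were defined.

```python
import math


def nearest_neighbour(i, points):
    """Index of the point closest to points[i] (Euclidean distance, brute force)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def shared_term_fraction(points, annotations):
    """Fraction of proteins whose nearest neighbour shares at least one GO term.

    points      -- list of coordinate tuples (e.g. 11-D), one per protein
    annotations -- list of sets of GO term IDs, in the same order as points
    """
    hits = 0
    for i in range(len(points)):
        j = nearest_neighbour(i, points)
        if annotations[i] & annotations[j]:
            hits += 1
    return hits / len(points)
```

The brute-force search is quadratic in the number of proteins, which is fine for a sketch but would likely be replaced by a spatial index at proteome scale.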
Function Annotation Results for 20674 Unannotated IPI Human Sequences
Each sequence is classed “Easy”, “Medium” or “Hard” depending on degree of homology to functionally annotated proteins in UniProt.
Preliminary Results
In 2009 FunctionSpace produced GO term predictions for 19678 IPI uncharacterized human sequences. 2746 have been annotated since.

Measure                   MF      BP
% Exact matches           16%     9%
Mean semantic distance    -1.3    -1.7

[Figure: for MF and BP, a scale running from less specific to more specific]
Initial considerations for CAFA
• 50,000 sequences
• 11 eukaryotic & 7 prokaryotic species
• High specificity annotations needed
• Partial descriptive text already in Swiss-Prot/UniProt for some entries
• FFPRED/FunctionSpace would not be enough
• Need to incorporate textual information from databases and comprehensive homology (orthology)-derived labels
• Need to get all this working in a few months!
Best Laid Plans for CAFA
• Plan A
  – Build separate annotation pipelines for missing data
  – Calibrate each pipeline according to precision values derived from a benchmark on 500 highly annotated Swiss-Prot entries
  – Combine pipeline annotations using a high-level classifier (SVM or Naive Bayes)
• Plan B
  – No time to build a high-level classifier!
  – Combine annotation sources using a heuristic graphical approach
• Hope for the best! (and expect the worst...)
GO term prediction from Swiss-Prot text-mining
• For targets which already had descriptive text, keywords or comments in Swiss-Prot, GO terms were assigned using a naive Bayes text-classification approach
• Single words and groups of 2 and 3 words were counted
• Words occurring in different Swiss-Prot record types were distinguished in the analysis, and some simple pre-parsing of feature (FT) records was carried out in addition
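The text-mining idea above can be illustrated with a small multinomial naive Bayes classifier over word 1-, 2- and 3-grams, where each n-gram is prefixed with its record type so that the same word in different record types counts as a different feature. This is a simplified sketch under several assumptions: it picks a single best GO term per text (the real task is multi-label), it uses Laplace smoothing, and the class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(record_type, text, max_n=3):
    """Word n-grams (n = 1..max_n), each tagged with the record type it came from."""
    words = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            feats.append(f"{record_type}:{' '.join(words[i:i + n])}")
    return feats


class NaiveBayesGO:
    """Multinomial naive Bayes with Laplace smoothing; one class per GO term."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.term_docs = Counter()               # training examples per GO term
        self.feat_counts = defaultdict(Counter)  # feature counts per GO term
        self.vocab = set()

    def train(self, go_term, features):
        self.term_docs[go_term] += 1
        self.feat_counts[go_term].update(features)
        self.vocab.update(features)

    def log_score(self, go_term, features):
        """Log prior plus summed log likelihoods of the known features."""
        total_docs = sum(self.term_docs.values())
        score = math.log(self.term_docs[go_term] / total_docs)
        counts = self.feat_counts[go_term]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for feat in features:
            if feat in self.vocab:  # features never seen in training are ignored
                score += math.log((counts[feat] + self.alpha) / denom)
        return score

    def predict(self, features):
        """Return the GO term with the highest score."""
        return max(self.term_docs, key=lambda t: self.log_score(t, features))
```

For example, after training on keyword (KW) text for two terms, `model.predict(ngrams("KW", "protein kinase"))` returns whichever term's training text made those n-grams most likely.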
Homology-based annotation sources
• PSI-BLAST searches against UniProt
  – Low E-value threshold to ensure close homologues used for annotation transfer
  – Alignment length threshold to avoid domain problem
• Transfer of annotations from orthologues
  – EggNOG 2.0
  – More reliable GO term transfer than for PSI-BLAST but lower coverage
• Profile-profile searches against Swiss-Prot
  – Low reliability transfer from very distant homologues
  – Improves coverage where needed (at expense of specificity)
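The two PSI-BLAST thresholds above can be sketched as a simple filter on search hits. The threshold values (1e-6 and 80% coverage), the coverage definition and the `Hit` record are all assumptions for illustration; the slide states only that a low E-value threshold and an alignment length threshold were used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search hit (hypothetical record layout)."""
    subject_id: str
    evalue: float
    align_len: int
    query_len: int
    subject_len: int
    go_terms: frozenset


def transferable_terms(hits, max_evalue=1e-6, min_coverage=0.8):
    """Collect GO terms from hits that are both close and near full-length.

    Coverage is measured against the longer of the two sequences, so a hit that
    matches only one domain of a multi-domain protein is rejected.
    """
    terms = set()
    for hit in hits:
        coverage = hit.align_len / max(hit.query_len, hit.subject_len)
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            terms |= hit.go_terms
    return terms
```

Requiring coverage of the longer sequence is one way to address the domain problem: without it, annotations belonging to an unaligned domain of the hit could be transferred to the query.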
Heuristic back-propagation of precision estimates
• Back-propagation of precision estimates
• Repeated for each annotation source to define a consensus for each node
P' = 1 - (1 - P)(1 - Q)
Final steps
• After back-propagation, all referenced GO terms are ranked according to final confidence scores
• To reduce conflicting annotations, pairs of terms with zero observed co-occurrence frequency in GOA are subjected to pairwise tournament selection
• Results submitted to server using the mouse-window-cut-paste-click-submit algorithm
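One possible reading of the ranking and tournament steps is sketched below: terms are sorted by final confidence, and whenever two terms have never been observed together, the lower-scoring one is dropped. The greedy procedure, the representation of conflicts as a set of pairs, and the function name are assumptions; the slide does not describe the tournament in detail.

```python
def resolve_conflicts(scores, conflicts):
    """Rank terms by confidence and drop the loser of each conflicting pair.

    scores    -- dict mapping GO term -> final confidence score
    conflicts -- set of frozenset pairs with zero observed co-occurrence
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = []
    for term in ranked:
        # Keep a term only if it does not conflict with a higher-ranked kept term
        if all(frozenset((term, winner)) not in conflicts for winner in kept):
            kept.append(term)
    return kept
```

With scores A = 0.9, B = 0.8, C = 0.7 and a single conflicting pair {A, B}, the result keeps A and C in that order.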
CASP vs CAFA from a Predictor’s Point of View
• Number of targets
  – Manual vs automated approaches
• Difficulty of targets
  – A major limit in driving CASP forwards
• Assessment
  – Hard to pre-judge impact of decisions made during prediction season
• Tools for the community
  – Standards and methods in CASP have been very useful
• Getting the word out to the wider community
Acknowledgements
Anna Lobley, Domenico Cozzetto, Daniel Buchan, Kevin Bryson, Christine Orengo