Vienna afp2011

Combining large-scale evolutionary analyses
with multiple biological data sources to predict
human protein function

David Jones
UCL Depts. of Computer Science and Structural and Molecular Biology

Background
In Uniprot, 30% of human proteins still have no functional annotations at all …
… and only 0.5% have completely specific ones for all aspects.

[Figure: two diagrams of annotation coverage across the MF, BP and CC aspects, one labelled 30%]

Main approaches for function annotation
• Annotation transfers by homology
e.g. BLAST, HMMER
Only applicable to a subset of the data
Has reached a plateau in terms of novel function
annotation but provides highest quality information

• Model-classifier based using sequence features
Limited to common and broad functions for which there
are many examples

FFPRED - Function Prediction Pipeline
Novel amino acid sequence
→ Characteristics: structure, disorder, amino acids (aa), transmembrane segments, motifs, localisation
→ Classification: SVM for a GO term
→ Posterior probability estimate

Going further – computing gene function
from multiple data sources
• FFPRED is a currently available server for human
(and vertebrate) proteins

• It works well but is limited to predicting only the
functional classes that it was trained to recognize
• Extending the library requires time consuming
training of new SVM models
• It also cannot be applied to rare functional classes
due to limited training sets

Desirable features of a new approach

• Able to annotate all sequences

• Able to predict rare functions

• Able to offer something more than simple
homology-based approaches

• Amenable to easy and quick updating

FunctionSpace Data Sources for H. sapiens

• Sequence similarity
• Signal peptides and other local features
• Predicted secondary structure
• Transmembrane segments
• Predicted disordered regions
• Domain architecture patterns
• Gene fusion information
• Gene co-expression
• Protein-protein interactions

For each sequence 49,231 features were derived

Aim
To estimate the functional similarity (a.k.a. semantic distance)
between two human proteins from their sequence features
plus available high throughput data.

Protein A + Protein B → Functional Similarity Score

Large-scale (domain-based) evolutionary
features

• Patterns of domain occurrence can provide
valuable functional clues

• “Deeper” homology detection allows greater
coverage

• We make use of our in-house fold/domain
recognition method and several public domain
libraries

pDomTHREADER Domain Coverage

[Figure: charts of residue and sequence coverage for Gene3d and threading (percentages shown: 35.7%, 81.6%, 64.8%, 59.4%), and a bar chart of CATH domain annotations (0 to 7,000,000) for public domain versus threading assignments]

37.56% increase in domain annotations across 5.5M sequences
~1.7 million novel domain assignments over public domain data

Computational Practicalities
5.5M query sequences
→ PSI-BLAST on Legion nodes against a 2Gb sequence database (5.5M seqs): find matches and generate alignments (1 min to 3 hours)
→ Store and post-process

“Embarrassingly parallel” application: one sequence = one job.
Ideal capacity filling task for a modern supercomputer like Legion.
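The "one sequence = one job" decomposition can be sketched as follows. This is only an illustration, not the pipeline that ran on Legion: the function names `read_fasta` and `make_jobs`, the job dictionary layout and the database name are all assumptions, and the search step itself is left out.

```python
from typing import Dict, Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_jobs(fasta_text: str, database: str) -> List[Dict[str, str]]:
    """Create one independent job description per query sequence."""
    return [
        {"query_id": seq_id, "sequence": seq, "database": database}
        for seq_id, seq in read_fasta(fasta_text)
    ]


example = ">seqA demo\nMKTAYIAK\nQRQISFVK\n>seqB\nMSDNE\n"
jobs = make_jobs(example, database="sequence_db")  # hypothetical database name
print(len(jobs), [job["query_id"] for job in jobs])
```

Because no job depends on another, a scheduler can hand these descriptions to whichever nodes are free, which is what makes the task a good capacity filler.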

Gene Fusion Events can Predict Protein-
Protein Interactions from Sequence Data
Separate proteins:
• H1: 3.90.850.10 (fumarylacetoacetase)
• H2: 3.60.15.10 (beta-lactamase)

Bi-functional enzyme (3.90.850.10 + 3.60.15.10) found in:
• Mycobacterium tuberculosis
• Mycobacterium paratuberculosis
• Mycobacterium avium

Hydrolase activity: hydrolysis of C-N bonds; hydrolysis of C-C bonds
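The idea on this slide, that two domains fused in one organism suggest an interaction between the separate proteins carrying them elsewhere, can be sketched as below. This is a toy version under stated assumptions: the protein name `fused_enzyme` and both function names are hypothetical, only the CATH codes come from the slide, and no scoring is applied.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Toy domain architectures: protein -> CATH superfamily codes.
# "fused_enzyme" is a hypothetical name for the bi-functional enzyme.
ARCHITECTURES: Dict[str, List[str]] = {
    "H1": ["3.90.850.10"],
    "H2": ["3.60.15.10"],
    "fused_enzyme": ["3.90.850.10", "3.60.15.10"],
}


def fused_domain_pairs(architectures: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Return unordered domain pairs that co-occur within a single protein."""
    pairs = set()
    for domains in architectures.values():
        for a, b in combinations(sorted(set(domains)), 2):
            pairs.add((a, b))
    return pairs


def predict_interactions(architectures: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Pair up proteins whose (non-shared) domains are found fused elsewhere."""
    fused = fused_domain_pairs(architectures)
    predictions = []
    for p, q in combinations(sorted(architectures), 2):
        dp, dq = set(architectures[p]), set(architectures[q])
        if dp & dq:
            continue  # proteins sharing a domain (e.g. the fusion itself) are skipped
        if any(tuple(sorted((a, b))) in fused for a in dp for b in dq):
            predictions.append((p, q))
    return predictions


print(predict_interactions(ARCHITECTURES))
```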

A Novel Gene Fusion Discovered using CATH
domain fusion analysis

Domains involved:
• 3.40.120.10 (Alpha-D-Glucose-1,6-Bisphosphate): Phosphoglyceromutase; D-glucose metabolism
• 3.40.50.300 (P-loop nucleotide triphosphate hydrolases): DNA repair (RAD50); DNA repair

Other labels on the diagram: 3.40.120.10, Transcription coupling repair factor; Oxidative stress

Fused architecture (3.40.120.10 + 3.40.50.300) found in:
• Saccharopolyspora erythraea
• Syntrophomonas wolfei

Novel Gene Fusion Discovery

3.40.120.10 3.40.50.300 3.40.50.300
Saccharopolyspora erythraea
3.40.50.300 3.40.120.10
Syntrophomonas wolfei

Novel annotations

• Rice PGM1 gene annotated as GO:0006950 response to
stress
• PGM3 has a relationship with a DNA repair sequence

Kanazawa K, Ashida H (1991) Relationship between oxidative stress
and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225

Domain based features

• Score architectures
• Score complexes

7960 features
11210 features

Fusion scoring

Each domain is a feature; its score has 2 components:

1. Prediction quality (logistic transform of feature)
2. Promiscuity weight, related to the number of times the sequence occurs as part of a fused product: $w_i = \log \mathrm{fus}_i$
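The two components above can be illustrated with a short sketch. Several things here are assumptions rather than the method used in FunctionSpace: the logistic parameters `midpoint` and `steepness`, the use of the natural logarithm, and in particular the choice to combine the two components by multiplication, which the slide does not specify.

```python
import math


def logistic(x: float, midpoint: float = 0.0, steepness: float = 1.0) -> float:
    """Logistic transform of a raw feature value (parameters are hypothetical)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def promiscuity_weight(fusion_count: int) -> float:
    """w_i = log(fus_i), where fus_i counts occurrences in fused products."""
    if fusion_count < 1:
        raise ValueError("fusion_count must be at least 1")
    return math.log(fusion_count)  # ASSUMPTION: natural logarithm


def fusion_score(feature_value: float, fusion_count: int) -> float:
    """Combine both components. ASSUMPTION: a simple product."""
    return logistic(feature_value) * promiscuity_weight(fusion_count)


print(fusion_score(feature_value=2.0, fusion_count=10))
```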

Integration of “External” Features:
Microarray Expression Data
[Figure: probe signal (log2, 0 to 14) for Gene A and Gene B across 16 experiments (conditions) in normalised microarray datasets; the two profiles are compared by Pearson correlation (R)]
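The co-expression measure named on this slide, the Pearson correlation between two genes' expression profiles, can be computed as in the sketch below. The function name `pearson_r` and the example profiles are made up for illustration; they are not data from the talk.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical log2 probe signals for two genes over six conditions.
gene_a = [6.1, 7.4, 9.0, 8.2, 5.5, 10.3]
gene_b = [5.8, 7.9, 9.4, 8.0, 6.1, 11.0]
print(round(pearson_r(gene_a, gene_b), 3))
```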

Biclustering Microarray Expression Data

• Zinc binding sequences: global correlation 0.42
• A set of transcription factors: global correlation 0.48

23912 features generated from biclustering of 2346 publicly available microarrays
(81 experiments) using BIMAX algorithm

FunctionSpace: Two-stage Integration of Data

Feature vectors for Protein A and Protein B
→ Stage 1: SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi
→ Stage 2: SVMfsc
→ Functional Similarity Score

A 3-D Projection of Annotated Human Proteins

• 49,231 dimensions first
reduced to 11 dimensions by
SVM regression with 11
different groups of features
• Each protein is here
represented as a point in this
derived 11-D feature space
projected into 3-D
• Colouring is according to
functional similarity which
shows that proteins with similar
functions (warmer colours)
cluster strongly in this space
• 75% of nearest neighbour pairs
share common GO terms
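A statistic like the nearest-neighbour figure above could be computed along the following lines. This is a sketch on toy data: the points are 2-D rather than 11-D, the term labels are placeholders, and Euclidean distance is an assumption, since the slide does not state the metric.

```python
import math
from typing import List, Sequence, Set


def nearest_neighbour(index: int, points: Sequence[Sequence[float]]) -> int:
    """Index of the point closest to points[index] (Euclidean distance)."""
    best, best_distance = -1, math.inf
    for j, other in enumerate(points):
        if j == index:
            continue
        distance = math.dist(points[index], other)
        if distance < best_distance:
            best, best_distance = j, distance
    return best


def fraction_sharing_terms(points: Sequence[Sequence[float]],
                           terms: List[Set[str]]) -> float:
    """Fraction of proteins whose nearest neighbour shares at least one term."""
    shared = sum(
        1 for i in range(len(points))
        if terms[i] & terms[nearest_neighbour(i, points)]
    )
    return shared / len(points)


# Toy coordinates and placeholder annotation labels.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
terms = [{"term_a"}, {"term_a", "term_b"}, {"term_c"}, {"term_d"}]
print(fraction_sharing_terms(points, terms))
```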

Individual Feature Contributions
Matthews Correlation Coefficient
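The Matthews Correlation Coefficient named on this slide is computed from the four cells of a binary confusion matrix. A minimal sketch follows; the function name and the example counts are illustrative, and returning 0.0 when the denominator is zero is a common convention rather than something stated in the talk.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


print(matthews_corrcoef(tp=40, tn=45, fp=5, fn=10))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).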

Function Annotation Results for 20674
Unannotated IPI Human Sequences

Each sequence is classed “Easy”, “Medium” or “Hard” depending on
degree of homology to functionally annotated proteins in UNIPROT.

Preliminary Results
In 2009 FunctionSpace produced GO term predictions for 19678 IPI
uncharacterized human sequences. 2746 have been annotated since.
Measure                  MF      BP
% Exact matches          16%     9%
Mean semantic distance   -1.3    -1.7

[Figure: MF and BP charts, each with an axis running from less specific to more specific]

Initial considerations for CAFA

• 50,000 sequences
• 11 eukaryotic & 7 prokaryotic species
• High specificity annotations needed
• Partial descriptive text already in Swiss-Prot/Uniprot for some
entries

• FFPRED/FunctionSpace would not be enough

• Need to incorporate textual information from databases
and comprehensive homology(orthology)-derived labels

• Need to get all this working in a few months!

Best Laid Plans for CAFA

• Plan A
– Build separate annotation pipelines for missing data
– Calibrate each pipeline according to precision values derived from
benchmark on 500 highly annotated Swiss-Prot entries
– Combine pipeline annotations using high-level classifier (SVM or Naive
Bayes)

• Plan B
– No time to build high-level classifier!
– Combine annotation sources using heuristic graphical approach

• Hope for the best!
(and expect the worst...)

GO term prediction from Swiss-Prot
text-mining

• For targets which already had
descriptive text, keywords or
comments in Swiss-Prot, GO terms
were assigned using a naive Bayes
text-classification approach
• Single words and groups of 2 and 3
words were counted
• Words occurring in different Swiss-Prot
record types were distinguished in the
analysis, and some simple pre-parsing
of feature (FT) records was carried out
in addition.
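The text-mining step above can be illustrated with a small multinomial naive Bayes classifier over 1-, 2- and 3-word groups. This is a simplified sketch, not the classifier used for CAFA: the class name, the Laplace smoothing parameter `alpha`, the toy training texts and the single-label output are all assumptions, and the per-record-type distinction described on the slide is omitted.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def ngrams(text: str, max_n: int = 3) -> List[str]:
    """Single words plus groups of 2 and 3 consecutive words."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


class NaiveBayesTextClassifier:
    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha  # Laplace smoothing (assumed value)
        self.label_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: set = set()

    def fit(self, documents: List[str], labels: List[str]) -> "NaiveBayesTextClassifier":
        for text, label in zip(documents, labels):
            features = ngrams(text)
            self.label_counts[label] += 1
            self.feature_counts[label].update(features)
            self.vocabulary.update(features)
        return self

    def log_scores(self, text: str) -> Dict[str, float]:
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # class prior
            for feature in ngrams(text):
                if feature in self.vocabulary:  # unseen n-grams are ignored
                    score += math.log((counts[feature] + self.alpha) / denominator)
            scores[label] = score
        return scores

    def predict(self, text: str) -> str:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training texts standing in for Swiss-Prot descriptions.
classifier = NaiveBayesTextClassifier().fit(
    ["ATP binding kinase domain", "zinc finger DNA binding",
     "serine threonine kinase", "transcription factor DNA binding"],
    ["kinase activity", "DNA binding", "kinase activity", "DNA binding"],
)
print(classifier.predict("protein kinase"))
```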

Homology-based annotation sources

• PSI-BLAST searches against Uniprot
– Low E-value threshold to ensure close homologues used for
annotation transfer
– Alignment length threshold to avoid domain problem
• Transfer of annotations from orthologues
– EggNOG 2.0
– More reliable GO term transfer than for PSI-BLAST but lower
coverage
• Profile-profile searches against Swiss-Prot
– Low reliability transfer from very distant homologues
– Improves coverage where needed (at expense of specificity)
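The PSI-BLAST transfer rule above (a low E-value threshold plus an alignment length threshold) can be sketched as a simple filter over search hits. The `Hit` record, the threshold values `max_evalue` and `min_coverage`, and the coverage definition are hypothetical; the slide does not give the actual cut-offs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Hit:
    subject_id: str
    evalue: float
    alignment_length: int
    query_length: int
    subject_length: int
    go_terms: FrozenSet[str]


def transferable_terms(hits: Iterable[Hit],
                       max_evalue: float = 1e-6,
                       min_coverage: float = 0.8) -> Set[str]:
    """Collect GO terms from close, near full-length homologues only."""
    terms: Set[str] = set()
    for hit in hits:
        # Coverage of the longer sequence guards against single-domain matches.
        coverage = hit.alignment_length / max(hit.query_length, hit.subject_length)
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            terms |= hit.go_terms
    return terms


# Toy hits with placeholder term labels.
hits = [
    Hit("close", 1e-50, 290, 300, 310, frozenset({"term_a"})),
    Hit("domain_only", 1e-30, 90, 300, 800, frozenset({"term_b"})),
    Hit("weak", 0.5, 295, 300, 300, frozenset({"term_c"})),
]
print(transferable_terms(hits))
```

Here the second hit is rejected because it aligns to only a small part of a much longer protein, which is the "domain problem" the alignment length threshold is meant to avoid.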

Heuristic back-propagation of precision
estimates

• Back-propagation of precision estimates, repeated for each annotation source
• Back-propagation to define a consensus for each node:

$P' = 1 - (1 - P)(1 - Q)$
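A rough sketch of how such a heuristic might work on a small GO-like graph is given below. Only the combination rule P' = 1 - (1 - P)(1 - Q) comes from the slide. The toy graph, the term names, the choice to give each ancestor the maximum precision of its descendants within one source, and the use of the combination rule across sources are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Toy ontology: term -> list of parent terms (placeholder names).
PARENTS: Dict[str, List[str]] = {
    "leaf_term": ["mid_term"],
    "mid_term": ["root_term"],
    "root_term": [],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All terms reachable by following parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def combine(p: float, q: float) -> float:
    """P' = 1 - (1 - P)(1 - Q)."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def back_propagate(scores: Dict[str, float],
                   parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Push each term's precision up to its ancestors (ASSUMPTION: take the max)."""
    propagated = dict(scores)
    for term, precision in scores.items():
        for ancestor in ancestors(term, parents):
            propagated[ancestor] = max(propagated.get(ancestor, 0.0), precision)
    return propagated


def consensus(sources: List[Dict[str, float]],
              parents: Dict[str, List[str]]) -> Dict[str, float]:
    """Combine the propagated estimates of all annotation sources per node."""
    result: Dict[str, float] = {}
    for scores in sources:
        for term, precision in back_propagate(scores, parents).items():
            result[term] = combine(result.get(term, 0.0), precision)
    return result


# Two hypothetical annotation sources with precision estimates.
final_scores = consensus([{"leaf_term": 0.6}, {"mid_term": 0.5}], PARENTS)
print(final_scores)
```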

Final steps
• After back-propagation, all referenced GO terms
are ranked according to final confidence scores

• To reduce conflicting annotations, pairs of terms
with zero observed co-occurrence frequency in
GOA are subjected to pairwise tournament
selection.

• Results submitted to server using the
mouse-window-cut-paste-click-submit
algorithm
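The conflict-reduction step above could look something like the following sketch. The slide does not describe the tournament in detail, so this uses a greedy variant as a stand-in: terms are visited in order of decreasing confidence, and a term is kept only if it has a non-zero co-occurrence count with every term already kept. The term labels and counts are placeholders, and a pair missing from the table is treated as never co-occurring.

```python
from typing import Dict, FrozenSet, List


def resolve_conflicts(scores: Dict[str, float],
                      cooccurrence: Dict[FrozenSet[str], int]) -> List[str]:
    """Keep the higher-scoring term of any pair that never co-occurs."""
    survivors: List[str] = []
    for term in sorted(scores, key=scores.get, reverse=True):
        compatible = all(
            cooccurrence.get(frozenset((term, kept)), 0) > 0
            for kept in survivors
        )
        if compatible:
            survivors.append(term)
    return survivors


# Hypothetical confidence scores and co-occurrence counts.
scores = {"term_a": 0.9, "term_b": 0.7, "term_c": 0.4}
cooccurrence = {
    frozenset(("term_a", "term_c")): 12,
    frozenset(("term_b", "term_c")): 3,
    # term_a and term_b have zero observed co-occurrence
}
print(resolve_conflicts(scores, cooccurrence))
```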

CASP vs CAFA from a Predictor’s Point of
View
• Number of targets
– Manual vs automated approaches
• Difficulty of targets
– A major limit in driving CASP forwards
• Assessment
– Hard to pre-judge impact of decisions made during
prediction season
• Tools for the community
– Standards and methods in CASP have been very useful
• Getting the word out to the wider community

Acknowledgements

Anna Lobley
Domenico Cozzetto
Daniel Buchan

Kevin Bryson
Christine Orengo
