Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities

Fiona Brinkman
Department of Molecular Biology and Biochemistry
(Associate, Faculty of Health Sciences and School of Computing Sciences)
Simon Fraser University
Greater Vancouver, BC, Canada
April 2014

•  Improving sequence similarity/orthology-based predictions: a keystone of many predictors
•  Improving pathway/network-based analysis to identify protein functions
•  Future challenges and opportunities (using protein localization as an example of what is to come)

What we MUST do to move automated function prediction (AFP) forward…

One-to-one orthologs are, in particular, more functionally similar to each other than other orthologs or paralogs are, when sequence identity is >80%

Functional similarity measured by GO annotation similarity (13 species)

Altenhoff AM et al. PLoS Comput Biol. 2012


If true ortholog is missing… (gene loss, or incomplete genome)

[Figure: a species tree and gene trees for Ingroup1, Ingroup2 and Outgroup. Labels: "Usual divergence"; "One of the orthologous genes diverges faster…"; "RBBH (Reciprocal Best BLAST Hit) FAIL", where the RBBH pairs a paralog.]

Ortholuge uses phyletic ratios to differentiate Supporting Species Divergence (SSD) orthologs vs proteins more divergent than expected (non-SSD)

Ratio1 = distance{ingroup1-ingroup2} / distance{ingroup1-outgroup}

Ratio2 = distance{ingroup1-ingroup2} / distance{ingroup2-outgroup}

[Figure: Ortholuge analysis comparing Burkholderia cepacia & B. cenocepacia (outgroup: B. pseudomallei); regions labelled SSD and Non-SSD.]

Whiteside et al. 2013, PMID 23203876
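The two ratios above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names, the example distances and the cutoff of 1.0 are assumptions made here, not Ortholuge's actual implementation or thresholds.

```python
def ortholuge_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Return (ratio1, ratio2) from three pairwise phylogenetic distances."""
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("ingroup-outgroup distances must be positive")
    ratio1 = d_in1_in2 / d_in1_out  # ingroup distance relative to ingroup1-outgroup
    ratio2 = d_in1_in2 / d_in2_out  # ingroup distance relative to ingroup2-outgroup
    return ratio1, ratio2


def flag_divergence(ratio1, ratio2, cutoff=1.0):
    """Hypothetical rule: call a pair non-SSD if either ratio reaches the cutoff.

    The cutoff value is an illustrative assumption, not a published threshold.
    """
    return "SSD" if max(ratio1, ratio2) < cutoff else "non-SSD"


# Made-up distances: the ingroup pair is closer to each other than to the outgroup.
r1, r2 = ortholuge_ratios(0.25, 1.0, 0.5)
print(r1, r2, flag_divergence(r1, r2))
```

A pair whose ingroup distance is large relative to its outgroup distances gets a high ratio, which is the signal the slide describes as "more divergent than expected".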

[Figure: bar chart, "Proportion Predicted Orthologs in 600 Pairs of Bacterial Species" (axis 0.000 to 1.000), comparing SSD orthologs vs non-SSD for four annotation types: KEGG Orthology, Pfam Domains, Tigrfam Annotations, Subcellular Localizations. * p-value < 0.05]

[Figure: bar chart of the proportion (axis 0 to 0.7) of SSD orthologs vs non-SSD that have one or more homologs (based on BLAST hits). * p-value < 0.05]

Non-SSD “orthologs” are more likely to:
- Be functionally dissimilar
- Have one or more homologs

OrtholugeDB: A Database of Ortholuge Evaluations
(tinyurl.com/ortholugeDB)

•  Provides pre-computed ortholog predictions for >1400 bacteria and archaea (update coming next month!), with further Ortholuge assessments
•  Covers all genes in fully sequenced bacterial and archaeal genomes
•  Facilitates visualization and evaluation of ortholog predictions

Similar issue with initial metagenomics sequence functional evaluation

1.  Simulated reads from Pseudomonas aeruginosa PAO1
2.  Created databases at different levels of clade exclusion
    •  E.g. for species clade exclusion, removed all Pseudomonas aeruginosa genomes from the database
3.  Used RAPSearch2 and MEGAN5 to assign functional categories to the simulated reads
4.  Calculated the proportion of reads assigned to each functional category relative to how many reads were expected
    •  E.g.:

Category             Expected # assigned   Actual # assigned   Relative proportion
Membrane Transport   567                   583                 1.02822
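The relative proportion in the example row is simply the number of reads actually assigned divided by the number expected. A minimal sketch, using only the Membrane Transport row from the table (the function name is an assumption made for illustration):

```python
def relative_proportion(expected, actual):
    """Ratio of reads actually assigned to a category vs. reads expected."""
    if expected <= 0:
        raise ValueError("expected read count must be positive")
    return actual / expected


# Values taken from the example row in the table above.
counts = {"Membrane Transport": (567, 583)}
for category, (expected, actual) in counts.items():
    print(f"{category}: {relative_proportion(expected, actual):.5f}")
```

A value near 1 means the category is predicted about as often as expected; values notably above 1 indicate overprediction.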

Most functional categories are predicted well but some are overpredicted (ratio notably >1)

[Figure: bar chart of the ratio of assigned relative to expected reads (axis 0 to 2.5) per functional category, by level of clade exclusion: None, Species, Family, Class.]

E.g. Endocrine system: 3 problematic orthology groups, all with high numbers of proteins (one has 3538 when the median is 54!)

The relative proportions of functional categories stay relatively consistent as clade exclusion level increases

[Figure: stacked bar chart of the proportion of reads assigned (0% to 100%) at each clade exclusion level (None, Species, Family, Class). Categories shown: Xenobiotics Biodegradation and Metabolism; Transcription; Signal Transduction; Replication and Repair; Infectious Diseases; Nucleotide Metabolism; Neurodegenerative Diseases; Metabolism of Other Amino Acids; Metabolism of Cofactors and Vitamins; Membrane Transport; …]

Improving pathway-based analysis

Issue: Biomolecular pathway classifications can bias analyses of pathways found to be upregulated or downregulated by transcriptome (or other omics-level) analysis

What you identify depends on how everything is classified…

Need better “signatures” of pathways…

Dealing with PART of the issue…

[Figure: distribution of the number of associated pathways for human genes in KEGG; segments labelled 1, 2, 3, 4, 5, 6 and 7-45.]

Membership of a gene in multiple pathways is the norm, not the exception…

Foroushani et al, 2014 PMCID: PMC3883547
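A distribution like the one described above can be tallied from any pathway-to-gene mapping. The sketch below uses a made-up toy mapping (the pathway and gene names are hypothetical, not KEGG data) to show the counting step only.

```python
from collections import Counter


def pathways_per_gene(pathway_to_genes):
    """Count, for each gene, how many pathways list it as a member."""
    counts = Counter()
    for genes in pathway_to_genes.values():
        counts.update(set(genes))  # set() guards against duplicate entries
    return counts


def membership_distribution(pathway_to_genes):
    """Map 'number of pathways' -> 'number of genes with that many pathways'."""
    return Counter(pathways_per_gene(pathway_to_genes).values())


# Hypothetical toy annotation, for illustration only.
toy = {"P1": {"g1", "g2"}, "P2": {"g2", "g3"}, "P3": {"g2"}}
print(dict(membership_distribution(toy)))
```

In the toy data, g2 belongs to three pathways while g1 and g3 belong to one each, so two genes fall in the "1" bin and one gene in the "3" bin.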

Not all genes are equal…

[Figure: pathway membership of genes. Maroon: pathway member; White: no membership.]

Not all genes are equivalent signatures of a given pathway

Foroushani et al, 2014 PMCID: PMC3883547
Example: Treated vs Untreated Mouse Severe Inflammation (Gene Expression Dataset)

Standard Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) treat all genes in a given pathway as equal indicators that that pathway is significant.
→ Emphasizes generalist genes/pathways

Individual Gene ORA
•  Antigen processing and presentation
•  Graft-versus-host disease
•  Natural killer cell mediated cytotoxicity
•  Viral myocarditis
•  Allograft rejection
•  Cell adhesion molecules (CAMs)
•  Chemokine signaling pathway
•  Type I diabetes mellitus
•  Toll-like receptor signaling pathway
•  Cytokine-cytokine receptor interaction
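For readers unfamiliar with ORA, the test is commonly computed as a one-sided hypergeometric tail probability in which every pathway gene counts equally. The sketch below is a generic illustration of that idea, with hypothetical parameter names and toy numbers; it is not the specific tool or settings used for the list above.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, of which n_pathway belong to the pathway."""
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 3 in the pathway, 3 selected, all 3 overlap.
print(ora_p_value(10, 3, 3, 3))
```

Because each gene contributes the same weight, a gene shared by many pathways raises the score of all of them, which is the bias toward generalist genes/pathways noted above.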

Pathway Signatures using SIGORA: Identifying genes/gene pairs uniquely associated with a single pathway

SIGORA identifies statistically significant enrichment of Pathway Signatures in a gene list of interest.

Example: Treated vs Untreated Mouse Severe Inflammation (Gene Expression Dataset)

Individual Gene ORA                         SIGORA
Antigen processing and presentation         Antigen processing and presentation
Graft-versus-host disease                   Natural killer cell mediated cytotoxicity
Natural killer cell mediated cytotoxicity   Complement and coagulation cascades
Viral myocarditis                           Toll-like receptor signaling pathway
Allograft rejection                         Cytokine-cytokine receptor interaction
Cell adhesion molecules (CAMs)              Leukocyte transendothelial migration
Chemokine signaling pathway                 Cell adhesion molecules (CAMs)
Type I diabetes mellitus                    Cytosolic DNA-sensing pathway
Toll-like receptor signaling pathway        Chemokine signaling pathway
Cytokine-cytokine receptor interaction

SIGORA avoids many biologically less plausible results seen by other methods that over-emphasize generalist genes/pathways.

For example, 6/8 up-regulated genes in the “Type I diabetes mellitus” pathway are also in the “Antigen processing and presentation” pathway.
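The idea of a pathway signature can be illustrated with a simplified sketch that keeps only genes belonging to exactly one pathway and counts how many of them appear in a gene list. This covers single genes only; the gene-pair signatures and the statistics SIGORA applies are not reproduced here, and all names and data below are hypothetical.

```python
from collections import defaultdict


def single_gene_signatures(pathway_to_genes):
    """Return, per pathway, the genes that belong to that pathway and no other."""
    gene_to_pathways = defaultdict(set)
    for pathway, genes in pathway_to_genes.items():
        for gene in genes:
            gene_to_pathways[gene].add(pathway)
    signatures = {pathway: set() for pathway in pathway_to_genes}
    for gene, pathways in gene_to_pathways.items():
        if len(pathways) == 1:  # unique to a single pathway
            (only_pathway,) = pathways
            signatures[only_pathway].add(gene)
    return signatures


def signature_hits(signatures, gene_list):
    """Count signature genes of each pathway present in the gene list."""
    genes = set(gene_list)
    return {pathway: len(sig & genes) for pathway, sig in signatures.items()}


# Hypothetical toy annotation: g3 is shared, so it is not a signature of A or B.
toy = {"A": {"g1", "g2", "g3"}, "B": {"g3", "g4"}}
sigs = single_gene_signatures(toy)
print(sigs, signature_hits(sigs, ["g3", "g4"]))
```

In the toy example the shared gene g3 contributes to neither pathway, so a gene list containing g3 and g4 supports only pathway B.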

Future challenges and opportunities
(using bacterial protein localization as an example of what is to come)

(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)

Bacterial protein subcellular localization prediction

•  Aids genome annotation and prediction of protein function
•  Used to identify cell surface/secreted targets for drugs and diagnostics, as well as potential vaccine components
•  Many pathogen-associated virulence factors predicted as secreted

(Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)

PSORTb: bacterial protein subcellular localization (SCL) prediction software

•  Signal peptides: Non-cytoplasmic
•  Amino acid composition/patterns: All localizations
    - Support Vector Machines trained with amino acid compositions or frequent subsequences
•  Transmembrane helices: Cytoplasmic membrane
    - HMMTOP
•  PROSITE motifs with 100% precision: All localizations
•  Outer membrane motifs: Outer membrane
    - Identified by association-rule mining
•  Homology to proteins of experimentally known localization: All localizations
    - “SCL-BLAST” against proteins of known localization
    - E=10e-10 and length restriction for precision

Integration with a Bayesian Network

Yu et al (2010) Bioinformatics 26:1608

PSORTb: version 3

Main localizations predicted
Bacteria and Archaea predictions

Sub-category localization predictions
•  Type III secretion apparatus
•  Pili/fimbria
•  Host-associated SCL
•  Flagellum
•  Spore
•  Gas vesicle

Organism group   Software      Precision   Recall
Gram-negative    PSORTb v3.0   96.8        88.0
Gram-negative    PSORTb v2.0   95.7        81.5
Gram-positive    PSORTb v3.0   97.0        93.2
Gram-positive    PSORTb v2.0   96.7        89.3
Archaea          PSORTb v3.0   95.0        93.3

PSORTb v3.0: high precision, improved sensitivity/recall and genome prediction coverage

[Figure: bar chart (axis 0 to 100) comparing PSORTb v.2 and PSORTb v.3 for Gram-negative and Gram-positive bacteria. Panels: five-fold cross validation; genome prediction coverage.]

A computational predictor more accurate than related high-throughput lab methods

Challenge: Organismal diversity

Classic Gram positive bacteria, monoderms: Thick peptidoglycan, no outer membrane
Classic Gram negative bacteria, diderms: Thin peptidoglycan + outer membrane

…but can have Gram negatives with no outer membrane (e.g. Mycoplasma) or a different outer membrane (Synergistetes, Sphingomonas), or Gram positive (thick peptidoglycan) with a different outer membrane (Deinococcus: 6 layers in cell envelope!), or “acid fast” with asymmetric lipid-containing thick cell wall (Mycobacteria)

Plus bacterial organelles and other substructures (e.g. magnetosome of Magnetospirillum)...

Solution:
- For whole genome (deduced-proteome) analysis, detect key protein markers of a particular cell type (e.g. Omp85 essential for classic Gram negative membrane)
- For single protein analysis, learn from above analysis, plus literature curation, the most likely cell type for a given phylum
…then make predictions assuming that cell “type”

Reproduced under Fair Use

Challenge: Temporal, contextual diversity

Proteins can be associated with multiple subcellular localizations
e.g. Cell division proteins, Autotransporters, “protein A dependent on protein B”

Solution: Note all possible localizations, since temporal, contextual predictions are non-trivial (not enough knowledge for most)

Kjærgaard K et al. J. Bacteriol. 2000;182:4789-4796

Challenge: Metagenomics

High demand for PSORTb to be able to analyze metagenomic sequences … under development

Need taxonomy data to aid predictions (then enable appropriate cell type analysis)

Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.

What makes a great predictor?
(besides it being right) ☺

Bioinformatics Predictor’s Code of Conduct

- Never force predictions; always have a prediction option/category of “unknown”

Inspired by the classic “Data Provider’s Code of Conduct” in Stein (2002) Nature 417, 119-120

Example of forced predictions: PSORT I prediction method
Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)
Overall Accuracy = 69%

What’s wrong here? No secreted/extracellular localization!
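One simple way to avoid forced predictions is to abstain whenever no candidate is supported strongly enough. The sketch below assumes a predictor that yields one score between 0 and 1 per localization; the function name, the score scale and the 0.75 cutoff are illustrative assumptions, not how PSORT I or PSORTb make their calls.

```python
def call_localization(scores, min_score=0.75):
    """Return the best-scoring localization, or "Unknown" if support is weak.

    scores: dict mapping localization name -> score in [0, 1].
    min_score: hypothetical confidence cutoff chosen for illustration.
    """
    if not scores:
        return "Unknown"
    best_site, best_score = max(scores.items(), key=lambda item: item[1])
    return best_site if best_score >= min_score else "Unknown"


print(call_localization({"Cytoplasmic": 0.9, "Extracellular": 0.1}))
print(call_localization({"Cytoplasmic": 0.4, "Extracellular": 0.35}))
```

The second call returns "Unknown" rather than picking the marginally higher score, trading some coverage for precision.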

Bioinformatics Predictor’s Code of Conduct

- Never force predictions; always have “unknown” option/category
- Ensure open source; enable viewing of prediction method details
- Predictor should easily be trainable with different datasets (if applicable; so others can robustly evaluate accuracy)
- Have ability to run locally or over web (with an API is preferred)
- Provide access to old versions (at minimum when transitioning to new version)
- Encourage continuing curation from the literature/lab experiments! Incorporate some curation efforts into predictor funding applications

Bioinformatics Predictor’s Code of Conduct - evaluation

- Evaluate precision and recall (and accuracy measure combos thereof) with x-fold cross validation and/or new datasets (like CAFA!)
- ID errors, biases and provide guidance to users re issues to watch for
    - bias in training and/or testing datasets (“homology reduction”, “clade exclusion” may help)
    - errors in “gold standard” lab-based measure
    - contextual/temporal changes in proteins, impacting prediction (e.g. function changes when another protein/compound present)
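When a predictor is allowed to answer "unknown", precision and recall diverge in a useful way. The sketch below uses one common convention, assumed here for illustration: precision is computed over the predictions actually made, and recall over all proteins with a known label. The labels are made up.

```python
def precision_recall(true_labels, predicted_labels, unknown="Unknown"):
    """Overall precision and recall for a predictor that may abstain."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    # Keep only the cases where the predictor committed to an answer.
    made = [(t, p) for t, p in zip(true_labels, predicted_labels) if p != unknown]
    correct = sum(1 for t, p in made if t == p)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(true_labels) if true_labels else 0.0
    return precision, recall


truth = ["Cytoplasmic", "Extracellular", "Cytoplasmic", "Extracellular"]
calls = ["Cytoplasmic", "Unknown", "Extracellular", "Extracellular"]
print(precision_recall(truth, calls))
```

Here the abstention lowers recall (2 of 4 proteins correctly localized) but does not count against precision (2 of 3 predictions made are correct).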

What we MUST do:

Guide users to not just blindly use a predictor and its default output.

Curators, experimentalists, and automated function predictor developers must coordinate efforts more
•  Experimentalists working on what they think best…
•  Curators curating what they prioritize…
•  Function predictors optimizing prediction using existing data…

Function predictors/bioinformaticists need to get in the driver's seat more for research


Brinkman Lab Kayaking Trip, Summer 2013
(Next up, Archery Tag!)

Amir Foroushani
Matthew Laird
David Lynn
Raymond Lo
Mike Peabody
Thea Van Rossum
Matthew Whiteside
Nancy Yu
