Making Protein Function and Subcellular Localization Predictions: Challenges and Opportunities
Fiona Brinkman talk at Automated Function Prediction SIG, ISMB 2014, Boston, MA, USA

Published in: Science, Technology

  1. Making protein function and subcellular localization predictions: challenges and opportunities
     Fiona Brinkman
     Department of Molecular Biology and Biochemistry (Associate, Faculty of Health Sciences and School of Computing Sciences)
     Simon Fraser University, Greater Vancouver, BC, Canada
     April 2014
  2. • Improving sequence similarity/orthology-based predictions, a keystone of many predictors
     • Improving pathway/network-based analysis to identify protein functions
     • Future challenges and opportunities (using protein localization as an example of what is to come)
     What we MUST do to move AFP (automated function prediction) forward…
  3. One-to-one orthologs are, in particular, more functionally similar to each other than other orthologs or paralogs are, when sequence identity is >80%.
     Functional similarity was measured by GO annotation similarity (13 species).
     Altenhoff AM et al. PLoS Comput Biol. 2012
  5. Reciprocal Best BLAST Hit (RBBH) FAIL:
     • If the true ortholog is missing… (gene loss, or incomplete genome), the RBBH is a paralog.
     • One of the orthologous genes diverges faster… (than the usual divergence), so the gene tree (Ingroup1, Outgroup, Ingroup2) differs from the species tree (Ingroup1, Ingroup2, Outgroup).
     [Figure: species tree and gene trees for Ingroup1, Ingroup2 and Outgroup, with RBBH and paralog links marked]
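The RBBH failure on this slide can be reproduced with a small sketch. It assumes BLAST output has already been parsed into score dictionaries; the function name, the gene names and the bit scores are all hypothetical and chosen only to illustrate the gene-loss case.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit.

    scores_ab maps each gene in genome A to {gene_in_B: score};
    scores_ba is the same in the other direction. Both are hypothetical
    stand-ins for parsed BLAST results.
    """
    def best(hits):
        # Highest-scoring hit, or None when a gene has no hits at all.
        return max(hits, key=hits.get) if hits else None

    best_ab = {a: best(hits) for a, hits in scores_ab.items()}
    best_ba = {b: best(hits) for b, hits in scores_ba.items()}
    return sorted((a, b) for a, b in best_ab.items()
                  if b is not None and best_ba.get(b) == a)


# Toy case: genome B has lost the true ortholog of a1, so only a paralog remains.
scores_ab = {"a1": {"b1_paralog": 180.0},
             "a2": {"b2": 400.0, "b1_paralog": 90.0}}
scores_ba = {"b1_paralog": {"a1": 180.0, "a2": 90.0},
             "b2": {"a2": 400.0}}
rbbh_pairs = reciprocal_best_hits(scores_ab, scores_ba)
print(rbbh_pairs)
```

In this toy input the pair (a1, b1_paralog) is reported as a reciprocal best hit even though it is not a true ortholog pair, which is the kind of error the next slide's ratio test is meant to catch.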
  6. Ortholuge: uses phyletic ratios to differentiate Supporting Species Divergence (SSD) orthologs from proteins more divergent than expected (non-SSD).
     Ratio1 = distance{ingroup1-ingroup2} / distance{ingroup1-outgroup}
     Ratio2 = distance{ingroup1-ingroup2} / distance{ingroup2-outgroup}
     [Figure: SSD and non-SSD orthologs in an Ortholuge analysis comparing Burkholderia cepacia and B. cenocepacia (outgroup: B. pseudomallei)]
     Whiteside et al. 2013, PMID 23203876
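The two ratios above are simple to compute once the three pairwise distances are known. The sketch below is illustrative only: the function names and distances are made up, and the fixed-cutoff rule is an assumption for demonstration, since the slide does not say how Ortholuge decides that a ratio is "more divergent than expected".

```python
def ortholuge_ratios(d_in1_in2, d_in1_out, d_in2_out):
    """Compute Ratio1 and Ratio2 from the three pairwise distances."""
    ratio1 = d_in1_in2 / d_in1_out
    ratio2 = d_in1_in2 / d_in2_out
    return ratio1, ratio2


def looks_non_ssd(ratio1, ratio2, cutoff):
    """Hypothetical rule: flag the pair when either ratio exceeds the cutoff.

    The cutoff is an assumption of this sketch, not a value from the slides.
    """
    return ratio1 > cutoff or ratio2 > cutoff


# Ingroup genes close to each other and far from the outgroup: low ratios.
typical = ortholuge_ratios(d_in1_in2=0.1, d_in1_out=0.4, d_in2_out=0.5)
# Ingroup genes nearly as far from each other as from the outgroup: high ratios.
divergent = ortholuge_ratios(d_in1_in2=0.6, d_in1_out=0.5, d_in2_out=0.75)
print(typical, looks_non_ssd(*typical, cutoff=0.7))
print(divergent, looks_non_ssd(*divergent, cutoff=0.7))
```

Low ratios mean the two ingroup genes are much closer to each other than to the outgroup gene, which is what species divergence predicts for true orthologs.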
  7. Non-SSD “orthologs” are more likely to:
     • Be functionally dissimilar
     • Have one or more homologs
     [Figure, panel 1: predicted orthologs in 600 pairs of bacterial species; y-axis: proportion (0 to 1); categories: KEGG Orthology, Pfam Domains, Tigrfam Annotations, Subcellular Localizations; series: SSD ortholog vs non-SSD; * p-value < 0.05]
     [Figure, panel 2: one or more homologs (based on BLAST hits); y-axis: proportion (0 to 0.7); series: SSD orthologs vs non-SSD; * p-value < 0.05]
  8. A Database of Ortholuge Evaluations: OrtholugeDB (tinyurl.com/ortholugeDB)
     • Provides pre-computed ortholog predictions for >1400 bacteria and archaea (update coming next month!), with further Ortholuge assessments
     • Covers all genes in fully sequenced bacterial and archaeal genomes
     • Facilitates visualization and evaluation of ortholog predictions
  9. Similar issue with initial metagenomics sequence functional evaluation
     1. Simulated reads from Pseudomonas aeruginosa PAO1
     2. Created databases at different levels of clade exclusion
        • E.g. for species clade exclusion, removed all Pseudomonas aeruginosa genomes from the database
     3. Used RAPSearch2 and MEGAN5 to assign functional categories to the simulated reads
     4. Calculated the proportion of reads assigned to each functional category relative to how many reads were expected
        • E.g. category: Membrane Transport; expected # assigned: 567; actual # assigned: 583; relative proportion: 1.02822
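Step 4 is a per-category division of actual by expected read counts. A minimal sketch, using the Membrane Transport numbers from the slide and a hypothetical function name:

```python
def relative_proportions(actual, expected):
    """Ratio of actual to expected read counts for each functional category.

    Categories with an expected count of zero are skipped, because the
    ratio is undefined for them.
    """
    return {category: actual.get(category, 0) / n_expected
            for category, n_expected in expected.items()
            if n_expected > 0}


expected_counts = {"Membrane Transport": 567}
actual_counts = {"Membrane Transport": 583}
ratios = relative_proportions(actual_counts, expected_counts)
print(round(ratios["Membrane Transport"], 5))  # 1.02822
```

A ratio near 1 means the category is predicted about as often as expected; a ratio notably above 1 indicates overprediction.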
  10. Most functional categories are predicted well, but some are overpredicted (ratio notably >1).
      E.g. Endocrine system: 3 problematic orthology groups, all with high numbers of proteins (one has 3538 when the median is 54!)
      [Figure: ratio of assigned relative to expected (0 to 2.5), by level of clade exclusion: None, Species, Family, Class]
  11. The relative proportions of functional categories stay relatively consistent as the clade exclusion level increases.
      [Figure: proportion of reads assigned (0% to 100%) by clade exclusion level (None, Species, Family, Class); legend: Xenobiotics Biodegradation and Metabolism, Transcription, Signal Transduction, Replication and Repair, Infectious Diseases, Nucleotide Metabolism, Neurodegenerative Diseases, Metabolism of Other Amino Acids, Metabolism of Cofactors and Vitamins, Membrane Transport, …]
  12. Improving pathway-based analysis
      Issue: Biomolecular pathway classifications can bias analyses of pathways found to be upregulated or downregulated by transcriptome (or other omics-level) analysis.
      What you identify depends on how everything is classified…
      Need better “signatures” of pathways…
  13. Dealing with PART of the issue…
      Membership of a gene in multiple pathways is the norm, not the exception…
      [Figure: distribution of the number of associated pathways for human genes in KEGG; categories: 1, 2, 3, 4, 5, 6, 7-45]
      Foroushani et al, 2014 PMCID: PMC3883547
  14. Not all genes are equal…
      Not all genes are equivalent signatures of a given pathway.
      [Figure legend: maroon: pathway member; white: no membership]
      Foroushani et al, 2014 PMCID: PMC3883547
  15. Example: Treated vs Untreated Mouse Severe Inflammation, Gene Expression Dataset
      Individual Gene ORA:
      • Antigen processing and presentation
      • Graft-versus-host disease
      • Natural killer cell mediated cytotoxicity
      • Viral myocarditis
      • Allograft rejection
      • Cell adhesion molecules (CAMs)
      • Chemokine signaling pathway
      • Type I diabetes mellitus
      • Toll-like receptor signaling pathway
      • Cytokine-cytokine receptor interaction
      Standard Over-Representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) treat all genes in a given pathway as equal indicators that the pathway is significant. → Emphasizes generalist genes/pathways
      Foroushani et al, 2014 PMCID: PMC3883547
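To make the "equal indicators" point concrete, here is a minimal sketch of an individual-gene ORA using a one-sided hypergeometric test, which is a common choice for ORA. The slide does not state which test was used, so the test choice, the function name and the toy numbers are all assumptions.

```python
from math import comb


def ora_p_value(universe_size, pathway_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn at random from a
    universe containing pathway_size pathway members (hypergeometric tail).

    Every pathway gene counts the same, regardless of how many other
    pathways it belongs to.
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(comb(pathway_size, k)
               * comb(universe_size - pathway_size, list_size - k)
               for k in range(overlap, max_overlap + 1))
    return tail / total


# Toy numbers: 10 genes in total, 4 in the pathway, 3 in the gene list,
# 2 of which are pathway members.
p = ora_p_value(universe_size=10, pathway_size=4, list_size=3, overlap=2)
print(p)
```

Because the test only counts overlaps, a gene shared by many pathways contributes to the enrichment of all of them, which is how generalist genes and pathways end up emphasized.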
  16. Pathway Signatures using SIGORA: Identifying genes/gene pairs uniquely associated with a single pathway
      SIGORA identifies statistically significant enrichment of Pathway Signatures in a gene list of interest.
      Foroushani et al, 2014 PMCID: PMC3883547
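The idea of genes or gene pairs "uniquely associated with a single pathway" can be sketched as follows. This is not SIGORA's implementation; it only illustrates the uniqueness criterion on a toy pathway table, and the function name, pathway names and gene names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pathway_signatures(pathways):
    """Find genes, and pairs of non-unique genes, that occur in exactly one pathway.

    pathways maps a pathway name to its set of member genes.
    Returns (unique_genes, unique_pairs), each mapping a signature to its pathway.
    """
    gene_to_pathways = defaultdict(set)
    pair_to_pathways = defaultdict(set)
    for name, genes in pathways.items():
        for gene in genes:
            gene_to_pathways[gene].add(name)
        for pair in combinations(sorted(genes), 2):
            pair_to_pathways[pair].add(name)

    unique_genes = {gene: next(iter(names))
                    for gene, names in gene_to_pathways.items()
                    if len(names) == 1}
    # Keep only pairs whose members are both shared across pathways;
    # a pair containing a unique gene adds no new information.
    unique_pairs = {pair: next(iter(names))
                    for pair, names in pair_to_pathways.items()
                    if len(names) == 1
                    and pair[0] not in unique_genes
                    and pair[1] not in unique_genes}
    return unique_genes, unique_pairs


toy_pathways = {"P1": {"g1", "g2", "g3"},
                "P2": {"g2", "g4"},
                "P3": {"g3", "g4"}}
unique_genes, unique_pairs = pathway_signatures(toy_pathways)
print(unique_genes)
print(unique_pairs)
```

In the toy table, g2 and g3 each belong to two pathways, so neither is informative alone, but seeing them together points only to P1.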
  17. Example: Treated vs Untreated Mouse Severe Inflammation, Gene Expression Dataset
      SIGORA avoids many biologically less plausible results seen by other methods that over-emphasize generalist genes/pathways.
      For example, 6/8 up-regulated genes in the “Type I diabetes mellitus” pathway are also in the “Antigen processing and presentation” pathway.
      • Individual Gene ORA: Antigen processing and presentation; Graft-versus-host disease; Natural killer cell mediated cytotoxicity; Viral myocarditis; Allograft rejection; Cell adhesion molecules (CAMs); Chemokine signaling pathway; Type I diabetes mellitus; Toll-like receptor signaling pathway; Cytokine-cytokine receptor interaction
      • SIGORA: Antigen processing and presentation; Natural killer cell mediated cytotoxicity; Complement and coagulation cascades; Toll-like receptor signaling pathway; Cytokine-cytokine receptor interaction; Leukocyte transendothelial migration; Cell adhesion molecules (CAMs); Cytosolic DNA-sensing pathway; Chemokine signaling pathway
  18. Future challenges and opportunities (using bacterial protein localization as an example of what is to come)
      (Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)
  19. Bacterial protein subcellular localization prediction
      • Aids genome annotation and prediction of protein function
      • Used to identify cell surface/secreted targets for drugs and diagnostics, as well as potential vaccine components
      • Many pathogen-associated virulence factors predicted as secreted
      (Gardy & Brinkman 2006 Nature Reviews Microbiology 4:741)
  20. PSORTb: bacterial protein subcellular localization (SCL) prediction software
      • Signal peptides: Non-cytoplasmic
      • Amino acid composition/patterns: All localizations
        - Support Vector Machines trained with amino acid compositions or frequent subsequences
      • Transmembrane helices: Cytoplasmic membrane
        - HMMTOP
      • PROSITE motifs with 100% precision: All localizations
      • Outer membrane motifs: Outer membrane
        - Identified by association-rule mining
      • Homology to proteins of experimentally known localization: All localizations
        - “SCL-BLAST” against proteins of known localization
        - E=10e-10 and length restriction for precision
      Integration with a Bayesian Network
      Yu et al (2010) Bioinformatics 26:1608
  21. PSORTb: version 3
      • Bacteria and Archaea predictions
      • Main localizations predicted
      • Sub-category localization predictions: Type III secretion apparatus; Pili/fimbria; Host-associated SCL; Flagellum; Spore; Gas vesicle
  22. PSORTb v3.0: high precision, improved sensitivity/recall and genome prediction coverage
      Software, precision, recall:
      • Gram-negative: PSORTb v3.0, 96.8, 88.0; PSORTb v2.0, 95.7, 81.5
      • Gram-positive: PSORTb v3.0, 97.0, 93.2; PSORTb v2.0, 96.7, 89.3
      • Archaea: PSORTb v3.0, 95.0, 93.3
      [Figure: five-fold cross validation and genome prediction coverage (0 to 100), PSORTb v.2 vs PSORTb v.3, for Gram-negative and Gram-positive]
      A computational predictor more accurate than related high-throughput lab methods
  23. Challenge: Organismal diversity
      • Classic Gram positive bacteria, monoderms: thick peptidoglycan, no outer membrane
      • Classic Gram negative bacteria, diderms: thin peptidoglycan + outer membrane
      • …but can have Gram negatives with no outer membrane (e.g. Mycoplasma) or a different outer membrane (Synergistetes, Sphingomonas), or Gram positives (thick peptidoglycan) with a different outer membrane (Deinococcus: 6 layers in cell envelope!), or “acid fast” with an asymmetric lipid-containing thick cell wall (Mycobacteria)
      • Plus bacterial organelles and other substructures (e.g. magnetosome of Magnetospirillum)...
      Solution:
      - For whole genome (deduced-proteome) analysis, detect key protein markers of a particular cell type (e.g. Omp85, essential for classic Gram negative membrane)
      - For single protein analysis, learn from the above analysis, plus literature curation, the most likely cell type for a given phylum
      …then make predictions assuming that cell “type”
      [Figure reproduced under Fair Use]
  24. Challenge: Temporal, contextual diversity
      Proteins can be associated with multiple subcellular localizations, e.g. cell division proteins, autotransporters, “protein A dependent on protein B”
      Solution: Note all possible localizations, since temporal, contextual predictions are non-trivial; not enough knowledge for most
      Kjærgaard K et al. J. Bacteriol. 2000;182:4789-4796
  25. Challenge: Metagenomics
      High demand for PSORTb to be able to analyze metagenomic sequences… under development
      Need taxonomy data to aid predictions (then enable appropriate cell type analysis)
  27. Through over a decade of curating for, making and evaluating predictors of protein localization, genomic islands, etc.
      What makes a great predictor? (besides it being right) ☺
  28. Bioinformatics Predictor’s Code of Conduct
      - Never force predictions: always have a prediction option/category of “unknown”
      Inspired by the classic “Data Provider’s Code of Conduct” in Stein (2002) Nature 417, 119-120
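One simple way to honour the "never force predictions" rule is to return “unknown” whenever no category scores highly enough. The sketch below is a generic illustration, not PSORTb's decision logic; the function name, the scores and the 0.75 threshold are hypothetical.

```python
def predict_or_unknown(scores, threshold):
    """Return the top-scoring category, or "unknown" when no score reaches the threshold.

    scores maps each candidate category to a confidence score.
    """
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


confident = predict_or_unknown(
    {"cytoplasmic": 0.95, "outer membrane": 0.05}, threshold=0.75)
ambiguous = predict_or_unknown(
    {"cytoplasmic": 0.45, "outer membrane": 0.40}, threshold=0.75)
print(confident, ambiguous)  # cytoplasmic unknown
```

Raising the threshold trades recall for precision, so the cutoff itself should be evaluated rather than assumed.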
  29. Example of forced predictions: PSORT I prediction method
      Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)
      Overall Accuracy = 69%
      What’s wrong here?
  30. Example of forced predictions: PSORT I prediction method
      Nakai & Kanehisa, Proteins: Structure, Function, Genetics (1991)
      Overall Accuracy = 69%
      No secreted/extracellular localization!
  31. Bioinformatics Predictor’s Code of Conduct
      Inspired by the classic “Data Provider’s Code of Conduct” in Stein (2002) Nature 417, 119-120
      - Never force predictions: always have an “unknown” option/category
      - Ensure open source: enable viewing of prediction method details
      - Predictor should easily be trainable with different datasets (if applicable; so others can robustly evaluate accuracy)
      - Have ability to run locally or over web (with an API is preferred)
      - Provide access to old versions (at minimum when transitioning to new version)
      - Encourage continuing curation from the literature/lab experiments! Incorporate some curation efforts into predictor funding applications
  32. Bioinformatics Predictor’s Code of Conduct: evaluation
      - Evaluate precision and recall (and accuracy measure combos thereof) with x-fold cross validation and/or new datasets (like CAFA!)
      - Identify errors and biases, and provide guidance to users on issues to watch for:
        - bias in training and/or testing datasets (“homology reduction”, “clade exclusion” may help)
        - errors in “gold standard” lab-based measure
        - contextual/temporal changes in proteins, impacting prediction (e.g. function changes when another protein/compound is present)
      What we MUST do: Guide users to not just blindly use a predictor and its default output.
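When a predictor is allowed to answer “unknown”, precision and recall diverge in a useful way. The sketch below assumes one common convention, which the slides do not spell out: precision is computed over the predictions actually made, and recall over all proteins in the test set. The function name and labels are hypothetical.

```python
def precision_recall(true_labels, predicted_labels, unknown="unknown"):
    """Precision over non-"unknown" predictions; recall over all test proteins.

    This convention is an assumption of the sketch, not a definition
    taken from the slides.
    """
    pairs = list(zip(true_labels, predicted_labels))
    made = [(truth, pred) for truth, pred in pairs if pred != unknown]
    correct = sum(1 for truth, pred in made if truth == pred)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(pairs) if pairs else 0.0
    return precision, recall


truth = ["cytoplasmic", "cytoplasmic", "outer membrane",
         "outer membrane", "extracellular"]
predicted = ["cytoplasmic", "unknown", "outer membrane",
             "cytoplasmic", "unknown"]
precision, recall = precision_recall(truth, predicted)
print(precision, recall)
```

Here two of the three predictions made are correct, but only two of the five proteins are recovered, so abstaining protects precision at a visible cost in recall.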
  33. Bioinformatics Predictor’s Code of Conduct
      What we MUST do:
      Guide users to not just blindly use a predictor and its default output.
      Curators, experimentalists, and automated function predictor developers must coordinate efforts more:
      • Experimentalists working on what they think best…
      • Curators curating what they prioritize…
      • Function predictors optimizing prediction using existing data…
      Function predictors/bioinformaticists need to get in the driver’s seat more for research
  34. Brinkman Lab Kayaking Trip, Summer 2013 (Next up, Archery Tag!)
      Amir Foroushani, Matthew Laird, David Lynn, Raymond Lo, Mike Peabody, Thea Van Rossum, Matthew Whiteside, Nancy Yu
