Stop biocuration

This is a talk I gave at the Biocuration 2012 conference. It describes our method STOP (Statistical Tracking of Ontological Phrases), a web tool for gene set enrichment analysis using multiple ontologies.

Published in: Technology, Education

  1. 1. STOP Using Just GO: A Multi-Ontology Hypothesis Generation Tool for High Throughput Experimentation
     Tobias Wittkop, Emily TerAvest, Uday Evani, K. Mathew Fleisch, Ari E. Berman, Corey Powell, Nigam Shah and Sean D. Mooney
     Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
     National Center for Biomedical Ontology, Stanford University, Stanford,
  2. 2. Gene set analysis
     Experiment:
     • Microarray
     • RNASeq
     • RNAi
     • Yeast-2-hybrid
     • ...
     List of genes: A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR, ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1, BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA, CEBPB, CHEK2, CLOCK,
     Hypothesis generation:
     • Pathway analysis
     • Gene Ontology enrichment analysis
     • GSEA
  3. 3. Gene annotations outside of GO
     • NCBO currently includes over 200 ontologies
     • Manually curated gene-term annotations, which are necessary for term enrichment, are not available for other ontologies
     • Gene/protein summaries in Entrez Gene and UniProt are often more up to date than manually curated GO annotations
     • NCBO provides an Annotator service that matches text to terms
  4. 4. Gene annotations outside of GO
     Idea: use the descriptive text available for genes to derive up-to-date annotations of genes to terms from a wide spectrum of ontologies
  5. 5. Automatic gene annotation pipeline
  6. 6. Automatic gene annotation pipeline
     1. Collect genome/proteome from UCSC and UniProt
  7. 7. Automatic gene annotation pipeline
     1. Collect genome/proteome from UCSC and UniProt
     2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
     Example text: Q147X3 human The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC).; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle.; A quantitative atlas of mitotic phosphorylation.; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates.; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization.; Lysine acetylation targets protein complexes and co-regulates major cellular functions.; -!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B. -!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA. -!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30. -!- SUBCELLULAR LOCATION: Cytoplasm. -!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available; -!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily. -!- SIMILARITY: Contains 1 N-acetyltransferase domain. -.KEGG; hsa:122830; -.UCSC; uc001xcx.2; human.CTD; 122830; -.GeneCards; GC14P038022; -.H-InvDB; HIX0011696; -.HGNC; HGNC:19844; NAA30.neXtProt; NX_Q147X3; -.PharmGKB; PA134931315; -.eggNOG; prNOG15463; -.GeneTree; ENSGT00390000005665; -.HOGENOM; HBG282398; -.HOVERGEN; HBG082671; -.InParanoid; Q147X3; -.OMA; AGVHSGE; -.OrthoDB; EOG4KKZ4S; -.PhylomeDB; Q147X3; -.NextBio; 81013; -.ArrayExpress; Q147X3; -.Bgee; Q147X3; -.CleanEx; HS_NAT12; -.Genevestigator; Q147X3; -.GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell.GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC.InterPro; IPR000182; AcTrfase_GCN5-related_dom.InterPro; IPR016181; Acyl_CoA_acyltransferase.Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1.Pfam; PF00583; Acetyltransf_1; 1.SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1.PROSITE; PS51186; GNAT; 1.
  8. 8. Automatic gene annotation pipeline
     1. Collect genome/proteome from UCSC and UniProt
     2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
     3. Annotate text to over 200 ontologies via NCBO Annotator
     Figure labels: Gene Ontology, Biological process, Apoptosis signaling, Molecular function, Cellular function, Cell cycle ontology, Biological process, DNA replication initiation, Cytokinetic process, Biological continuant, Acetyltransferase
  9. 9. Gene/protein specific text as annotation source
     • Gene text from Entrez Gene
     • Protein text from UniProt
     • Gene/protein summary
     • Publication titles
     • GO annotations
     • Pathway annotations
     • GeneRIFs
     • Protein complexes, domains, interactions
     • We filter out author names, database names and numbers
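The filtering step in the last bullet can be illustrated with a small token filter. This is only a sketch under stated assumptions: the slides do not say how the filter is implemented, and the `DB_NAMES` list, the `clean_text` function and its `author_names` parameter are hypothetical names introduced here.

```python
import re

# Hypothetical list of database names to drop; the talk does not list the ones used.
DB_NAMES = {"KEGG", "UCSC", "GeneCards", "HGNC", "PharmGKB", "InterPro", "Pfam"}


def clean_text(text, author_names=()):
    """Remove database names, author names and purely numeric tokens from text."""
    blocked = {name.lower() for name in DB_NAMES}
    blocked |= {name.lower() for name in author_names}
    kept = []
    for token in text.split():
        # Compare the token without surrounding punctuation.
        core = token.strip(".,;:()")
        if not core or core.lower() in blocked:
            continue
        # Drop tokens that consist only of digits and number punctuation.
        if re.fullmatch(r"[\d.,]+", core):
            continue
        kept.append(token)
    return " ".join(kept)


cleaned = clean_text("KEGG; 122830; Catalytic subunit of the NatC complex.")
```

Filtering before annotation keeps identifiers and names from being matched as ontology terms by accident; a real filter would likely need a curated author and database list.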
  10. 10. The NCBO Annotator
      • Simple string matching using mgrep
      • Synonyms are annotated
      • Annotations are propagated to the root
      • Mappings between terms from different ontologies
      • No NLP
      • Very fast
      Figure: example ontology terms (Gene Ontology, Cell cycle ontology)
      Reference: C. Jonquet et al., AMIA Summit on Translational Bioinformatics (2009)
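The behaviour listed on this slide (exact string matching, synonym matching, propagation to the root) can be illustrated with a toy annotator. This is a minimal sketch and not mgrep or the NCBO Annotator itself: the three-term ontology, the term ids and all function names are hypothetical.

```python
import re

# Hypothetical toy ontology: term id -> (preferred label, synonyms, parent id or None).
ONTOLOGY = {
    "T0": ("biological process", [], None),
    "T1": ("cell death", [], "T0"),
    "T2": ("apoptosis", ["programmed cell death"], "T1"),
}


def build_lexicon(ontology):
    """Map every lowercased label and synonym to its term id."""
    lexicon = {}
    for term_id, (label, synonyms, _parent) in ontology.items():
        for name in [label, *synonyms]:
            lexicon[name.lower()] = term_id
    return lexicon


def ancestors(term_id, ontology):
    """Yield all ancestors of a term, walking parent links up to the root."""
    parent = ontology[term_id][2]
    while parent is not None:
        yield parent
        parent = ontology[parent][2]


def annotate(text, ontology):
    """Return the ids of all terms matched in text, plus their ancestors."""
    lowered = text.lower()
    hits = set()
    for name, term_id in build_lexicon(ontology).items():
        # Whole-word, case-insensitive exact match; no NLP involved.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            hits.add(term_id)
    propagated = set(hits)
    for term_id in hits:
        propagated.update(ancestors(term_id, ontology))
    return propagated


result = annotate("Knockdown leads to p53-dependent apoptosis.", ONTOLOGY)
```

Because matches are propagated upward, a single hit on a specific term also annotates the gene to every more general term, which is one reason annotation counts grow so large.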
  11. 11. Annotation results
      561,577,156 annotations of 73,248 genes and 146,271 proteins to 404,347 terms from 246 ontologies for 4 organisms (human, mouse, fly and worm)
  12. 12. 10 most annotated ontologies
      Bar chart; counts and labels are listed in extraction order.
      Annotation counts: 12,584,050; 13,177,577; 13,341,529; 13,526,445; 14,541,946; 15,937,632; 16,338,723; 33,628,760; 34,137,453; 35,767,064
      Ontologies: SNOMED Clinical Terms; NCI Thesaurus; NIFSTD; CRISP Thesaurus, 2006; Read Codes, Clinical Terms Version 3 (CTV3); Galen; Suggested Ontology for Pharmacogenomics; Gene Ontology Extension; Human developmental anatomy, timed version; Gene Ontology
  13. 13. STOP/GO evaluation
      • Compared the existing GO annotations of genes/proteins with our GO annotations
      • High recall, 0.95-0.99 (we reproduce existing annotations)
      • Lower precision, 0.6-0.75 (we add new annotations)
      • How good are the novel annotations?
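The precision and recall figures on this slide follow from comparing two sets of gene-term pairs. The sketch below shows the arithmetic on made-up data; the function name and the toy pairs are assumptions for illustration and do not reproduce the evaluation from the talk.

```python
def precision_recall(predicted, curated):
    """Compare two sets of (gene, term) pairs and return (precision, recall)."""
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Toy data: every curated pair is recovered, and one extra pair is predicted.
curated = {("gene1", "term_a"), ("gene1", "term_b"), ("gene2", "term_c")}
predicted = curated | {("gene2", "term_d")}

precision, recall = precision_recall(predicted, curated)
```

In this toy case recall is 1.0 and precision is 0.75: an extra prediction lowers precision whether it is a false positive or a correct annotation that curators have not yet recorded, which is why the slide asks how good the novel annotations are.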
  14. 14. Novel GO annotation examples
      • Human protein carboxylesterase 1 (P23141), annotated to “cocaine metabolic process” (GO:0050783) based on the title of a reference paper: “Structural basis of heroin and cocaine metabolism by a promiscuous human drug-processing enzyme”
      • C. elegans protein ATP-dependent Clp protease proteolytic subunit 1 (Q27539), annotated to “mitochondrial unfolded protein response” (GO:0034514) based on the title of a reference paper: “ClpP mediates activation of a mitochondrial unfolded protein response in C. elegans”
  15. 15. Critical assessment of functional annotations (CAFA)
      January 2010, June 2010
      • Collect all proteins from UniProt that have no GO annotation with experimental evidence
      • Submission of predicted GO annotations
      • Collect novel experimental GO annotations for the same proteins
      • Compare predictions with novel experimentally validated annotations
      The Mooney group was an assessing group for CAFA
  16. 16. CAFA results
      • Our annotations ranked 15th out of 40 when predicting Molecular function (MFO) annotations: F-measure 0.4 (best 0.48)
      • Our annotations ranked 7th out of 36 when predicting Biological process (BPO) annotations: F-measure 0.33 (best 0.35)
      http://biofunctionprediction.org/
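For reference, and assuming the F-measure quoted on this slide is the usual harmonic mean of precision $P$ and recall $R$ (the slide does not say at which score threshold it was computed), the definition is:

```latex
F = \frac{2 \, P \, R}{P + R},
\qquad
P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
\qquad
R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
```

Here TP, FP and FN are the counts of true positive, false positive and false negative predicted annotations. Because the harmonic mean is dominated by the smaller of the two values, a method cannot reach a high F-measure through recall alone.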
  17. 17. Gene set analysis
      Experiment:
      • Microarray
      • RNASeq
      • RNAi
      • Yeast-2-hybrid
      • ...
      List of genes: A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR, ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1, BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA, CEBPB, CHEK2, CLOCK,
      Hypothesis generation:
      • Pathway analysis
      • Gene Ontology enrichment analysis
      • GSEA
  18. 18. Statistical Tracking of Ontological Phrases (STOP)
      Enrichment analysis web application using automated annotations
      http://mooneygroup.org/stop
  19. 19. Statistical Tracking of Ontological Phrases (STOP)
      Enrichment analysis web application using automated annotations
      http://mooneygroup.org/stop
      • Results as table
  20. 20. Statistical Tracking of Ontological Phrases (STOP)
      Enrichment analysis web application using automated annotations
      http://mooneygroup.org/stop
      • Results as table
      • Termcloud
  21. 21. Statistical Tracking of Ontological Phrases (STOP)
      Enrichment analysis web application using automated annotations
      http://mooneygroup.org/stop
      • Results as table
      • Termcloud
      • Revisit previous results
  22. 22. Statistical Tracking of Ontological Phrases (STOP)
      Enrichment analysis web application using automated annotations
      http://mooneygroup.org/stop
      • Results as table
      • Termcloud
      • Revisit previous results
      • Entrez gene id, gene symbol or UniProt id
      • Custom background
  23. 23. Statistical Tracking of Ontological Phrases (STOP)
      Enrichment analysis web application using automated annotations
      http://mooneygroup.org/stop
      • Results as table
      • Termcloud
      • Revisit previous results
      • Entrez gene id, gene symbol or UniProt id
      • Custom background
      • Filter results by ontology
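Term enrichment against a custom background can be sketched with a one-sided hypergeometric test, a common choice for this kind of analysis. The slides do not state which statistic STOP uses, so the test itself, the function names and the toy gene and term identifiers below are all assumptions.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


def enrich(gene_list, background, annotations):
    """Return (term, p-value) pairs sorted by p-value.

    annotations maps each term to the set of genes annotated to it.
    """
    bg = set(background)
    genes = set(gene_list) & bg
    results = []
    for term, annotated in annotations.items():
        annotated_bg = annotated & bg
        k = len(genes & annotated_bg)
        if k == 0:
            continue  # term does not occur in the input list
        p = hypergeom_pvalue(k, len(genes), len(annotated_bg), len(bg))
        results.append((term, p))
    return sorted(results, key=lambda item: item[1])


# Toy example with hypothetical gene and term identifiers.
background = [f"g{i}" for i in range(10)]
annotations = {
    "term_a": {"g0", "g1", "g2"},
    "term_b": {"g0", "g5", "g6", "g7", "g8", "g9"},
}
results = enrich(["g0", "g1", "g2"], background, annotations)
```

With many terms tested at once, the raw p-values would normally be adjusted for multiple testing before being shown in a results table.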
  24. 24. Summary
      STOP using just GO ... because we provide
      • Automated annotations of genes to terms from over 200 ontologies
      • Enrichment analysis on novel annotations, which corrects for putative false positives and expands the realm of testable hypotheses
      • An easy-to-use web interface that allows quick analysis and assists in navigating results
      • Human, worm, mouse, fly (... many more to come)
      • http://mooneygroup.org/stop
  25. 25. Acknowledgements
      Buck Institute for Research on Aging: Sean Mooney, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell
      NCBO: Nigam Shah and Trish Whetzel
      CAFA: Iddo Friedberg, Predrag Radivojac, Wyatt Clark
      Funding: NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi, Ellerby), NIH UL1DE019608 (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO and the Buck Trust.
      http://mooneygroup.org/stop
