STOP Using Just GO: A Multi-Ontology Hypothesis Generation Tool for High Throughput Experimentation


Tobias Wittkop, Emily TerAvest, Uday Evani, K. Mathew Fleisch, Ari E. Berman, Corey Powell, Nigam Shah and Sean D. Mooney


Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford
Gene set analysis

Experiment
• Microarray
• RNASeq
• RNAi
• Yeast-2-hybrid
• ...

List of genes
A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR,
ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1,
BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA,
CEBPB, CHEK2, CLOCK, ...

Hypothesis generation
• Pathway analysis
• Gene Ontology enrichment analysis
• GSEA
Gene annotations outside of GO


• NCBO currently includes over 200 ontologies

• Manually curated gene-term annotations, which are necessary for
  term enrichment, are not available for other ontologies

• Gene/protein summaries in Entrez Gene and UniProt are often more
  up-to-date than manually curated GO annotations

• NCBO provides an annotator service that matches text to ontology terms


       Idea: Use descriptive text for genes to
       retrieve up-to-date annotations of
       genes to a wide spectrum of ontologies
Automatic gene annotation pipeline

Genome/Proteome

1. Collect genome/proteome from UCSC and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator

Example of collected descriptive text (step 2):

Q147X3 human
The status, quality, and expansion of the NIH full-length cDNA project: the
Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables
quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative
atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal
acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N
alpha-terminal acetyltransferase complex C leads to p53-dependent
apoptosis and aberrant human Arl8b localization. ; Lysine acetylation
targets protein complexes and co-regulates major cellular functions. ;
-!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex.
Catalyzes acetylation of the N-terminal methionine residues of peptides beginning
with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and
function of ARL8B.
-!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA.
-!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex,
which is composed of NAA35, LSMD1 and NAA30.
-!- SUBCELLULAR LOCATION: Cytoplasm.
-!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2;
Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2;
Sequence=VSP_031581; Note=No experimental confirmation available;
-!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily.
-!- SIMILARITY: Contains 1 N-acetyltransferase domain.
-.KEGG; hsa:122830; -.UCSC; uc001xcx.2; human.CTD; 122830; -.GeneCards;
GC14P038022; -.H-InvDB; HIX0011696; -.HGNC; HGNC:19844; NAA30.neXtProt;
NX_Q147X3; -.PharmGKB; PA134931315; -.eggNOG; prNOG15463; -.GeneTree;
ENSGT00390000005665; -.HOGENOM; HBG282398; -.HOVERGEN; HBG082671;
-.InParanoid; Q147X3; -.OMA; AGVHSGE; -.OrthoDB; EOG4KKZ4S; -.PhylomeDB;
Q147X3; -.NextBio; 81013; -.ArrayExpress; Q147X3; -.Bgee; Q147X3; -.CleanEx;
HS_NAT12; -.Genevestigator; Q147X3; -.GO; GO:0005737; C:cytoplasm;
IEA:UniProtKB-SubCell.GO; GO:0004596; F:peptide alpha-N-acetyltransferase
activity; IEA:EC.InterPro; IPR000182; AcTrfase_GCN5-related_dom.InterPro;
IPR016181; Acyl_CoA_acyltransferase.Gene3D; G3DSA:3.40.630.30;
Acyl_CoA_acyltransferase; 1.Pfam; PF00583; Acetyltransf_1; 1.SUPFAM; SSF55729;
Acyl_CoA_acyltransferase; 1.PROSITE; PS51186; GNAT; 1.

[Figure for step 3: example terms in two ontology trees.
Gene Ontology: Biological process, Molecular function, Cellular function, Apoptosis, signaling.
Cell cycle ontology: Biological process, Biological continuant, Cytokinetic process, DNA replication initiation, Acetyltransferase.]
Gene/protein specific text as annotation source

• Gene text from Entrez Gene

• Protein text from UniProt

  • Gene/Protein summary

  • Publication titles

  • GO annotations

  • Pathway annotations

  • GeneRIFs

  • Protein complexes, domains, interactions

  • We filter out author names, database names and numbers

[Example shown on the slide: the descriptive text collected for human protein Q147X3 (NAA30), combining publication titles, UniProt comment fields and database cross-references]
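
A minimal sketch of the filtering step, assuming that "filter" means removing these tokens before annotation. It is an illustration only, not the STOP implementation: the `DB_NAMES` list and the token rules are hypothetical, and author-name removal is left out because it would need the structured citation records.

```python
import re

# Hypothetical list of database names to drop; the slides do not give the real list.
DB_NAMES = {"KEGG", "UCSC", "HGNC", "Pfam", "InterPro", "PROSITE"}

def filter_text(text: str) -> str:
    """Remove database names and purely numeric tokens from descriptive text."""
    kept = []
    for token in text.split():
        bare = token.strip(".,;:()")
        if not bare:
            continue  # token was punctuation only
        if bare in DB_NAMES:
            continue  # database name
        if re.fullmatch(r"[\d.,:-]+", bare):
            continue  # number or numeric identifier
        kept.append(token)
    return " ".join(kept)
```

For example, `filter_text("KEGG; 122830; cytoplasm 81013 apoptosis.")` keeps only the two descriptive words.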
The NCBO annotator

• Simple string matching using mgrep

• Synonyms are annotated

• Annotations are propagated to the root

• Mappings between terms from different ontologies

• No NLP

• Very fast

[Figure: example terms in two ontology trees.
Gene Ontology: Biological process, Molecular function, Cellular function, Apoptosis, signaling.
Cell cycle ontology: Biological process, Biological continuant, Cytokinetic process, DNA replication initiation, Acetyltransferase.]

1 C. Jonquet et al., AMIA Summit on Translational Bioinformatics (2009)
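
The first three bullets (string matching, synonym matching, propagation to the root) can be illustrated with a toy sketch. This is not the NCBO Annotator or mgrep: Python's `re` module stands in for the matcher, the `ONTOLOGY` dictionary with its ids, names and synonyms is made up, and cross-ontology mappings are not covered.

```python
import re

# Toy ontology: term id -> (preferred name, synonyms, parent id or None for the root).
# All ids, names and synonyms are hypothetical.
ONTOLOGY = {
    "T0": ("biological process", [], None),
    "T1": ("cell death", [], "T0"),
    "T2": ("apoptosis", ["programmed cell death"], "T1"),
}

def annotate(text: str, ontology: dict) -> set:
    """Return ids of terms matched by name or synonym, plus all their ancestors."""
    lowered = text.lower()
    matched = set()
    for term_id, (name, synonyms, _parent) in ontology.items():
        for label in [name, *synonyms]:
            # Whole-word, case-insensitive string match; no NLP involved.
            if re.search(r"\b" + re.escape(label.lower()) + r"\b", lowered):
                matched.add(term_id)
                break
    # Propagate every direct match up the parent chain to the root.
    annotated = set()
    for term_id in matched:
        current = term_id
        while current is not None and current not in annotated:
            annotated.add(current)
            current = ontology[current][2]
    return annotated
```

With this toy ontology, a text that mentions only "apoptosis" is also annotated with the two ancestor terms.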
Annotation results




561,577,156 annotations of 73,248 genes
and 146,271 proteins to 404,347 terms from
246 ontologies for 4 organisms (human,
mouse, fly and worm)
10 most annotated ontologies

[Pie chart of annotation counts per ontology]
Legend: SNOMED Clinical Terms; NCI Thesaurus; NIFSTD; CRISP Thesaurus, 2006; Read Codes, Clinical Terms Version 3 (CTV3); Galen; Suggested Ontology for Pharmacogenomics; Gene Ontology Extension; Human developmental anatomy, timed version; Gene Ontology
Slice labels: 35,767,064; 34,137,453; 33,628,760; 16,338,723; 15,937,632; 14,541,946; 13,526,445; 13,341,529; 13,177,577; 12,584,050
STOP/GO evaluation

• Compared existing GO annotations of genes/proteins with our GO
  annotations

• High recall, 0.95-0.99 (we reproduce existing
  annotations)

• Lower precision, 0.6-0.75 (we add new annotations)

• How good are the novel annotations?
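
A short sketch of how precision and recall behave in this comparison, assuming annotations are represented as (gene, term) pairs. The function name and the example pairs are hypothetical. Every predicted pair that is missing from the reference lowers precision, whether it is wrong or a correct novel annotation, which is why the last question matters.

```python
def precision_recall(predicted: set, reference: set) -> tuple:
    """Precision and recall of predicted (gene, term) pairs against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Made-up example: all reference pairs are reproduced, plus one pair not in the reference.
reference = {("G1", "GO:A"), ("G1", "GO:B"), ("G2", "GO:C")}
predicted = reference | {("G2", "GO:D")}
precision, recall = precision_recall(predicted, reference)
```

Here recall is 1.0 because every reference pair is reproduced, while precision drops to 0.75 because of the single extra pair.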
Novel GO annotation examples

Human protein carboxylesterase 1 (P23141) annotated to
“cocaine metabolic process” (GO:0050783),
based on the title of a reference paper:
“Structural basis of heroin and cocaine metabolism by a promiscuous human drug-processing enzyme”

C. elegans protein ATP-dependent Clp protease proteolytic subunit 1 (Q27539) annotated to
“mitochondrial unfolded protein response” (GO:0034514),
based on the title of a reference paper:
“ClpP mediates activation of a mitochondrial unfolded protein response in C. elegans”
Critical assessment of functional annotations (CAFA)

  The Mooney group was an assessing group for CAFA

January 2010
• Collect all proteins from UniProt that have no GO
  annotation with experimental evidence
• Submission of predicted GO annotations

June 2010
• Collect novel experimental GO annotations for the same
  proteins
• Compare predictions with novel experimentally
  validated annotations
CAFA results

  • Our annotations ranked 15th out of 40 when predicting Molecular function (MFO)
    annotations: F-measure 0.4 (best 0.48)

  • Our annotations ranked 7th out of 36 when predicting Biological process (BPO)
    annotations: F-measure 0.33 (best 0.35)

1 http://biofunctionprediction.org/
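
For reference, a minimal sketch of the F-measure, assuming the balanced form (harmonic mean of precision and recall). The slide does not state which variant or thresholding protocol the CAFA assessment used, so the function below is an illustration only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a method cannot reach a high F-measure on recall alone.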
  
Statistical Tracking of Ontological Phrases (STOP)

 Enrichment analysis web application using automated annotations

• Results as table

• Termcloud

• Revisit previous results

• Entrez Gene ID, gene
  symbol or UniProt ID

• Custom background

• Filter results by ontology


                         http://mooneygroup.org/stop
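
A sketch of term enrichment against a custom background, using a one-sided hypergeometric test. This is a common choice for over-representation analysis, but the slides do not say which statistic STOP uses, so treat it as an assumption. The function names and the `annotations` layout (term mapped to a set of genes) are hypothetical, and multiple-testing correction is not shown.

```python
from math import comb

def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N background genes, K of which carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def enrich(gene_list: set, background: set, annotations: dict) -> list:
    """Rank terms by over-representation p-value in gene_list versus background."""
    genes = gene_list & background  # ignore genes outside the background
    results = []
    for term, term_genes in annotations.items():
        in_background = term_genes & background
        hits = len(genes & in_background)
        if hits == 0:
            continue
        p = hypergeom_pvalue(hits, len(genes), len(in_background), len(background))
        results.append((term, hits, p))
    return sorted(results, key=lambda row: row[2])

# Made-up example with a background of ten genes.
background = {f"g{i}" for i in range(1, 11)}
annotations = {
    "termA": {"g1", "g2", "g3"},
    "termB": {"g1", "g4", "g5", "g6", "g7"},
    "termC": {"g9"},
}
ranked = enrich({"g1", "g2", "g3"}, background, annotations)
```

In this example `termA` ranks first because all three of its genes are in the input list, while `termC` is dropped because it has no hits. Changing the background changes the counts `K` and `N`, and therefore the p-values.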
Summary


          STOP using just GO ... because we provide:


• Automated annotations of genes to terms from over 200 ontologies

• Enrichment analysis on novel annotations corrects for putative false
  positives and expands the realm of testable hypotheses

• Easy-to-use web interface allows quick analysis and assists in
  navigation of results

• Human, worm, mouse, fly (...many more to come)

• http://mooneygroup.org/stop
Acknowledgements

http://mooneygroup.org/stop

Buck Institute for Research on Aging
Sean Mooney, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

NCBO
Nigam Shah and Trish Whetzel

CAFA
Iddo Friedberg, Predrag Radivojac, Wyatt Clark

Funding
NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi, Ellerby), NIH UL1DE019608 (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO and the Buck Trust.

Stop biocuration

  • 1.
    STOP  Using  Just  GO:   A  Multi-­‐Ontology  Hypothesis  Generation   Tool  for  High  Throughput  Experimentation Tobias  Wittkop,  Emily  TerAvest,  Uday  Evani,  K.  Mathew  Fleisch,  Ari  E.   Berman,  Corey  Powell,    Nigam  Shah  and  Sean  D.  Mooney Mooney  laboratory,  Buck  Institute  for  Research  on  Aging,  Novato,  CA National  Center  for  Biomedical  Ontology,  Stanford  University,  Stanford,  
  • 2.
    Gene  set  analysis Hypothesis Experiment List of genes generation • Microarray A2M,  ABL1,  ADCY5,   • Pathway AGPAT2,  AIFM1,  AKT1,   APEX1,  APOC3,  APOE,   analysis • RNASeq APP,  APTX,  AR,   ARHGAP1,  ARNTL,   ATF2,  ATM,  ATP5O,   • Gene Ontology • RNAi ATR,  BAX,  BCL2,  BDNF,   enrichment BLM,  BMI1,  BRCA1,   BRCA2,  BSCL2,  BUB1B,   analysis • Yeast-2- BUB3,  CACNA1A,  CAT,   hybrid CCNA2,  CDC2,  CDC42,   CDKN2A,  CEBPA,   • GSEA CEBPB,  CHEK2,  CLOCK,   • ...
  • 3.
    Gene  annotations  outside  of  GO • NCBO currently includes over 200 ontologies • Manual curated gene-term annotation that are necessary for term enrichment are not available for different ontologies • Gene/protein summary in Entrez Gene and UniProt often more up-to-date than manually curated GO • NCBO provides annotator service that matches text to terms
  • 4.
    Gene  annotations  outside  of  GO • NCBO currently includes over 200 ontologies • Manual curated gene-term annotation that are necessary for term enrichment are not available for different ontologies • Gene/protein summary in Entrez Gene and UniProt often more up-to-date than manually curated GO • NCBO provides annotator service that matches text to terms Idea: Use descriptive text for genes to retrieve up-to-date annotations from genes to a wide spectrum of ontologies
  • 5.
  • 6.
    Automatic  gene  annotation  pipeline Genome/Proteome* 1.Collect genome/ 1" proteome from UCSC and UniProt
  • 7.
    Automatic gene annotation pipeline
    1. Collect genome/proteome from UCSC and UniProt
    2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
    [Example shown: UniProt entry Q147X3 (human, NAA30), including publication titles; FUNCTION, CATALYTIC ACTIVITY, SUBUNIT, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS and SIMILARITY comments; database cross-references; and the GO annotations GO:0005737 (C: cytoplasm) and GO:0004596 (F: peptide alpha-N-acetyltransferase activity)]
  • 8.
    Automatic gene annotation pipeline
    1. Collect genome/proteome from UCSC and UniProt
    2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
    3. Annotate text to over 200 ontologies via NCBO Annotator
    [Figure: the text of UniProt entry Q147X3 (human) shown next to example term trees from the Gene Ontology (nodes include biological process, molecular function, apoptosis) and the Cell cycle ontology (nodes include biological process, biological continuant, cytokinetic process, DNA replication initiation, acetyltransferase)]
  • 9.
    Gene/protein specific text as annotation source
    • Gene text from Entrez Gene
    • Protein text from UniProt
    • Gene/protein summary
    • Publication titles
    • GO annotations
    • Pathway annotations
    • GeneRIFs
    • Protein complexes, domains, interactions
    • We filter out author names, database names, and numbers
    [Example shown: text of UniProt entry Q147X3 (human)]
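The filtering step above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how STOP implements the filter, and the word lists and the `filter_text` helper are hypothetical names introduced here.

```python
import re

# Hypothetical filter lists; the slides do not give the real ones.
DB_NAMES = {"kegg", "ucsc", "hgnc", "pfam", "interpro", "genecards"}
AUTHOR_NAMES = {"smith", "jones"}

# Matches plain numbers such as "122830" or "3.40", but not mixed tokens like "p53".
NUMBER = re.compile(r"\d+([.,]\d+)*")


def filter_text(text, db_names=DB_NAMES, author_names=AUTHOR_NAMES):
    """Drop database names, author names, and plain numbers from free text."""
    kept = []
    for raw in text.split():
        token = raw.strip(";,.()")  # remove punctuation around the token
        if not token:
            continue
        if NUMBER.fullmatch(token):
            continue
        if token.lower() in db_names or token.lower() in author_names:
            continue
        kept.append(token)
    return " ".join(kept)


print(filter_text("KEGG; 122830; Smith; Catalyzes acetylation of p53"))
```

Mixed tokens such as `p53` are kept on purpose in this sketch, since they can carry biological meaning; whether the original pipeline does the same is not stated.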
  • 10.
    The NCBO Annotator (1)
    • Simple string matching using mgrep
    • Synonyms are annotated
    • Annotations are propagated to the root
    • Mappings between terms from different ontologies
    • No NLP
    • Very fast
    [Figure: example term trees from the Gene Ontology and the Cell cycle ontology]
    (1) C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)
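To make the first three bullets concrete, here is a minimal sketch of dictionary-based string matching with synonyms and propagation of annotations to the root. It is not mgrep and does not call the NCBO Annotator; the toy ontology, the term ids, and all function names are assumptions made for illustration.

```python
import re

# Toy ontology (hypothetical): term id -> (preferred name, synonyms, parent ids).
ONTOLOGY = {
    "T:root": ("biological process", [], []),
    "T:death": ("cell death", [], ["T:root"]),
    "T:apop": ("apoptosis", ["programmed cell death"], ["T:death"]),
}


def normalize(text):
    """Lowercase and reduce the text to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def build_dictionary(ontology):
    """Map every preferred name and synonym to its term id."""
    lookup = {}
    for term_id, (name, synonyms, _parents) in ontology.items():
        for label in [name, *synonyms]:
            lookup[normalize(label)] = term_id
    return lookup


def ancestors(term_id, ontology):
    """Collect all terms on the paths from term_id up to the root."""
    seen = set()
    stack = list(ontology[term_id][2])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ontology[parent][2])
    return seen


def annotate(text, ontology):
    """Return (directly matched terms, matched terms plus their ancestors)."""
    padded = f" {normalize(text)} "
    lookup = build_dictionary(ontology)
    # Whole-word string matching only: no parsing, no disambiguation (no NLP).
    direct = {tid for label, tid in lookup.items() if f" {label} " in padded}
    propagated = set(direct)
    for tid in direct:
        propagated |= ancestors(tid, ontology)
    return direct, propagated


direct, propagated = annotate("leads to p53-dependent apoptosis", ONTOLOGY)
print(sorted(direct), sorted(propagated))
```

The linear scan over labels is chosen for readability; a real matcher over hundreds of thousands of terms would need an indexed structure to be fast.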
  • 11.
    Annotation results
    561,577,156 annotations of 73,248 genes and 146,271 proteins to 404,347 terms from 246 ontologies for 4 organisms (human, mouse, fly and worm)
  • 12.
    10 most annotated ontologies
    [Pie chart. Legend: SNOMED Clinical Terms; NCI Thesaurus; NIFSTD; CRISP Thesaurus, 2006; Read Codes, Clinical Terms Version 3 (CTV3); Galen; Suggested Ontology for Pharmacogenomics; Gene Ontology Extension; Human developmental anatomy, timed version; Gene Ontology. Slice values, in descending order: 35,767,064; 34,137,453; 33,628,760; 16,338,723; 15,937,632; 14,541,946; 13,526,445; 13,341,529; 13,177,577; 12,584,050]
  • 13.
    STOP/GO evaluation
    • Compared the existing GO annotations of genes/proteins with our GO annotations
    • High recall, 0.95-0.99 (we reproduce existing annotations)
    • Lower precision, 0.6-0.75 (we add new annotations)
    • How good are the novel annotations?
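The comparison above can be sketched by treating the existing GO annotations as the reference set and the automated annotations as predictions. The function name and the placeholder pairs below are hypothetical; the slides do not describe how the evaluation was coded.

```python
def precision_recall(predicted, curated):
    """Precision and recall of predicted (gene, term) pairs against curated pairs."""
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Placeholder pairs, not real annotations.
curated = {("gene1", "TERM_A"), ("gene1", "TERM_B")}
predicted = {("gene1", "TERM_A"), ("gene1", "TERM_B"), ("gene1", "TERM_C")}

precision, recall = precision_recall(predicted, curated)
print(precision, recall)
```

In this toy case every curated pair is reproduced, so recall is 1.0, while the one extra pair lowers precision to 2/3. That extra pair is counted as a false positive even if it is a correct novel annotation, which is why the lower precision on the slide prompts the question about the quality of the novel annotations.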
  • 14.
    Novel GO annotation examples
    • Human protein carboxylesterase 1 (P23141), annotated to “cocaine metabolic process” (GO:0050783) based on the title of a reference paper: “Structural basis of heroin and cocaine metabolism by a promiscuous human drug-processing enzyme”
    • C. elegans protein ATP-dependent Clp protease proteolytic subunit 1 (Q27539), annotated to “mitochondrial unfolded protein response” (GO:0034514) based on the title of a reference paper: “ClpP mediates activation of a mitochondrial unfolded protein response in C. elegans”
  • 15.
    Critical assessment of functional annotations (CAFA)
    The Mooney group served as assessing group for CAFA
    • Collect all proteins from UniProt that have no GO annotation with experimental evidence
    • Submission of predicted GO annotations
    • Collect novel experimental GO annotations for the same proteins
    • Compare predictions with novel experimentally validated annotations
    [Timeline markers: January 2010, June 2010]
  • 16.
    CAFA results (1)
    • Our annotations ranked 15th out of 40 when predicting Molecular function (MFO) annotations: F-measure 0.4 (best 0.48)
    • Our annotations ranked 7th out of 36 when predicting Biological process (BPO) annotations: F-measure 0.33 (best 0.35)
    (1) http://biofunctionprediction.org/
  • 17.
    Gene set analysis
    Experiment:
    • Microarray
    • RNASeq
    • RNAi
    • Yeast-2-hybrid
    • ...
    List of genes:
    A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR, ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1, BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA, CEBPB, CHEK2, CLOCK, ...
    Hypothesis generation:
    • Pathway analysis
    • Gene Ontology enrichment analysis
    • GSEA
  • 18.
    Statistical Tracking of Ontological Phrases (STOP)
    Enrichment analysis web application using automated annotations
    http://mooneygroup.org/stop
  • 23.
    Statistical Tracking of Ontological Phrases (STOP)
    Enrichment analysis web application using automated annotations
    • Results as table
    • Term cloud
    • Revisit previous results
    • Entrez Gene ID, gene symbol or UniProt ID
    • Custom background
    • Filter results by ontology
    http://mooneygroup.org/stop
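As a rough illustration of term enrichment with an optional custom background, the sketch below uses a one-sided hypergeometric test. This is an assumption: the slides do not state which statistical test STOP applies, and the function names and toy data are hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n carry the term."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def enrich(gene_list, annotations, background=None):
    """Rank terms by enrichment p-value.

    annotations maps each term to the set of genes annotated with it.
    background defaults to all annotated genes; pass a set to customise it.
    """
    if background is None:
        background = set().union(*annotations.values())
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for term, annotated in annotations.items():
        annotated = set(annotated) & background
        k = len(genes & annotated)
        if k == 0:
            continue  # term not observed in the gene list
        p = hypergeom_sf(k, len(background), len(annotated), len(genes))
        results.append((term, k, p))
    return sorted(results, key=lambda r: r[2])


# Toy data: ten background genes, two hypothetical terms.
background = {f"g{i}" for i in range(1, 11)}
annotations = {
    "term_x": {"g1", "g2", "g3"},
    "term_y": {"g3", "g4", "g5", "g6", "g7", "g8"},
}
for term, overlap, p_value in enrich({"g1", "g2", "g3"}, annotations, background):
    print(term, overlap, round(p_value, 4))
```

The sketch reports raw p-values only; when testing terms from hundreds of ontologies at once, a multiple-testing correction would normally be applied on top.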
  • 24.
    Summary
    STOP using just GO ... because we provide
    • Automated annotations of genes to terms from over 200 ontologies
    • Enrichment analysis on novel annotations, which corrects for putative false positives and expands the realm of testable hypotheses
    • An easy-to-use web interface that allows quick analysis and assists in navigation of results
    • Human, worm, mouse, fly (... many more to come)
    • http://mooneygroup.org/stop
  • 25.
    Acknowledgements
    http://mooneygroup.org/stop
    Buck Institute for Research on Aging: Sean Mooney, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell
    NCBO: Nigam Shah and Trish Whetzel
    CAFA: Iddo Friedberg, Predrag Radivojac, Wyatt Clark
    Funding: NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi, Ellerby), NIH UL1DE019608 (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO and the Buck Trust.