Stop biocuration

STOP Using Just GO: A Multi-Ontology Hypothesis Generation Tool for High Throughput Experimentation

Tobias Wittkop, Emily TerAvest, Uday Evani, K. Mathew Fleisch, Ari E. Berman, Corey Powell, Nigam Shah and Sean D. Mooney

Mooney laboratory, Buck Institute for Research on Aging, Novato, CA
National Center for Biomedical Ontology, Stanford University, Stanford

Gene set analysis

Experiment
• Microarray
• RNASeq
• RNAi
• Yeast-2-hybrid

List of genes
A2M, ABL1, ADCY5, AGPAT2, AIFM1, AKT1, APEX1, APOC3, APOE, APP, APTX, AR, ARHGAP1, ARNTL, ATF2, ATM, ATP5O, ATR, BAX, BCL2, BDNF, BLM, BMI1, BRCA1, BRCA2, BSCL2, BUB1B, BUB3, CACNA1A, CAT, CCNA2, CDC2, CDC42, CDKN2A, CEBPA, CEBPB, CHEK2, CLOCK, ...

Hypothesis generation
• Pathway analysis
• Gene Ontology enrichment analysis
• GSEA
• ...

Gene annotations outside of GO

• NCBO currently includes over 200 ontologies

• Manually curated gene-term annotations, which are necessary for
term enrichment, are not available for the other ontologies

• Gene/protein summaries in Entrez Gene and UniProt are often more
up to date than manually curated GO annotations

• NCBO provides an annotator service that matches text to ontology terms

Idea: Use descriptive text about genes to
derive up-to-date annotations of
genes to a wide spectrum of ontologies

Automatic gene annotation pipeline

1. Collect genome/proteome from UCSC and UniProt
2. Collect descriptive text for each gene/protein from Entrez Gene/UniProt
3. Annotate text to over 200 ontologies via NCBO Annotator

Example descriptive text (UniProt Q147X3, human):

The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). ; Kinase-selective enrichment enables quantitative phosphoproteomics of the kinome across the cell cycle. ; A quantitative atlas of mitotic phosphorylation. ; A synopsis of eukaryotic Nalpha-terminal acetyltransferases: nomenclature, subunits and substrates. ; Knockdown of human N alpha-terminal acetyltransferase complex C leads to p53-dependent apoptosis and aberrant human Arl8b localization. ; Lysine acetylation targets protein complexes and co-regulates major cellular functions. ;
-!- FUNCTION: Catalytic subunit of the N-terminal acetyltransferase C (NatC) complex. Catalyzes acetylation of the N-terminal methionine residues of peptides beginning with Met-Leu-Ala and Met-Leu-Gly. Necessary for the lysosomal localization and function of ARL8B.
-!- CATALYTIC ACTIVITY: Acetyl-CoA + peptide = N(alpha)-acetylpeptide + CoA.
-!- SUBUNIT: Component of the N-terminal acetyltransferase C (NatC) complex, which is composed of NAA35, LSMD1 and NAA30.
-!- SUBCELLULAR LOCATION: Cytoplasm.
-!- ALTERNATIVE PRODUCTS: Event=Alternative splicing; Named isoforms=2; Name=1; IsoId=Q147X3-1; Sequence=Displayed; Name=2; IsoId=Q147X3-2; Sequence=VSP_031581; Note=No experimental confirmation available;
-!- SIMILARITY: Belongs to the acetyltransferase family. MAK3 subfamily.
-!- SIMILARITY: Contains 1 N-acetyltransferase domain.
-. KEGG; hsa:122830; -. UCSC; uc001xcx.2; human. CTD; 122830; -. GeneCards; GC14P038022; -. H-InvDB; HIX0011696; -. HGNC; HGNC:19844; NAA30. neXtProt; NX_Q147X3; -. PharmGKB; PA134931315; -. eggNOG; prNOG15463; -. GeneTree; ENSGT00390000005665; -. HOGENOM; HBG282398; -. HOVERGEN; HBG082671; -. InParanoid; Q147X3; -. OMA; AGVHSGE; -. OrthoDB; EOG4KKZ4S; -. PhylomeDB; Q147X3; -. NextBio; 81013; -. ArrayExpress; Q147X3; -. Bgee; Q147X3; -. CleanEx; HS_NAT12; -. Genevestigator; Q147X3; -. GO; GO:0005737; C:cytoplasm; IEA:UniProtKB-SubCell. GO; GO:0004596; F:peptide alpha-N-acetyltransferase activity; IEA:EC. InterPro; IPR000182; AcTrfase_GCN5-related_dom. InterPro; IPR016181; Acyl_CoA_acyltransferase. Gene3D; G3DSA:3.40.630.30; Acyl_CoA_acyltransferase; 1. Pfam; PF00583; Acetyltransf_1; 1. SUPFAM; SSF55729; Acyl_CoA_acyltransferase; 1. PROSITE; PS51186; GNAT; 1.

[Figure: ontology terms matched in the example text. Gene Ontology: Biological process, Molecular function, Apoptosis signaling, Cellular function. Cell cycle ontology: Biological process, Biological continuant, Cytokinetic process, DNA replication initiation, Acetyltransferase.]

Gene/protein specific text as annotation source

• Gene text from Entrez Gene
• Protein text from UniProt
• Gene/Protein summary
• Publication titles
• GO annotations
• Pathway annotations
• GeneRIFs
• Protein complexes, domains, interactions
• We filter out author names, database names and numbers

[Figure: example descriptive text for UniProt Q147X3 (human)]
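
The filtering step listed above could look roughly like the sketch below. The slides do not say how author names, database names or numbers are recognized, so the stoplists `DB_NAMES` and `AUTHOR_NAMES`, the function name `filter_text` and the rule "drop any token containing a digit" are all assumptions made for illustration.

```python
import string

# Hypothetical stoplists; the slides do not describe how names are detected.
DB_NAMES = {"kegg", "ucsc", "pfam", "interpro", "hgnc", "genecards"}
AUTHOR_NAMES = {"jonquet", "mooney"}


def filter_text(text):
    """Drop database names, author names and digit-bearing tokens from text."""
    kept = []
    for raw in text.split():
        # Remove punctuation attached to either end of the token.
        token = raw.strip(string.punctuation)
        if not token:
            continue
        # Assumption: any token with a digit (ids, counts) counts as a "number".
        if any(ch.isdigit() for ch in token):
            continue
        if token.lower() in DB_NAMES or token.lower() in AUTHOR_NAMES:
            continue
        kept.append(token)
    return " ".join(kept)


sample = "KEGG; hsa:122830; Catalytic subunit of the NatC complex. Pfam; PF00583;"
print(filter_text(sample))
```

In this toy form, identifiers such as `hsa:122830` and `PF00583` are dropped because they contain digits, and only the descriptive words are passed on to the annotator.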

The NCBO annotator 1

• Simple string matching using mgrep
• Synonyms are annotated
• Annotations are propagated to the root
• Mappings between terms from different ontologies
• No NLP
• Very fast

[Figure: example term hierarchies from the Gene Ontology and the Cell cycle ontology]

1 C. Jonquet et al. AMIA Summit on Translational Bioinformatics (2009)
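
The behaviour listed on this slide (string matching, synonym matching, propagation to the root) can be illustrated with a small sketch. This is not mgrep or the NCBO Annotator API: the toy ontology, the term ids `T:1` to `T:5` and the function names are hypothetical, and plain regular-expression matching stands in for mgrep.

```python
import re

# Toy ontology with hypothetical ids: term id -> (label, synonyms, parent id or None)
ONTOLOGY = {
    "T:1": ("biological process", [], None),
    "T:2": ("cell death", [], "T:1"),
    "T:3": ("apoptosis", ["programmed cell death"], "T:2"),
    "T:4": ("molecular function", [], None),
    "T:5": ("acetyltransferase activity", ["acetylase activity"], "T:4"),
}


def build_dictionary(ontology):
    """Map every lowercase label and synonym to its term id."""
    lookup = {}
    for term_id, (label, synonyms, _parent) in ontology.items():
        for phrase in [label] + synonyms:
            lookup[phrase.lower()] = term_id
    return lookup


def annotate(text, ontology):
    """Return (direct matches, matches propagated up to the root terms)."""
    lowered = text.lower()
    direct = set()
    for phrase, term_id in build_dictionary(ontology).items():
        # Whole-phrase, case-insensitive match; no linguistic processing.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            direct.add(term_id)
    propagated = set(direct)
    for term_id in direct:
        parent = ontology[term_id][2]
        while parent is not None:  # walk up until the root is reached
            propagated.add(parent)
            parent = ontology[parent][2]
    return direct, propagated


direct, propagated = annotate("Knockdown leads to p53-dependent apoptosis", ONTOLOGY)
print(sorted(direct), sorted(propagated))
```

Here a single hit on "apoptosis" also yields annotations to its ancestors "cell death" and "biological process", which is the propagation step named on the slide.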

Annotation results

561,577,156 annotations of 73,248 genes
and 146,271 proteins to 404,347 terms from
246 ontologies for 4 organisms (human,
mouse, fly and worm)

10 most annotated ontologies

[Chart: number of annotations per ontology]

Ontologies (legend order): SNOMED Clinical Terms; NCI Thesaurus; NIFSTD; CRISP Thesaurus, 2006; Read Codes, Clinical Terms Version 3 (CTV3); Galen; Suggested Ontology for Pharmacogenomics; Gene Ontology Extension; Human developmental anatomy, timed version; Gene Ontology

Annotation counts shown in the chart: 35,767,064; 34,137,453; 33,628,760; 16,338,723; 15,937,632; 14,541,946; 13,526,445; 13,341,529; 13,177,577; 12,584,050

STOP/GO evaluation

• Compared existing GO annotations of genes/proteins with our
automated GO annotations

• High recall, 0.95-0.99 (we reproduce existing
annotations)

• Lower precision, 0.6-0.75 (we add new annotations)

• How good are the novel annotations?
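
For reference, recall and precision in the comparison above can be computed as in this minimal sketch, which treats the curated GO annotations as the reference set. The gene and term identifiers are made up, and the slides do not say whether the figures were computed per gene or over all gene-term pairs; the sketch uses all pairs.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted (gene, term) pairs against a reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example data: curated GO annotations vs. automated annotations.
curated = {("GENE1", "GO:A"), ("GENE1", "GO:B"), ("GENE2", "GO:C")}
automated = {("GENE1", "GO:A"), ("GENE1", "GO:B"), ("GENE2", "GO:C"), ("GENE2", "GO:D")}

precision, recall = precision_recall(automated, curated)
print(precision, recall)  # 0.75 1.0
```

Every automated annotation that is missing from the curated set lowers precision, whether it is an error or a correct novel annotation, which is why the last bullet asks how good the novel annotations are.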

Novel GO annotation examples

Human protein carboxylesterase 1 (P23141)
annotated to

“cocaine metabolic process” (GO:0050783)
based on the title of a reference paper:

“Structural basis of heroin and cocaine metabolism by a promiscuous human drug-processing enzyme”

C. elegans protein (Q27539) ATP-dependent Clp protease proteolytic subunit 1
annotated to

“mitochondrial unfolded protein response” (GO:0034514)
based on the title of a reference paper:

“ClpP mediates activation of a mitochondrial unfolded protein response in C. elegans”

Critical assessment of functional annotations (CAFA)

The Mooney group was the assessing group for CAFA

January 2010
• Collect all proteins from UniProt that have no GO annotation with experimental evidence
• Submission of predicted GO annotations

June 2010
• Collect novel experimental GO annotations for the same proteins
• Compare predictions with novel experimentally validated annotations

CAFA results 1

• Our annotations ranked 15th out of 40 when predicting Molecular function (MFO) annotations: F-measure 0.4 (best 0.48)

• Our annotations ranked 7th out of 36 when predicting Biological process (BPO) annotations: F-measure 0.33 (best 0.35)

1 http://biofunctionprediction.org/

Statistical Tracking of Ontological Phrases (STOP)

Enrichment analysis web application using automated annotations

http://mooneygroup.org/stop

• Results as table
• Termcloud
• Revisit previous results
• Entrez gene id, gene symbol or UniProt id
• Custom background
• Filter results by ontology
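
These slides do not state which statistical test STOP uses for enrichment. As an illustration of term enrichment against a custom background, the sketch below uses a one-sided hypergeometric test, a common choice for this task; the function names and the toy data are assumptions, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrich(gene_list, background, annotations):
    """annotations: term -> set of genes. Returns [(term, hits, p)] sorted by p."""
    background = set(background)
    genes = set(gene_list) & background
    results = []
    for term, annotated in annotations.items():
        annotated = annotated & background  # restrict to the chosen background
        hits = len(genes & annotated)
        if hits == 0:
            continue
        p = hypergeom_pvalue(hits, len(genes), len(annotated), len(background))
        results.append((term, hits, p))
    return sorted(results, key=lambda item: item[2])


# Toy data: 10 background genes and two hypothetical terms.
background = [f"g{i}" for i in range(10)]
annotations = {
    "term A": {"g0", "g1", "g2"},
    "term B": {"g0", "g5", "g6", "g7", "g8", "g9"},
}
for term, hits, p in enrich(["g0", "g1", "g2"], background, annotations):
    print(term, hits, round(p, 4))
```

A term whose annotated genes are concentrated in the input list ("term A") gets a small p-value, while a broadly annotated term ("term B") does not; changing the background changes both counts and p-values.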


Summary

STOP using just GO ... because we provide

• Automated annotations of genes to terms from over 200 ontologies

• Enrichment analysis on novel annotations, which corrects for putative false
positives and expands the realm of testable hypotheses

• An easy-to-use web interface that allows quick analysis and assists in
navigation of results

• Human, worm, mouse, fly (... many more to come)

• http://mooneygroup.org/stop

Acknowledgements


Buck Institute for Research on Aging
Sean Mooney, Emily TerAvest, Uday Evani, Ari Berman, Tal Oron Ronnen, Mathew Fleisch, Corey Powell

NCBO
Nigam Shah and Trish Whetzel

CAFA
Iddo Friedberg, Predrag Radivojac, Wyatt Clark

Funding
NIH R01 LM009722 (PI: Mooney), NIH U54-HG004028 (PI: Musen), NIH T32-AG000266 (PIs: Campisi, Ellerby), NIH UL1DE019608 (PI: Lithgow), NIH RL9AG032114 (U54 Geroscience), the NCBO and the Buck Trust.
