Reiser aspb2019 asgiven

Bioinformatics, Computational and Systems Biology
Analytic Methods • Big Data • Software • Data Integration • FAIR • High Throughput Tech

What 20 years of gene function annotation reveals about the Arabidopsis genome
Leonore Reiser
Phoenix Bioinformatics and The Arabidopsis Information Resource

Outline
• What do we mean by gene function annotation
• Why use Gene Ontology annotations as proxy for what is known
• What do we know
• Changes to the annotation landscape over time
• Current status of the genome annotation
• What we don’t know
• How we can close the gap

Types of function information captured from the literature at TAIR
Arabidopsis Genes
• Biological Role/Activity (Gene Ontology Annotations)
  • GO Biological Process
  • GO Molecular Function
  • GO Cellular Component
• Expression (Plant Ontology Annotations)
  • PO Structure
  • PO Developmental Stage
• Alleles and Phenotypes
• Nomenclature/Symbols and curated summaries

Practical applications of GO annotations
• Functional annotations based on sequence, structural similarity/orthology
  • e.g. InterProScan GO assignment based on domains
• Term Enrichment
  • Statistical analysis of gene lists (over/underrepresentation)
• Gene set, genome classification
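The over-representation side of term enrichment is commonly framed as a hypergeometric test. The sketch below is one minimal way to compute it with the standard library only; the function name, parameter names, and all numbers are illustrative assumptions, not values or tooling from the talk.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Upper-tail hypergeometric probability P(X >= hits).

    population: genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the list being tested
    hits:       genes in the list carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Sum the probability of seeing `hits` or more annotated genes by chance.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Made-up numbers, for illustration only.
p = enrichment_p_value(population=20000, annotated=200, sample=100, hits=8)
print(f"p = {p:.3g}")
```

Under-representation would use the lower tail instead, and a real analysis would also correct for testing many terms at once. Because the background set is defined by which genes carry annotations, the result depends on the annotation release used.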

GO Evidence Codes
Non-EXP
• Inferred from Electronic Annotation (IEA)
• Inferred from Sequence or structural Similarity (ISS)
• Inferred from Genomic Context (IGC)
• Inferred from Biological aspect of Ancestor (IBA)
• Inferred from Biological aspect of Descendant (IBD)
• Inferred from Sequence Orthology (ISO)
• Inferred from Key Residues (IKR)
• Inferred from Rapid Divergence (IRD)
• Inferred from Reviewed Computational Analysis (RCA)
• ….
• Traceable Author Statement (TAS)
• Non-traceable Author Statement (NAS)
• Inferred by Curator (IC)
• No Biological Data available (ND)
EXP
• Inferred from Direct Assay (IDA)
• Inferred from Physical Interaction (IPI)
• Inferred from Mutant Phenotype (IMP)
• Inferred from Genetic Interaction (IGI)
• Inferred from Expression Pattern (IEP)
http://geneontology.org/page/guide-go-evidence-codes
ND (No Biological Data)
Genes that have no associated data from papers are annotated to the root ontology term (e.g. molecular function) with an evidence code of ND.
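The EXP / Non-EXP / ND grouping can be applied to an annotation file in a few lines. This is a minimal sketch. It assumes GAF 2.x tab-separated rows, where column 2 is the gene ID, column 5 the GO ID, column 7 the evidence code, and column 9 the aspect. The generic EXP code is added to the experimental set even though the slide does not list it. The gene IDs and references in the sample rows are hypothetical.

```python
from collections import Counter

# Experimental codes as grouped on the slide, plus the generic EXP code (an addition).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def evidence_class(code: str) -> str:
    """Return 'EXP', 'ND', or 'NON-EXP' for a GO evidence code."""
    code = code.strip().upper()
    if code == "ND":
        return "ND"
    if code in EXPERIMENTAL:
        return "EXP"
    return "NON-EXP"


def parse_gaf_line(line: str):
    """Extract (gene_id, go_id, evidence_code, aspect) from one GAF 2.x row."""
    cols = line.rstrip("\n").split("\t")
    return cols[1], cols[4], cols[6], cols[8]


# Hypothetical rows; GENE_A is annotated to the molecular_function root with ND.
rows = [
    "TAIR\tGENE_A\tgeneA\t\tGO:0003674\tREF:0\tND\t\tF",
    "TAIR\tGENE_B\tgeneB\t\tGO:0005634\tREF:1\tIDA\t\tC",
    "TAIR\tGENE_C\tgeneC\t\tGO:0003700\tREF:2\tIEA\t\tF",
]

counts = Counter(evidence_class(parse_gaf_line(r)[2]) for r in rows)
print(dict(counts))
```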

EXP (green) and Non-EXP (blue) annotations for some key model organisms and plants
[Figure: bar chart; x-axis: Species; y-axis: Number of annotations (0 to 600,000); series: EXP, Non-EXP]
Retrieved from AmiGO 7/16/2019

The number and type of GO annotations change
[Figure: number of annotations per year, 2002 to 2019; x-axis: Year; y-axis: Number of Annotations (0 to 300,000); series: #EXP, #ND, NON-EXP/ND; one point is marked with an asterisk]

Number of Annotated Genes by Evidence Type
[Figure: number of annotated genes per year, 2010 to 2019; x-axis: Year; y-axis: Number of Annotated Genes (0 to 30,000); series: EXP, NON EXP, ND]

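Arabidopsis Genome (proteome) Snapshot
[Figure: three pie charts (Molecular Function, Biological Process, Cellular Component); categories: Known: EXP, Known: Other, Unannotated, Unknown. Extracted percentages, with chart and category assignment not recoverable: 29%, 31%, 22%, 18%; 24%, 34%, 22%, 20%; 35%, 57%, 3%, 5%]
Data from TAIR 6/2019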

Genes with experimental annotations to all 3 GO aspects
[Figure: Venn diagram of MF, BP, and CC]
Data from TAIR 7/22/2019
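Counting genes with experimental annotations to all three aspects comes down to collecting aspects per gene. The sketch below is one possible way to do it; the input tuples, gene names, and function name are hypothetical, and the aspect letters follow the GAF convention (F = MF, P = BP, C = CC).

```python
from collections import defaultdict

# Experimental evidence codes; the generic EXP code is included as an assumption.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ALL_ASPECTS = {"F", "P", "C"}


def genes_with_all_aspects(annotations):
    """Return genes with experimental annotations in MF, BP, and CC.

    annotations: iterable of (gene_id, aspect, evidence_code) tuples.
    """
    seen = defaultdict(set)
    for gene, aspect, code in annotations:
        if code in EXPERIMENTAL:
            seen[gene].add(aspect)
    return {gene for gene, aspects in seen.items() if aspects >= ALL_ASPECTS}


# Hypothetical annotations: only GENE_A is covered experimentally in all aspects.
toy = [
    ("GENE_A", "F", "IDA"), ("GENE_A", "P", "IMP"), ("GENE_A", "C", "IDA"),
    ("GENE_B", "F", "IDA"), ("GENE_B", "P", "IEA"), ("GENE_B", "C", "IDA"),
    ("GENE_C", "F", "ND"),
]
print(genes_with_all_aspects(toy))
```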

Is this really a reflection of what is ‘known’?
• We have not comprehensively curated the literature
  • TAIR curation triage focus is on ‘unknowns’
  • There is a backlog of papers
• Function is known but data are ‘missing’ or unavailable
  • Unpublished data
  • Published data that is not linked to genes

~1000 proteins have no GO annotations at all
Source: TAIR 6/2019

Why are they unknown?
• Non-biological reasons related to publication (previously noted)
• Biology/tractability
• Duplication/redundancy/orphans
• Difficult to detect, low level expression
• No existing mutants
• Plant specific (i.e. no phylogenetic inference from well studied non-plant species)
• Difficulty inferring function/process via computation, no EXP baseline

A strategy for curating Arabidopsis genes with no known functions
Start: Arabidopsis genes with unknown/missing functions
• Are any associated with papers with curatable data? (manually review)
  • Yes: Curate and add GO terms → Genes with known functions (EXP)
  • No: Make ND annotation if one does not exist, then find orthologs
• Are there orthologs with associated exp functions?
  • Yes: Genes with known functions (ISS/IBA)
  • No: Conserved Unknowns
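The curation strategy on this slide reads as a two-question decision flow. The sketch below encodes one reading of it, and the Yes/No branch assignments are an interpretation of the flowchart rather than something the slide text states outright. The function, flag, and enum names are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    KNOWN_EXP = "Genes with known functions (EXP)"
    KNOWN_INFERRED = "Genes with known functions (ISS/IBA)"
    CONSERVED_UNKNOWN = "Conserved Unknowns"


def triage(has_curatable_papers: bool,
           orthologs_have_exp_function: bool,
           has_nd_annotation: bool):
    """Return (outcome, curator actions) for one gene of unknown function."""
    actions = []
    if has_curatable_papers:
        # Manual review found curatable data: annotate from the literature.
        actions.append("curate and add GO terms")
        return Outcome.KNOWN_EXP, actions
    if not has_nd_annotation:
        actions.append("make ND annotation")
    actions.append("find orthologs")
    if orthologs_have_exp_function:
        # Assumed branch: function is inferred from orthologs (ISS/IBA).
        return Outcome.KNOWN_INFERRED, actions
    return Outcome.CONSERVED_UNKNOWN, actions


outcome, actions = triage(has_curatable_papers=False,
                          orthologs_have_exp_function=False,
                          has_nd_annotation=False)
print(outcome.value, actions)
```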

So, where are we now?
• Annotation landscape is still changing
  • Effect on downstream analysis
• 26,043 protein coding genes with at least 1 annotation (94%); 13,033 (50%) have experimental evidence
• 3,756 experimentally annotated proteins have data for all 3 GO aspects (~14% of all proteins)
  • Rhee and Mutwil reported 1,447 in 2014… progress!
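The counts and percentages on this slide can be cross-checked with a few lines of arithmetic. In this sketch the total number of protein coding genes is not taken from the slide; it is derived from the 94% figure, which is an assumption about how that percentage was computed.

```python
annotated_genes = 26043    # protein coding genes with at least 1 annotation (94%)
exp_genes = 13033          # genes with experimental evidence
all_aspects_2019 = 3756    # EXP annotations in all 3 GO aspects
all_aspects_2014 = 1447    # figure attributed to the 2014 report

# Assumption: 94% is annotated_genes divided by all protein coding genes.
implied_total = annotated_genes / 0.94

share_exp_of_annotated = exp_genes / annotated_genes
share_exp_of_total = exp_genes / implied_total
share_all_aspects = all_aspects_2019 / implied_total
fold_change = all_aspects_2019 / all_aspects_2014

print(f"implied total genes:       {implied_total:.0f}")
print(f"EXP share of annotated:    {share_exp_of_annotated:.1%}")
print(f"EXP share of implied total: {share_exp_of_total:.1%}")
print(f"all-3-aspects share:       {share_all_aspects:.1%}")
print(f"2014 to 2019 fold change:  {fold_change:.1f}x")
```

Under that assumption, the 50% figure matches the share of annotated genes more closely than the share of the implied total.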

Filling in the gaps
• Increase capture of what is known from the Arabidopsis literature (retroactive)
  • Increase curator output and community input (TOAST)
  • Gene centric triage: review new papers since last annotated
  • Automation/Machine Learning to extract from literature
• Curate as part of publication process (proactive)
  • TOAST for Arabidopsis genes
  • ASPB/Gramene curation concurrent with publication
  • Arabidopsis Micropublications! (ask us about this)
• Curate experimental data from other plant species
  • GOAT project @Phoenix: annotate any gene from any genome
  • PhyloGenes (www.phylogenes.org) project: phylogenetic inference of function
  • Other DB based curation efforts (e.g. UniProt, MaizeGDB, SGN)
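Gene centric triage, as described above, amounts to comparing each gene's last annotation date with the publication dates of the papers linked to it. The sketch below is a minimal illustration. The data structures, gene names, paper IDs, and dates are all hypothetical and do not reflect how TAIR implements its triage.

```python
from datetime import date


def genes_needing_review(last_annotated, papers):
    """Map each gene to papers published after its last annotation.

    last_annotated: dict of gene_id -> date of most recent annotation
    papers:         iterable of (paper_id, gene_id, publication_date)
    Genes with no annotation date are always flagged.
    """
    backlog = {}
    for paper_id, gene, published in papers:
        last = last_annotated.get(gene)
        if last is None or published > last:
            backlog.setdefault(gene, []).append(paper_id)
    return backlog


# Hypothetical inputs, for illustration only.
last_annotated = {"GENE_A": date(2015, 3, 1), "GENE_B": date(2019, 1, 15)}
papers = [
    ("paper1", "GENE_A", date(2018, 6, 30)),   # newer than last annotation
    ("paper2", "GENE_B", date(2017, 2, 10)),   # already covered
    ("paper3", "GENE_C", date(2016, 9, 5)),    # gene never annotated
]
print(genes_needing_review(last_annotated, papers))
```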

Thanks to….
Peifen
Erica
Leonore
Tanya
Shabari
Amina
Connie
Laura
Eva
TAIR
Community
Trilok
Xinggou
Swapnil
Qian
Efrain
Nic
Thu

Questions/Comments
• Visit us here: Booth 403
• Email us: curator@arabidopsis.org
