2. What 20 years of gene function annotation
reveals about the Arabidopsis genome
Leonore Reiser
Phoenix Bioinformatics and The Arabidopsis Information Resource
3. Outline
• What do we mean by gene function annotation
• Why use Gene Ontology annotations as proxy for what is known
• What do we know
• Changes to the annotation landscape over time
• Current status of the genome annotation
• What we don’t know
• How we can close the gap
4. Types of function information captured from the literature at TAIR
Arabidopsis Genes
Biological Role/Activity (Gene Ontology Annotations)
• GO Biological Process
• GO Molecular Function
• GO Cellular Component
Expression (Plant Ontology Annotations)
• PO Structure
• PO Developmental Stage
Alleles and Phenotypes
Nomenclature/Symbols and curated summaries
5. Practical applications of GO annotations
• Functional annotations based on sequence, structural similarity/orthology
• e.g. InterProScan GO assignment based on domains
• Term Enrichment
• Statistical analysis of gene lists (over/underrepresentation)
• Gene set, genome classification
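The over-representation analysis mentioned above is often done with a one-sided hypergeometric test. The following is a minimal sketch of that test, not the method used by any particular enrichment tool; the function name and parameter names are illustrative assumptions, and a real analysis would also correct for multiple testing across GO terms.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """One-sided over-representation p-value, P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated to the GO term
    sample:     size of the gene list being tested
    hits:       genes in the list annotated to the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probability mass from `hits` up to the maximum
    # possible overlap; comb() returns 0 for impossible combinations.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers only: 10 background genes, 5 annotated to the term,
# a list of 5 genes that are all annotated to it.
p_value = hypergeom_enrichment_p(population=10, annotated=5, sample=5, hits=5)
```

Because the test depends on how many background genes carry the term, any change in the underlying annotations changes the p-values that downstream users see.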
6. GO Evidence Codes
• Inferred from Electronic Annotation (IEA)
• Inferred from Sequence or structural Similarity (ISS)
• Inferred from Genomic Context (IGC)
• Inferred from Biological aspect of Ancestor (IBA)
• Inferred from Biological aspect of Descendant (IBD)
• Inferred from Sequence Orthology (ISO)
• Inferred from Key Residues (IKR)
• Inferred from Rapid Divergence (IRD)
• Inferred from Reviewed Computational Analysis (RCA)
• ….
• Traceable Author Statement (TAS)
• Non-traceable Author Statement (NAS)
• Inferred by Curator (IC)
• No Biological Data available (ND)
• Inferred from Direct Assay (IDA)
• Inferred from Physical Interaction (IPI)
• Inferred from Mutant Phenotype (IMP)
• Inferred from Genetic Interaction (IGI)
• Inferred from Expression Pattern (IEP)
http://geneontology.org/page/guide-go-evidence-codes
ND (No Biological Data)
Genes that have no associated data from papers are annotated to the root ontology term (e.g. molecular function) with an evidence code of ND.
EXP
Non-EXP
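The EXP / Non-EXP split used in the rest of the talk can be sketched as a simple lookup on the evidence code. This is an illustrative grouping that covers only the codes listed on this slide (plus GO's generic EXP code); the set names, category labels, and function name are assumptions, and codes elided on the slide are deliberately reported as unrecognized rather than guessed.

```python
# Experimental codes from the slide, plus GO's generic "EXP" code.
EXPERIMENTAL = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})
# Computational / similarity-based codes listed on the slide.
COMPUTATIONAL = frozenset(
    {"IEA", "ISS", "IGC", "IBA", "IBD", "ISO", "IKR", "IRD", "RCA"}
)
# Author-statement and curator-assigned codes, including the ND special case.
CURATOR_OR_AUTHOR = frozenset({"TAS", "NAS", "IC", "ND"})


def evidence_category(code: str) -> str:
    """Return a coarse category label for a GO evidence code."""
    normalized = code.strip().upper()
    if normalized in EXPERIMENTAL:
        return "EXP"
    if normalized in COMPUTATIONAL:
        return "non-EXP (computational)"
    if normalized in CURATOR_OR_AUTHOR:
        return "non-EXP (curator/author)"
    # Codes not shown on the slide are not classified here.
    return "unrecognized"


categories = {code: evidence_category(code) for code in ("IDA", "IBA", "ND")}
```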
7. EXP (green) and Non EXP (blue) annotations for some key
model organisms and plants
Retrieved from AmiGO 7/16/2019
[Bar chart: number of annotations (y-axis, 0 to 600,000) per species (x-axis), split into EXP and Non EXP]
12. Is this really a reflection of what is ‘known’?
• We have not comprehensively curated the literature
• TAIR curation triage focus is on ‘unknowns’
• There is a backlog of papers
• Function is known but data are ‘missing’ or unavailable
• Unpublished data
• Published data that is not linked to genes
14. Why are they unknown?
• Non biological reasons related to publication (previously noted)
• Biology/tractability
• Duplication/redundancy/orphans
• Difficult to detect, low level expression
• No existing mutants
• Plant specific (i.e. no phylogenetic inference from well studied non plant species)
• Difficulty inferring function/process via computation, no EXP baseline
15. A strategy for curating Arabidopsis genes with no known functions
[Flowchart; boxes listed as extracted, arrows not recoverable]
Start: Arabidopsis genes with unknown/missing functions
Decision: Are any associated with papers with curatable data? (manually review) Yes / No
Action: Curate and add GO terms
Action: Make ND annotation if one does not exist
Action: Find orthologs
Decision: Are there orthologs with associated EXP functions? Yes / No
Outcome: Genes with known functions (EXP)
Outcome: Genes with known functions (ISS/IBA)
Outcome: Conserved Unknowns
16. So, where are we now?
• Annotation landscape is still changing
• Effect on downstream analysis
• 26043 protein coding genes with at least 1 annotation (94%); 13,033 (50%) have experimental evidence
• 3756 experimentally annotated proteins have data for all 3 GO aspects (~14% of all proteins)
• Rhee and Mutwil reported 1447 in 2014…progress!
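Counts like those on this slide can be derived from a flat list of annotations. The sketch below assumes each record is a (gene, aspect, evidence code) tuple with aspects written as in GAF files (P, F, C), and it treats ND annotations as "unknown" rather than as annotations; both the record layout and that treatment are assumptions made for illustration, and the gene identifiers are placeholders.

```python
from collections import defaultdict

EXPERIMENTAL = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})
ALL_ASPECTS = frozenset({"P", "F", "C"})  # process, function, component


def summarize_coverage(annotations):
    """Count genes by annotation status from (gene, aspect, code) records."""
    any_aspects = defaultdict(set)  # gene -> aspects with any non-ND annotation
    exp_aspects = defaultdict(set)  # gene -> aspects with experimental evidence
    for gene, aspect, code in annotations:
        if code == "ND":
            continue  # assumption: ND marks "no data", not a real annotation
        any_aspects[gene].add(aspect)
        if code in EXPERIMENTAL:
            exp_aspects[gene].add(aspect)
    return {
        "annotated": len(any_aspects),
        "experimental": len(exp_aspects),
        "experimental_all_aspects": sum(
            1 for aspects in exp_aspects.values() if aspects >= ALL_ASPECTS
        ),
    }


# Placeholder records, not real TAIR data.
toy_annotations = [
    ("geneA", "P", "IDA"), ("geneA", "F", "IMP"), ("geneA", "C", "IDA"),
    ("geneB", "C", "IEA"),
    ("geneC", "F", "ND"),
    ("geneD", "P", "IMP"),
]
toy_summary = summarize_coverage(toy_annotations)
```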
17. Filling in the gaps
• Increase capture of what is known from the Arabidopsis literature (retroactive)
• Increase curator output and community input (TOAST)
• Gene centric triage: review new papers since last annotated
• Automation/Machine Learning to extract from literature
• Curate as part of publication process (proactive)
• TOAST for Arabidopsis genes
• ASPB/Gramene curation concurrent with publication
• Arabidopsis Micropublications! (ask us about this)
• Curate experimental data from other plant species
• GOAT project @Phoenix: annotate any gene from any genome
• PhyloGenes (www.phylogenes.org) project: phylogenetic inference of function
• Other DB based curation efforts (e.g. UniProt, MaizeGDB, SGN)
At TAIR, our curation focuses on capturing experimental data from the literature. This figure illustrates the types of data we capture from the papers we read and codify in TAIR to make the data machine readable and computationally accessible. Since 2001 we have been capturing experimental data about function in the form of annotations to Gene Ontology (GO) terms that describe the biological roles of genes within the cell and organism. We use the Plant Ontology (PO) to capture expression data, and we curate gene names and summaries. More recently we have been curating information about alleles and phenotypes. All of these represent information about gene function, but for the purposes of this presentation and analysis we use GO annotation as a proxy for what is known.
GO is commonly used to represent gene function and as a proxy for what is known.
It is widely used as a tool for inferring gene function based on sequence or structural similarity.
It is also used for generating hypotheses about gene function, such as term enrichment following RNA-seq to look for patterns, and for classifying sets of genes.
The quality of inferences made using GO annotations is dependent on the quality of the annotations themselves.
Evidence codes provide a metric for assessing the quality of an annotation, that is, of the assertion it makes. They generally fall into two broad categories.
Experimental annotations are assertions based on published, traceable experimental evidence and are considered the ‘gold standard’.
Non-experimental or computational annotations can also be of high quality, such as phylogeny-based annotations or other supervised methods. The evidence codes provide information about the evidence in support of the assertions made between a GO term and a gene.
Another type of non-EXP, non-computational evidence code is assigned by curators. These are less frequently used, except for one special case.
Graph showing the total number of annotations for Arabidopsis, some other model organisms, and other annotated plant genomes. To the left are non-plant species that have been subject to manual curation by various model organism databases (MODs) and projects. To the right are other plant species with GO annotations. It is very hard to see, but there are a small number of EXP annotations for all of these species; the majority of annotations for other plants are based on sequence similarity, phylogenetic inference, or structural features. These methods all fundamentally rely on experimental data to set the baseline. Therefore, to improve coverage and accuracy for all plant genomes, we need to capture experimental data.
This graph shows the total number of annotations over time, from 2002 to June of this year. We started annotating with GO in 2001 with a small number of experimental annotations. The number of EXP annotations continues to increase, as does the number of non-EXP annotations. Note the increase in annotations in 2012 and the decrease a year later. This was largely due to a high-throughput RCA biological process dataset that was ultimately removed (affecting ~4793 loci).
Graph from the last 10 years. The top line represents the number of loci with at least one EXP annotation. The green line represents loci with at least one EXP annotation, and the grey line loci with at least one ND annotation. These are not mutually exclusive categories: a locus may exist in both classes, and it is included in each set for which it has an annotation. The decrease in loci with unknown annotations in 2012 is due to removing transposable element genes and pseudogenes. The increase in known loci could be due to the increase in EXP annotations, but there is also an increase in predictions; 2012 saw the addition of a large high-throughput (HTP) dataset for cellular component. The red line indicates when TAIR began requiring subscriptions to fund biocuration. The inflection point could also be due to more EXP data becoming available as people use CRISPR or other technologies to assess gene function for formerly unknown genes, or to better predictions leading to non-EXP annotations. The large dataset that was removed affected a relatively small subset of loci.
This breaks the previous dataset down by aspect at its current status. The previous counts were cumulative; from this you can see that much of the non-EXP coverage of the genome is cellular component. We are now looking specifically at the proteome, because this is primarily what is used for transferring gene function and for gene set analysis, so the overall coverage percentages are higher. The categories are non-overlapping. Over half of the proteome has some kind of annotation for process or function, either experimentally or computationally inferred. The much higher number for cellular component probably reflects both the many more proteomics experiments that are out there and a greater ability to infer localization computationally.
The major difference is due to the large number of genes with non-EXP component annotations, which results in a much higher overlap, so the percentage of genes included in the middle is greater.
The other notable difference is the overlap between BP and MF annotations. It can be assumed that if you have a function for a gene then you can also annotate it to a process, so in theory any gene with an MF annotation should also have a BP annotation. A possible reason why the EXP overlap between MF and BP is not as high is that some of those MF EXP annotations are binding annotations, and protein or chemical binding does not intuitively lead to a function. Indeed, when I take those IDs and categorize or run enrichment on that subset, they are highly enriched for binding terms (477 protein binding, 277 nucleic acid binding).
A little over 1000 protein-coding genes have no annotations at all; 7614 are missing both process and function annotations.
Flow for one strategy to fill in the function for unknowns. Numbers refer to the ‘complete unknown’ set, where there are no annotations at all. The focus is on unknowns in Arabidopsis that are conserved, since the goal is to be a reference for other plant species. We would prioritize genes based on 1) annotation status and 2) available literature. For the set of all unknowns, 425 have papers that need to be reviewed for curatable information; prioritize those that have names. 175 unknowns are newly inserted from Araport11, which may be the reason why they were not included in any gene families in PANTHER/PLAZA.
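The prioritization described in these notes (annotation status first, then available literature, then whether the gene has a name) can be sketched as a sort key. This is one possible reading of the ordering, not TAIR's actual triage procedure; the record fields, the function name, and the locus identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UnknownGene:
    locus: str            # placeholder identifier, not a real AGI code
    has_annotation: bool  # True if the gene has any non-ND GO annotation
    paper_count: int      # papers linked to the gene awaiting review
    has_symbol: bool      # True if the gene has a name/symbol


def triage_order(genes):
    """Sort genes so the highest-priority curation targets come first.

    Assumed ordering: unannotated genes first, then genes with papers to
    review, then named genes; locus breaks ties for a stable result.
    """
    return sorted(
        genes,
        key=lambda g: (g.has_annotation, g.paper_count == 0,
                       not g.has_symbol, g.locus),
    )


queue = triage_order([
    UnknownGene("locus3", False, 0, False),
    UnknownGene("locus1", False, 2, True),
    UnknownGene("locus2", False, 1, False),
    UnknownGene("locus4", True, 5, True),
])
queue_loci = [g.locus for g in queue]
```

Keeping the criteria in a single tuple makes it easy to reorder them if, for example, conservation across plant species should outrank literature availability.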