[Title slide graphic labels: Analytic Methods; Big Data; Software; Data Integration; FAIR; High Throughput Tech]
Bioinformatics, Computational and Systems Biology
What 20 years of gene function annotation
reveals about the Arabidopsis genome
Leonore Reiser
Phoenix Bioinformatics and The Arabidopsis Information Resource
Outline
• What do we mean by gene function annotation
• Why use Gene Ontology annotations as proxy for what is known
• What do we know
• Changes to the annotation landscape over time
• Current status of the genome annotation
• What we don’t know
• How we can close the gap
Types of function information captured from the literature at TAIR
Arabidopsis Genes
Biological Role/Activity (Gene Ontology Annotations)
• GO Biological Process
• GO Molecular Function
• GO Cellular Component
Expression (Plant Ontology Annotations)
• PO Structure
• PO Developmental Stage
Alleles and Phenotypes
Nomenclature/Symbols and curated summaries
Practical applications of GO annotations
• Functional annotations based on sequence, structural similarity/orthology
• e.g. InterProScan GO assignment based on domains
• Term Enrichment
• Statistical analysis of gene lists (over/underrepresentation)
• Gene set, genome classification
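The "statistical analysis of gene lists" bullet can be illustrated with a small sketch. This assumes a one-sided hypergeometric test, which is one common way to score overrepresentation of a GO term in a gene list; the slide does not name a specific test, and the function name and the counts in the usage line are invented for illustration.

```python
from math import comb


def hypergeom_enrichment_p(study_hits, study_size, pop_hits, pop_size):
    """Probability of seeing at least `study_hits` genes annotated to a term
    when `study_size` genes are drawn without replacement from a population
    of `pop_size` genes, `pop_hits` of which carry the term."""
    max_hits = min(study_size, pop_hits)
    total = comb(pop_size, study_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 5 of 20 genes in the list carry a term that
# annotates 50 of 1000 genes in the background set.
p_value = hypergeom_enrichment_p(5, 20, 50, 1000)
print(f"p = {p_value:.3g}")
```

The result depends directly on which genes count as annotated to the term, which is why the quality and evidence type of the underlying annotations matter for this kind of analysis.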
GO Evidence Codes
Non-EXP
• Inferred from Electronic Annotation (IEA)
• Inferred from Sequence or structural Similarity (ISS)
• Inferred from Genomic Context (IGC)
• Inferred from Biological aspect of Ancestor (IBA)
• Inferred from Biological aspect of Descendant (IBD)
• Inferred from Sequence Orthology (ISO)
• Inferred from Key Residues (IKR)
• Inferred from Rapid Divergence (IRD)
• Inferred from Reviewed Computational Analysis (RCA)
• ….
• Traceable Author Statement (TAS)
• Non-traceable Author Statement (NAS)
• Inferred by Curator (IC)
• No Biological Data available (ND)
EXP
• Inferred from Direct Assay (IDA)
• Inferred from Physical Interaction (IPI)
• Inferred from Mutant Phenotype (IMP)
• Inferred from Genetic Interaction (IGI)
• Inferred from Expression Pattern (IEP)
http://geneontology.org/page/guide-go-evidence-codes
ND (No Biological Data)
Genes that have no associated data from papers are annotated to the root ontology term (e.g. molecular function) with an evidence code of ND.
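As a rough sketch, the grouping of evidence codes on this slide can be written as a lookup. The code sets below contain only the codes listed on the slide (the slide's "…." indicates there are others), and the choice to report ND as its own category, as the later charts do, is an assumption of this example.

```python
# Evidence codes as listed on the slide; the slide's list is not exhaustive.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}
NON_EXPERIMENTAL = {
    "IEA", "ISS", "IGC", "IBA", "IBD", "ISO", "IKR", "IRD", "RCA",
    "TAS", "NAS", "IC",
}
NO_DATA = {"ND"}


def evidence_category(code):
    """Map a GO evidence code to 'EXP', 'Non-EXP' or 'ND'."""
    code = code.strip().upper()
    if code in EXPERIMENTAL:
        return "EXP"
    if code in NO_DATA:
        return "ND"
    if code in NON_EXPERIMENTAL:
        return "Non-EXP"
    # Codes not shown on the slide are rejected rather than guessed at.
    raise ValueError(f"evidence code not covered by this sketch: {code!r}")


for example_code in ("IDA", "IBA", "ND"):
    print(example_code, "->", evidence_category(example_code))
```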
EXP (green) and Non EXP (blue) annotations for some key
model organisms and plants
Retrieved from AmiGO 7/16/2019
[Figure: bar chart of number of annotations (0 to 600,000) by species; series: EXP, Non-EXP]
The number and type of GO annotations changes
[Figure: number of annotations (0 to 300,000) by year, 2002 to 2019; series: #EXP, #ND, NON-EXP/ND]
Number of Annotated Genes by Evidence Type
[Figure: number of annotated genes (0 to 30,000) by year, 2010 to 2019; series: EXP, NON EXP, ND]
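Arabidopsis Genome (proteome) Snapshot
[Figure: three pie charts labelled Molecular Function, Biological Process and Cellular Component; categories: Known: EXP, Known: Other, Unannotated, Unknown. Percentages as extracted, one group per chart: 29%, 31%, 22%, 18%; 24%, 34%, 22%, 20%; 35%, 57%, 3%, 5%]
Data from TAIR 6/2019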
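One way to read the four categories in the snapshot is as a per-gene, per-aspect status derived from evidence codes. The sketch below is an interpretation, not TAIR's stated procedure: it assumes a gene is "Known: EXP" if it has any experimental annotation in the aspect, "Known: Other" if it has only non-experimental, non-ND annotations, "Unknown" if it has only ND annotations, and "Unannotated" if it has none. The gene names are hypothetical.

```python
from collections import Counter

EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def gene_status(codes):
    """Classify one gene in one GO aspect from its evidence codes
    (assumed category definitions; see the lead-in)."""
    codes = set(codes)
    if not codes:
        return "Unannotated"
    if codes & EXPERIMENTAL:
        return "Known: EXP"
    if codes - {"ND"}:
        return "Known: Other"
    return "Unknown"


def snapshot(codes_by_gene):
    """Count genes per status for one aspect."""
    return Counter(gene_status(codes) for codes in codes_by_gene.values())


# Hypothetical genes and their evidence codes for a single aspect.
example = {
    "geneA": ["IMP", "IEA"],
    "geneB": ["IBA"],
    "geneC": ["ND"],
    "geneD": [],
}
print(snapshot(example))
```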
Genes with experimental annotations to all 3 GO aspects
[Figure: Venn diagram of genes with experimental annotations in MF, BP and CC]
Data from TAIR 7/22/2019
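The centre of such a Venn diagram is a three-way set intersection. The following is a minimal sketch, assuming annotations are available as (gene, aspect, evidence code) tuples with aspects labelled "MF", "BP" and "CC"; that input layout and the gene names are assumptions of the example, not a TAIR file format.

```python
from collections import defaultdict

EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}


def genes_with_exp_in_all_aspects(annotations):
    """Return genes that have at least one experimental annotation
    in each of the three GO aspects."""
    by_aspect = defaultdict(set)
    for gene, aspect, code in annotations:
        if code in EXPERIMENTAL:
            by_aspect[aspect].add(gene)
    return by_aspect["MF"] & by_aspect["BP"] & by_aspect["CC"]


# Hypothetical annotations.
rows = [
    ("geneA", "MF", "IDA"), ("geneA", "BP", "IMP"), ("geneA", "CC", "IDA"),
    ("geneB", "MF", "IDA"), ("geneB", "BP", "IEA"), ("geneB", "CC", "IDA"),
]
print(sorted(genes_with_exp_in_all_aspects(rows)))
```

In this toy input only geneA qualifies, because geneB's biological process annotation is electronic rather than experimental.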
Is this really a reflection of what is ‘known’?
• We have not comprehensively curated the literature
• TAIR curation triage focus is on ‘unknowns’
• There is a backlog of papers
• Function is known but data are ‘missing’ or unavailable
• Unpublished data
• Published data that is not linked to genes
~1000 proteins have no GO annotations at all
Source: TAIR 6/2019
Why are they unknown?
• Non-biological reasons related to publication (previously noted)
• Biology/tractability
• Duplication/redundancy/orphans
• Difficult to detect, low-level expression
• No existing mutants
• Plant-specific (i.e. no phylogenetic inference from well-studied non-plant species)
• Difficulty inferring function/process via computation, no EXP baseline
A strategy for curating Arabidopsis genes with no known functions
[Flowchart; boxes as extracted]
• Start: Arabidopsis genes with unknown/missing functions
• Decision: Are any associated with papers with curatable data? (manually review) (Yes/No)
• Action: Curate and add GO terms
• Action: Find orthologs
• Decision: Are there orthologs with associated EXP functions? (Yes/No)
• Action: Make ND annotation if one does not exist
• Gene sets: Conserved Unknowns; Genes with known functions (ISS/IBA); Genes with known functions (EXP)
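The curation strategy can be sketched as a small decision function. The arrows of the flowchart did not survive extraction, so the branch order here is an assumed reading: curatable papers lead to curation; otherwise orthologs are checked; if none have experimental functions, an ND annotation is made when missing. The outcome for the "orthologs with EXP functions" branch is a guess based on the "ISS/IBA" label on the slide, and the function and argument names are hypothetical.

```python
def triage(has_curatable_papers, has_orthologs_with_exp, has_nd_annotation):
    """Suggest a next curation step for a gene with unknown/missing function
    (assumed reading of the flowchart; see the lead-in)."""
    if has_curatable_papers:
        return "curate and add GO terms"
    if has_orthologs_with_exp:
        # Assumption: function is inferred from the orthologs (ISS/IBA-style).
        return "infer function from orthologs"
    if not has_nd_annotation:
        return "make ND annotation"
    return "keep existing ND annotation"


print(triage(False, False, False))
```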
So, where are we now
• Annotation landscape is still changing
• Effect on downstream analysis
• 26,043 protein-coding genes with at least 1 annotation (94%); 13,033 (50%) have experimental evidence
• 3,756 experimentally annotated proteins have data for all 3 GO aspects (~14% of all proteins)
• Rhee and Mutwil reported 1,447 in 2014… progress!
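The percentages on this slide can be cross-checked with a few lines of arithmetic. The total number of protein-coding genes is not stated on the slide, so the sketch below derives an implied total from "26,043 = 94%"; because 94% is rounded, that total is only approximate. Whether the 50% figure is relative to annotated genes or to all genes is not stated, so both ratios are computed.

```python
annotated_genes = 26043      # genes with at least 1 annotation (from the slide)
annotated_share = 0.94       # stated share of all protein-coding genes
exp_genes = 13033            # genes with experimental evidence
all_three_aspects = 3756     # EXP data in all 3 GO aspects
reported_2014 = 1447         # figure attributed to Rhee and Mutwil, 2014

# Implied total; approximate because the 94% on the slide is rounded.
implied_total = annotated_genes / annotated_share

exp_of_annotated = exp_genes / annotated_genes
exp_of_total = exp_genes / implied_total
all_three_of_total = all_three_aspects / implied_total
fold_change = all_three_aspects / reported_2014

print(f"implied total genes: about {implied_total:.0f}")
print(f"EXP share of annotated genes: {exp_of_annotated:.1%}")
print(f"EXP share of implied total: {exp_of_total:.1%}")
print(f"all-3-aspects share of implied total: {all_three_of_total:.1%}")
print(f"change since 2014: {fold_change:.1f}x")
```

Under these assumptions the 50% figure matches the share of annotated genes, and the ~14% figure matches the share of the implied total.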
Filling in the gaps
• Increase capture of what is known from the Arabidopsis literature (retroactive)
• Increase curator output and community input (TOAST)
• Gene-centric triage: review new papers since last annotated
• Automation/machine learning to extract from literature
• Curate as part of publication process (proactive)
• TOAST for Arabidopsis genes
• ASPB/Gramene curation concurrent with publication
• Arabidopsis Micropublications! (ask us about this)
• Curate experimental data from other plant species
• GOAT project @Phoenix: annotate any gene from any genome
• PhyloGenes (www.phylogenes.org) project: phylogenetic inference of function
• Other DB-based curation efforts (e.g. UniProt, MaizeGDB, SGN)
Thanks to….
Peifen
Erica
Leonore
Tanya
Shabari
Amina
Connie
Laura
Eva
TAIR
Community
Trilok
Xinggou
Swapnil
Qian
Efrain
Nic
Thu
Questions/Comments
• Visit us here: Booth 403
• Email us: curator@arabidopsis.org


Editor's Notes

  1. At TAIR, our curation focuses on capturing experimental data from the literature. This figure illustrates the types of data we capture from the papers we read and codify in TAIR so that the data are machine readable and computationally accessible. Since 2001 we have been capturing experimental data about function as annotations to Gene Ontology terms that describe the biological roles of genes within the cell and organism. We use the Plant Ontology to capture expression data, and we curate gene names and summaries. More recently we have been curating information about alleles and phenotypes. All of these represent information about gene function, but for the purposes of this presentation and analysis we use GO annotation as a proxy for what is known.
  2. GO is commonly used as a proxy for, and representation of, what is known about gene function. It is widely used as a tool for inferring gene function based on sequence or structural similarity, for generating hypotheses about gene function (such as term enrichment following RNA-seq to look for patterns), and for classifying sets of genes. The quality of inferences made using GO annotations depends on the quality of the annotations themselves.
  3. Evidence codes provide a metric for assessing the quality of an annotation, that is, of the assertion made between a GO term and a gene, by recording the evidence that supports it. They generally fall into two broad categories. Experimental annotations are assertions based on published, traceable experimental evidence and are considered the ‘gold standard’. Non-experimental or computational annotations can also be of high quality, such as phylogeny-based annotations or other supervised methods. A third type, neither experimental nor computational, comprises the evidence codes assigned by curators. These are used less frequently, except for one special case.
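The grouping described in this note can be sketched as a simple lookup. The three sets below follow the grouping on the evidence code slide, restricted to the codes listed there plus the general EXP code. The function name and the class labels are assumptions made for illustration, not TAIR's implementation.

```python
# Evidence code groups as presented on the slide (illustrative, not exhaustive).
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL = {"IEA", "ISS", "ISO", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"}
AUTHOR_OR_CURATOR = {"TAS", "NAS", "IC", "ND"}


def evidence_class(code):
    """Return the broad class for a GO evidence code (hypothetical helper)."""
    code = code.strip().upper()
    if code in EXPERIMENTAL:
        return "experimental"
    if code in COMPUTATIONAL:
        return "computational"
    if code in AUTHOR_OR_CURATOR:
        return "author_or_curator"
    return "unrecognized"
```

For example, `evidence_class("IDA")` returns `"experimental"` and `evidence_class("ND")` returns `"author_or_curator"`.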
  4. Graph showing the total number of annotations for Arabidopsis, several non-plant model organisms, and other annotated plant genomes. To the left are non-plant species that have been subject to manual curation by various MODs and projects. To the right are other plant species with GO annotations. It is very hard to see, but there are a small number of EXP annotations for all of these species; the majority of annotations for other plants are based on sequence similarity, phylogenetic inference, or structural features. These methods all fundamentally rely on having experimental data to set the baseline. Therefore, to improve coverage and accuracy for all plant genomes, we need to capture experimental data.
  5. This is a graph showing the total number of annotations over time, from 2002 to June of this year. We first started annotating with GO in 2001 with a small number of experimental annotations. The number of EXP annotations continues to increase, as does the number of non-EXP annotations. Note the increase in annotations in 2012 and the decrease a year later. This was largely due to a high-throughput RCA biological process dataset that was ultimately removed (affecting ~4793 loci).
  6. Graph covering the last 10 years. The top line represents the number of loci with at least one EXP annotation. The green line represents loci with at least one EXP annotation and the grey line represents loci with at least one ND annotation. These are not mutually exclusive categories: a locus is counted in every set for which it has an annotation, so it may appear in more than one class. The decrease in loci with unknown annotations in 2012 is due to removing transposable element genes and pseudogenes, and to the increase in known loci. That increase could be due to the increase in EXP annotations, but there is also an increase in predictions. 2012 also saw the addition of a large HTP dataset for cellular component. The red line indicates when TAIR began requiring subscriptions to fund biocuration. The inflection point could also be due to more EXP data becoming available as people use CRISPR or other technologies to assess the function of formerly unknown genes, or to better predictions leading to more non-EXP annotations. The large dataset that was removed affected a relatively small subset of loci.
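The "at least one annotation" counts in this note, and the fact that the classes are not mutually exclusive, can be sketched with sets. The input shape (pairs of locus and evidence code), the class labels, and the locus names in the usage note are hypothetical. This is an illustration of the counting logic, not the pipeline used to produce the graph.

```python
from collections import defaultdict

# Experimental evidence codes; anything else except ND is treated as non-EXP here.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def loci_by_class(annotations):
    """Group loci by annotation class from (locus, evidence_code) pairs.

    A locus is added to every class for which it has at least one
    annotation, so the resulting sets may overlap.
    """
    classes = defaultdict(set)
    for locus, code in annotations:
        if code == "ND":
            classes["unknown"].add(locus)
        elif code in EXPERIMENTAL:
            classes["exp"].add(locus)
        else:
            classes["non_exp"].add(locus)
    return classes
```

With toy input such as `[("LOCUS_A", "IDA"), ("LOCUS_A", "IEA")]`, `LOCUS_A` lands in both the `exp` and `non_exp` sets, which is why the lines on the graph cannot simply be summed.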
  7. Breaking the previous dataset down, and showing current status. The previous graphs were cumulative; from this one you can see that a lot of the non-EXP coverage of the genome is cellular component. We are looking specifically at the proteome now, because this is primarily what is used for transferring gene function and for gene set analysis, so the overall coverage percentages are higher. The categories are non-overlapping. Over half of the proteome has some kind of annotation for process or function, whether experimental or computationally inferred. The much higher number for cellular component probably reflects both the many more proteomics experiments that are out there and a greater ability to infer localization computationally.
  8. The major difference is due to the large number of genes with non-EXP component annotations, which results in a much higher overlap, so the percentage of genes included in the middle is greater. The other notable difference is the overlap between BP and MF annotations. It can be assumed that if you have a function for a gene then you can also annotate it to a process, so in theory any gene with an MF annotation should also have a BP annotation. A possible reason why the EXP overlap between MF and BP is not as high is that some of those MF EXP annotations are binding annotations, and protein or chemical binding does not intuitively lead to a function. Indeed, when I take those IDs and categorize or enrich for that subset, they are highly enriched for binding terms (477 protein binding, 277 nucleic acid binding).
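The overlaps discussed in this note can be computed with plain set operations. The sketch below assumes three sets of locus identifiers, one per GO aspect; the function names and region labels are hypothetical.

```python
def venn3_regions(bp, mf, cc):
    """Split three sets of loci into the seven disjoint overlap regions."""
    return {
        "bp_only": bp - mf - cc,
        "mf_only": mf - bp - cc,
        "cc_only": cc - bp - mf,
        "bp_mf": (bp & mf) - cc,
        "bp_cc": (bp & cc) - mf,
        "mf_cc": (mf & cc) - bp,
        "all_three": bp & mf & cc,
    }


def mf_without_bp(bp, mf):
    """Loci with a molecular function annotation but no biological process."""
    return mf - bp
```

The `mf_without_bp` set is the one worth inspecting for binding terms, since those are the loci where a function annotation did not lead to a process annotation.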
  9. A little over 1000 protein-coding genes have no annotations at all. 7614 are missing both process and function annotations.
  10. Flow for one strategy to fill in the function for unknowns. Numbers refer to the ‘complete unknown’ set (where there are no annotations at all). The focus is on unknowns in Arabidopsis that are conserved, since the goal is to be a reference for other plant species. We would prioritize genes based on 1) annotation status and 2) available literature. Of the set of all unknowns, 425 have papers that need to be reviewed for curatable information. Prioritize them if they have names. 175 unknowns are newly inserted from Araport 11, which may be the reason why they were not included in any gene families in PANTHER/PLAZA.
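The prioritization in this note can be sketched as a sort: unannotated genes first, then genes with literature to review, then genes that have names. The `Gene` record, its fields, and the tie-breaking order are assumptions for illustration; the note does not specify how TAIR implements the triage.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    locus: str             # locus identifier (hypothetical field names)
    annotation_count: int  # number of existing GO annotations
    paper_count: int       # papers linked to the gene awaiting review
    has_name: bool         # whether the gene has a name or symbol


def triage_order(genes):
    """Order genes for curation: unannotated, with papers, named first."""
    return sorted(
        genes,
        key=lambda g: (
            g.annotation_count > 0,  # False sorts first: no annotations yet
            g.paper_count == 0,      # False sorts first: has papers to review
            not g.has_name,          # False sorts first: gene has a name
            g.locus,                 # stable, readable tie-breaker
        ),
    )
```

Conservation across species, which the note also treats as a focus, could be added as another leading term in the sort key.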