This document discusses what is known about gene function annotation in Arabidopsis thaliana based on 20 years of curation efforts. Around 94% of protein coding genes have at least one annotation, but only 50% have experimental evidence annotations. Efforts are ongoing to fill gaps, including increasing literature curation, using automation/AI, and phylogenetic inference. Continued collaboration is needed to fully characterize gene functions in Arabidopsis.
2. What 20 years of gene function annotation reveals about the Arabidopsis genome
Leonore Reiser
Phoenix Bioinformatics and The Arabidopsis Information Resource
3. Outline
• What do we mean by gene function annotation
• Why use Gene Ontology annotations as proxy for what is known
• What do we know
• Changes to the annotation landscape over time
• Current status of the genome annotation
• What we don’t know
• How we can close the gap
4. Types of function information captured from the literature at TAIR (Arabidopsis Genes)
• Biological Role/Activity (Gene Ontology Annotations)
  • GO Biological Process
  • GO Molecular Function
  • GO Cellular Component
• Expression (Plant Ontology Annotations)
  • PO Structure
  • PO Developmental Stage
• Alleles and Phenotypes
• Nomenclature/Symbols and curated summaries
5. Practical applications of GO annotations
• Functional annotations based on sequence, structural similarity/orthology
  • e.g. InterProScan GO assignment based on domains
• Term Enrichment
  • Statistical analysis of gene lists (over/underrepresentation)
• Gene set, genome classification
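To make the term enrichment bullet concrete, here is a minimal sketch of an over-representation test using the hypergeometric distribution. It is one common way to do this, not necessarily the method used by any particular enrichment tool, and every number in the example is hypothetical.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Probability of seeing at least k term-annotated genes in a list of n genes,
    when K of the N background genes carry the term (hypergeometric upper tail)."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 5 of 20 genes in the list carry a GO term
# that is attached to 100 of 10,000 background genes.
p = enrichment_p_value(k=5, n=20, K=100, N=10_000)
print(f"over-representation p-value: {p:.3g}")
```

Because the background counts come from the annotation set itself, a change in annotations (for example a removed dataset) changes K and N and therefore the p-value.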
6. GO Evidence Codes
Non-EXP
• Inferred from Electronic Annotation (IEA)
• Inferred from Sequence or structural Similarity (ISS)
• Inferred from Genomic Context (IGC)
• Inferred from Biological aspect of Ancestor (IBA)
• Inferred from Biological aspect of Descendant (IBD)
• Inferred from Sequence Orthology (ISO)
• Inferred from Key Residues (IKR)
• Inferred from Rapid Divergence (IRD)
• Inferred from Reviewed Computational Analysis (RCA)
• ….
• Traceable Author Statement (TAS)
• Non-traceable Author Statement (NAS)
• Inferred by Curator (IC)
• No Biological Data available (ND)
EXP
• Inferred from Direct Assay (IDA)
• Inferred from Physical Interaction (IPI)
• Inferred from Mutant Phenotype (IMP)
• Inferred from Genetic Interaction (IGI)
• Inferred from Expression Pattern (IEP)
http://geneontology.org/page/guide-go-evidence-codes
ND (No Biological Data): Genes that have no associated data from papers are marked as annotations to the root ontology term (e.g. molecular function) with an evidence code of ND.
7. EXP (green) and Non EXP (blue) annotations for some key model organisms and plants
Retrieved from AmiGO 7/16/2019
[Bar chart: number of annotations (axis 0 to 600,000) by species, EXP vs. Non EXP]
12. Is this really a reflection of what is ‘known’?
• We have not comprehensively curated the literature
  • TAIR curation triage focus is on ‘unknowns’
  • There is a backlog of papers
• Function is known but ‘missing’ or unavailable data
  • Unpublished data
  • Published data that is not linked to genes
14. Why are they unknown?
• Non biological reasons related to publication (previously noted)
• Biology/tractability
• Duplication/redundancy/orphans
• Difficult to detect, low level expression
• No existing mutants
• Plant specific (i.e. no phylogenetic inference from well studied non plant species)
• Difficulty inferring function/process via computation, no EXP baseline
15. A strategy for curating Arabidopsis genes with no known functions
[Flowchart; box and branch labels as extracted]
• Start: Arabidopsis genes with unknown/missing functions
• Find orthologs
• Decision: Are any associated to papers with curatable data? (manually review)
  • Yes: Curate and add GO terms
  • No: Make ND annotation if one does not exist
• Decision: Are there orthologs with associated exp functions? (Yes/No)
• Outcomes: Conserved Unknowns; Genes with known functions (ISS/IBA); Genes with known functions (EXP)
16. So, where are we now?
• Annotation landscape is still changing
• Effect on downstream analysis
• 26,043 protein coding genes with at least 1 annotation (94%), 13,033 (50%) have experimental evidence
• 3,756 experimentally annotated proteins have data for all 3 GO aspects (~14% of all proteins)
• Rhee and Mutwil reported 1,447 in 2014…progress!
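The percentages on this slide use different denominators, which a short worked calculation makes visible. This is a sketch only: the total number of protein coding genes is not given on the slide, so it is estimated here from the rounded 94% figure and is an assumption.

```python
# Counts taken from the slide.
annotated = 26_043           # protein coding genes with at least 1 annotation
experimental = 13_033        # of those, genes with experimental evidence
all_three_aspects = 3_756    # experimentally annotated for all 3 GO aspects
reported_2014 = 1_447        # figure attributed to Rhee and Mutwil (2014)

# Assumption: back-calculate the total from the rounded 94%, so this is approximate.
estimated_total = annotated / 0.94

pct_exp_of_annotated = 100 * experimental / annotated
pct_exp_of_total = 100 * experimental / estimated_total
pct_all_three_of_total = 100 * all_three_aspects / estimated_total
fold_change_since_2014 = all_three_aspects / reported_2014

print(f"estimated total protein coding genes: {estimated_total:.0f}")
print(f"EXP as % of annotated genes: {pct_exp_of_annotated:.1f}")
print(f"EXP as % of estimated total: {pct_exp_of_total:.1f}")
print(f"all 3 aspects as % of estimated total: {pct_all_three_of_total:.1f}")
print(f"fold change vs. 2014 figure: {fold_change_since_2014:.1f}")
```

Under this estimate, the 50% figure matches the share of annotated genes, while the ~14% figure matches the share of all proteins.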
17. Filling in the gaps
• Increase capture of what is known from the Arabidopsis literature (retroactive)
  • Increase curator output and community input (TOAST)
  • Gene centric triage: review new papers since last annotated
  • Automation/Machine Learning to extract from literature
• Curate as part of publication process (proactive)
  • TOAST for Arabidopsis genes
  • ASPB/Gramene curation concurrent with publication
  • Arabidopsis Micropublications! (ask us about this)
• Curate experimental data from other plant species
  • GOAT project @Phoenix: annotate any gene from any genome
  • PhyloGenes (www.phylogenes.org) project: phylogenetic inference of function
  • Other DB based curation efforts (e.g. UniProt, MaizeGDB, SGN)
At TAIR our curation focuses on capturing experimental data from the literature. This figure illustrates the types of data we capture from the papers we read and codify in TAIR to make the data machine readable and computationally accessible. Since 2001 we have been capturing experimental data about function in the form of annotations to Gene Ontology terms that describe the biological roles of genes within the cell and organism. We use the Plant Ontology to capture expression data, and we curate gene names and summaries. More recently we have been curating information about alleles and phenotypes. All of these represent information about gene function, but for the purposes of this presentation and analysis we are going to use GO annotation as a proxy for what is known.
GO is commonly used as a proxy for, and representation of, what is known about gene function.
It is widely used as a tool for inferring gene function based on sequence/structural similarity.
It is also used for generating hypotheses about gene function, such as term enrichment following RNA-seq to look for patterns, and for classifying sets of genes.
The quality of inferences made using GO annotations depends on the quality of the annotations themselves.
Evidence codes provide a metric for assessing the quality of an annotation or assertion. They generally fall into two broad categories.
Experimental annotations are assertions based on published, traceable experimental data and are considered the ‘gold standard’.
Non-experimental or computational annotations can also be of high quality, such as phylogeny-based annotations or other supervised methods. The evidence codes provide information about the evidence in support of the assertions made between a GO term and a gene.
Another type of non-EXP, non-computational evidence code is the kind assigned by curators. These are used less frequently, except for one special case.
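The grouping of evidence codes described above can be sketched as a small lookup. The three groups follow the codes named on the evidence code slide; the group labels, the function name and the locus identifiers are illustrative assumptions, and any code not shown on the slide is reported as "unlisted" rather than guessed.

```python
# Evidence-code groups as listed on the "GO Evidence Codes" slide.
EXP_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}
COMPUTATIONAL_CODES = {"IEA", "ISS", "IGC", "IBA", "IBD", "ISO", "IKR", "IRD", "RCA"}
CURATOR_CODES = {"TAS", "NAS", "IC", "ND"}


def evidence_category(code: str) -> str:
    """Map a GO evidence code to the broad category used in this talk."""
    code = code.strip().upper()
    if code in EXP_CODES:
        return "EXP"
    if code in COMPUTATIONAL_CODES:
        return "non-EXP (computational)"
    if code in CURATOR_CODES:
        return "non-EXP (curator-assigned)"
    return "unlisted"


# Hypothetical (locus, evidence code) pairs, for illustration only.
examples = [("LOCUS_A", "IMP"), ("LOCUS_B", "IBA"), ("LOCUS_C", "ND")]
for locus, code in examples:
    print(locus, code, evidence_category(code))
```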
Graph showing the total number of annotations for Arabidopsis, some other model organisms and other annotated plant genomes. To the left are non-plant species that have been subject to manual curation by various MODs and projects. To the right are other plant species with GO annotations. It is very hard to see, but there are a small number of EXP annotations for all of these species; the majority of annotations for other plants are based on sequence similarity, phylogenetic inference, or structural features. These methods all fundamentally rely on having experimental data to set the baseline. Therefore, to improve coverage and accuracy for all plant genomes, we need to capture experimental data.
This is a graph showing the total number of annotations over time from 2002 to June of this year. We first started annotating with GO in 2001 with a small number of experimental annotations. The number of EXP annotations continues to increase, as does the number of non-EXP annotations. Note the increase in annotations in 2012 and the decrease a year later. This was largely from a high-throughput RCA biological process dataset that was ultimately removed (affecting ~4,793 loci).
Graph from the last 10 years. The top line represents the number of loci with at least one EXP annotation. The green line represents loci with at least one EXP annotation and the grey line represents loci with at least one ND annotation. These are not mutually exclusive categories: a locus is included in each set for which it has an annotation, so it may exist in both classes. The decrease in loci with unknown annotations in 2012 is due to removing transposable element genes and pseudogenes. The increase in known loci could be due to the increase in EXP annotations, but there is also an increase in predictions. 2012 also saw the addition of a large HTP dataset for cellular component. The red line indicates when TAIR began requiring subscriptions to fund biocuration. The inflection point could also be due to more EXP data becoming available as people use CRISPR or other technologies to assess gene function for formerly unknown genes, or to better predictions leading to non-EXP annotations. The large dataset that was removed affected a relatively small subset of loci.
This breaks the previous dataset down at its current status. The previous graphs were cumulative; from this one you can see that a lot of the non-EXP coverage of the genome is cellular component. We are now looking specifically at the proteome, because this is primarily what is used for transferring gene function and for gene set analysis, so the overall coverage percentages are higher. The categories are non-overlapping. Over half of the proteome has some kind of annotation for process or function, either experimentally or computationally inferred. The much higher number for cellular component probably reflects both the many more proteomics experiments that are out there and a greater ability to infer localization computationally.
The major difference is due to the large number of genes with non-EXP component annotations, which results in a much higher overlap, so the percentage of genes included in the middle is greater.
The other notable difference is the overlap between BP and MF annotations. It can be assumed that if you have a function for a gene then you can also annotate to a process, so in theory any gene with an MF annotation should also have a BP annotation. A possible reason why the EXP overlap between MF and BP is not as high is that some of those MF EXP annotations are binding annotations, and protein or chemical binding does not intuitively lead to a function. Indeed, when I take those IDs and categorize or enrich for that subset, they are highly enriched for binding terms (477 protein binding, 277 nucleic acid binding).
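The overlap comparison described above amounts to set operations on loci grouped by GO aspect. The sketch below assumes annotations are available as (locus, aspect, evidence code) records, with aspect letters P, F and C for biological process, molecular function and cellular component; the records and locus identifiers are hypothetical, not TAIR data.

```python
from collections import defaultdict

EXP_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotation records: (locus, GO aspect, evidence code).
annotations = [
    ("LOCUS_1", "F", "IDA"),
    ("LOCUS_1", "P", "IMP"),
    ("LOCUS_1", "C", "IDA"),
    ("LOCUS_2", "F", "IPI"),  # e.g. a binding annotation with no EXP process annotation
    ("LOCUS_2", "C", "IEA"),
    ("LOCUS_3", "P", "IBA"),
]


def loci_by_aspect(records, exp_only=False):
    """Group loci into one set per GO aspect, optionally keeping only EXP evidence."""
    by_aspect = defaultdict(set)
    for locus, aspect, code in records:
        if exp_only and code not in EXP_CODES:
            continue
        by_aspect[aspect].add(locus)
    return by_aspect


exp = loci_by_aspect(annotations, exp_only=True)
all_three_exp = exp["P"] & exp["F"] & exp["C"]   # centre of the EXP Venn diagram
mf_without_bp = exp["F"] - exp["P"]              # EXP function but no EXP process

print("EXP in all three aspects:", sorted(all_three_exp))
print("EXP MF without EXP BP:", sorted(mf_without_bp))
```

The `mf_without_bp` set is the subset one would then categorize or test for enrichment, as described for the binding terms.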
A little over 1,000 protein coding genes have no annotations at all. 7,614 are missing both process and function annotations.
This is the flow for one strategy to fill in the function for unknowns. Numbers refer to the ‘complete unknown’ set, where there are no annotations at all. The focus is on unknowns in Arabidopsis that are conserved, since the goal is to be a reference for other plant species. We would prioritize genes based on 1) annotation status and 2) available literature. Of the set of all unknowns, 425 have papers that need to be reviewed for curatable information. We prioritize them if they have names. 175 unknowns are newly inserted from Araport11, which may be the reason why they were not included in any gene families in PANTHER/PLAZA.
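The triage flow can be sketched as a small decision function plus a sort key. This is an illustration under stated assumptions, not TAIR's actual pipeline: the field names, the wording of the steps, the order of the checks and the mapping of the ortholog branches to outcomes are all assumptions inferred from the flowchart labels and the notes above.

```python
from dataclasses import dataclass


@dataclass
class UnknownLocus:
    locus_id: str                    # hypothetical identifier
    has_name: bool                   # named genes are prioritized
    papers_to_review: int            # papers that may hold curatable data
    has_nd_annotation: bool          # an ND annotation already exists
    ortholog_has_exp_function: bool  # some ortholog has an EXP function


def next_steps(locus: UnknownLocus) -> list[str]:
    """Return the curation steps suggested for one unknown locus."""
    if locus.papers_to_review > 0:
        return ["manually review papers; curate and add GO terms if curatable"]
    steps = []
    if not locus.has_nd_annotation:
        steps.append("make ND annotation")
    if locus.ortholog_has_exp_function:
        steps.append("candidate for ISS/IBA inference from orthologs")
    else:
        steps.append("keep as conserved unknown")
    return steps


def priority_key(locus: UnknownLocus):
    """Sort named loci first, then loci with more papers to review."""
    return (not locus.has_name, -locus.papers_to_review)


loci = [
    UnknownLocus("LOCUS_X", False, 0, False, True),
    UnknownLocus("LOCUS_Y", True, 2, True, False),
    UnknownLocus("LOCUS_Z", False, 1, True, False),
]
for locus in sorted(loci, key=priority_key):
    print(locus.locus_id, next_steps(locus))
```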