bioinformatics enabling knowledge generation from agricultural omics data
Databases and Biological Data, Generating Biological Data, NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, What Are Ontologies

Published in: Technology, Education
  • 1. AgBase: bioinformatics enabling knowledge generation from agricultural omics data. Fiona McCarthy
  • 2. Summary. 'Omics' technologies: the 'data deluge'; organising data: bioinformatics and biocuration; data sharing and analysis: bio-ontologies; from data to knowledge: making sense of agricultural data
  • 3. Databases and Biological Data. The number of databases has increased: sequence repositories (NCBI, EMBL, DDBJ); Model Organism Databases (MODs); specialist biological databases or 'knowledge databases' (e.g. InterPro, interaction databases, gene expression data). Need to connect information in different databases. Databases are increasing in size and complexity.
  • 4. [Figure: two growth charts with y-axes "No." and "No. x 10^6" and year x-axes ('00 to '09 and 1970 to 2005)]
  • 5. Generating Biological Data. The amount of biological data is increasing exponentially: completed and ongoing genome sequencing projects; high-throughput "omics" technologies (new sequencing technologies, existing microarrays, proteomics).
  • 6. Biocomputing. Technologies enable 'omics' technologies to move from large databases/consortia into individual laboratories. Managing this data: acquire, store, access, analyze, visualize, share.
  • 7. NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY. Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
  • 8. Bioinformatics. Managing data: different file formats; linking between different databases. Adding value: multiple levels of information from one 'omics' data set; re-analysis; linking data sets. Organizing: annotating data; biocuration (annotation).
  • 9. Annotation. ANNOTATE: to denote or demarcate. Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps: (1) identifying functional elements in the genome ("structural annotation"); (2) attaching biological information to these elements ("functional annotation").
  • 10. Community Annotation. Researchers are the domain experts, but relatively few contribute to annotation: time; reward & employer/funding agency recognition; training (easy-to-use tools, clear instructions). Required submission. Community annotation: groups with special interest do focused annotation or ontology development; as part of a meeting/conference or distributed (e.g. wikis). Students!
  • 11. Biocuration. Biocurators are biologists who are trained to annotate biological data (using database structures, bio-ontologies, etc.). Databases use biocuration to enhance the value of biological data ("knowledge databases"). But how to ensure data consistency between databases?
  • 12. What Are Ontologies? "An ontology is a controlled vocabulary of well defined terms with specified relationships between those terms, capable of interpretation by both humans and computers." Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers: annotate data in a consistent way; allows data sharing across databases; allows computational analysis of high-throughput "omics" datasets. Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined. The ontology shows how the objects relate to each other.
  • 13. Ontologies: relationships between terms; digital identifier (computers); description (humans). Gene Ontology version 1.1348 (27/07/2010): 32,091 terms, 99.3% defined: 19,169 biological process; 2,745 cellular component; 8,736 molecular function; 1,441 obsolete terms (not included in figures above).
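Slide 13 names three ingredients of an ontology term: relationships to other terms, a digital identifier for computers, and a description for humans. A minimal sketch of such a record follows; the class name, field names, and `TOY:` identifiers are hypothetical and do not reproduce the actual GO schema. The final lines only total the term counts quoted on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OntologyTerm:
    """Hypothetical record for one ontology term (not the real GO schema)."""
    identifier: str   # digital identifier, read by computers
    name: str         # short label
    definition: str   # description, read by humans
    # Relationship type -> identifiers of related terms, e.g. {"is_a": [...]}
    relationships: Dict[str, List[str]] = field(default_factory=dict)


# Toy terms with made-up identifiers, for illustration only.
root = OntologyTerm("TOY:0000001", "process", "Any toy process.")
child = OntologyTerm(
    "TOY:0000002",
    "specific process",
    "A more granular toy process.",
    relationships={"is_a": [root.identifier]},
)

# Term counts quoted on the slide (Gene Ontology version 1.1348, 27/07/2010).
counts = {
    "biological process": 19_169,
    "cellular component": 2_745,
    "molecular function": 8_736,
    "obsolete": 1_441,
}
total_terms = sum(counts.values())
```

The four counts add up to the 32,091 terms quoted on the slide, so that total appears to include the obsolete terms.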
  • 14. Relationships: the True Path Rule. Why are relationships between terms important? TRUE PATH RULE: all attributes of children must hold for all parents, so if a protein is annotated to a term, the annotation must also be true for all the parent terms. This enables us to move up the ontology structure from a granular term to a broader term. Premise of many GO analysis tools.
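The True Path Rule on slide 14 can be illustrated by propagating each direct annotation to every ancestor term. The sketch below assumes a toy ontology stored as a child-to-parents mapping; the `TOY:` identifiers, the protein name, and the function names are hypothetical, and real GO tools may handle relationship types differently.

```python
from typing import Dict, Set

# Hypothetical toy ontology: each term maps to its direct parent terms.
PARENTS: Dict[str, Set[str]] = {
    "TOY:0004": {"TOY:0002", "TOY:0003"},  # granular term with two parents
    "TOY:0002": {"TOY:0001"},
    "TOY:0003": {"TOY:0001"},
    "TOY:0001": set(),                     # root term
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct: Dict[str, Set[str]],
              parents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Expand direct annotations so each protein also carries all ancestor terms."""
    expanded: Dict[str, Set[str]] = {}
    for protein, terms in direct.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded


# A protein annotated only to the most granular term...
annotations = propagate({"proteinA": {"TOY:0004"}}, PARENTS)
# ...now also carries TOY:0002, TOY:0003 and the root TOY:0001.
```

Moving from a granular term to a broader one is then a set lookup: a protein counts toward a broad term if that term is in its expanded set.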
  • 15. Genomic Annotation. Structural Annotation: open reading frames (ORFs) predicted during genome assembly; predicted ORFs require experimental confirmation. Functional Annotation: annotation of gene products = Gene Ontology (GO) annotation; initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid); functional literature exists for many genes/proteins prior to genome sequencing; Gene Ontology annotation does not rely on a completed genome sequence.
  • 16. Genomic Annotation. Structural annotation, including Sequence Ontology. Other annotations using other bio-ontologies, e.g. Anatomy Ontology. Nomenclature (species' genome nomenclature committees). Functional annotation using Gene Ontology.
  • 17. Gene Ontology; Plant Ontology; Sequence Ontology; Trait Ontology; Expression/Tissue Ontologies; Infectious Disease Ontology; Cell Ontology
  • 18. Bio-ontology requirements. Bio-ontologies (Open Biomedical Ontologies). Computational pipelines ('breadth'): for computational annotations; useful for gene products without published information. Manual biocuration ('depth'): requires trained biocurators; community annotation efforts; each species has its own body of literature. Biocuration co-ordination: MODs? Consortium? Community?; biocuration prioritization; co-ordination with existing databases, annotation, nomenclature initiatives; data updates.
  • 19. Gene Ontology (GO). De facto method for functional annotation. Assigns functions based upon Biological Process, Molecular Function, Cellular Component. Widely used for functional genomics (high throughput). Many tools available for gene expression analysis using GO.
  • 20. Plant Ontology (PO). Describes plant structures and growth and developmental stages. Currently used for Arabidopsis, maize, rice; more being added (soybean, tomato, cotton, etc.). Plant Structure: describes morphological and anatomical structures representing organ, tissue and cell types. Growth and developmental stages: describes (i) whole plant growth stages and (ii) plant structure developmental stages.
  • 21. Use GO for: (1) determining which classes of gene products are over-represented or under-represented; (2) grouping gene products; (3) relating a protein's location to its function; (4) focusing on particular biological pathways and functions (hypothesis-testing).
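Slide 21 does not say how over-representation is measured. One common choice, shown here only as an assumption, is a one-sided hypergeometric test: given a background of `N` gene products of which `K` carry a GO term, how likely is it to see at least `k` carriers in a study set of `n`? The function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population N, K annotated, n drawn).

    A small value suggests the GO term is over-represented in the study set.
    """
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when more items are requested than are available,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical example: 1,000 background gene products, 50 annotated to a term,
# 20 differentially expressed gene products, 5 of which carry the term.
p_value = hypergeom_upper_tail(k=5, N=1000, K=50, n=20)
```

Under-representation would use the lower tail, P(X <= k), in the same way. When many GO terms are tested at once, a multiple-testing correction is normally applied as well.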
  • 22. Ontologies: GO Cellular Component, GO Biological Process, GO Molecular Function, BRENDA. Pathways & Networks: Pathway Studio 5.0, Ingenuity Pathway Analyses, Cytoscape, Interactome Databases. Functional Understanding.
  • 23.
  • 24. (1) Provides structural annotation for agriculturally important genomes; (2) provides functional annotation (GO); (3) provides tools for functional modeling; (4) provides bioinformatics & modeling support for research community.
  • 25. Avian Gene Nomenclature
  • 26. GO & PO: literature annotation for rice; computational annotation for rice, maize, sorghum, Brachypodium. (1) Literature annotation for Agrobacterium tumefaciens, Dickeya dadantii, Magnaporthe grisea, Oomycetes; (2) computational annotation for Pseudomonas syringae pv tomato, Phytophthora spp. and the nematode Meloidogyne hapla. Literature annotation for chicken, cow, maize, cotton; computational annotation for agricultural species & pathogens. Literature annotation for human; computational annotation for UniProtKB entries (237,201 taxa).
  • 27. Comparing AgBase & EBI-GOA Annotations. [Bar chart: gene products annotated (0 to 14,000) per project (AgBase Chick, EBI-GOA Chick, AgBase Cow, EBI-GOA Cow), split into computational, manual - sequence, and manual - literature.] Complementary to EBI-GOA: GenBank proteins not represented in UniProt & EST sequences on arrays.
  • 28. Contribution to GO Literature Biocuration. [Pie charts; legend: AgBase, EBI GOA, EBI-IntAct, Roslin, HGNC, UCL-Heart project, MGI, Reactome.] Chicken: 97.82%, < 0.50%. Cow: 88.78%, < 1.50%.
  • 29. AgBase Quality Checks & Releases. [Flow diagram: AgBase biocurators, AgBase biocuration interface, AgBase database, EBI GOA Project, GO Consortium database, with 'sanity' checks and GOC QC between stages; outputs include UniProt db, QuickGO browser, AmiGO browser, public databases, GO analysis tools, microarray developers.] 'Sanity' check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.
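Slide 29 describes the 'sanity' check only in outline: all appropriate information is captured and no obsolete GO:IDs are used. The sketch below shows what such a check might look like and is not AgBase's actual implementation; the required field names, the record layout, and the placeholder obsolete ID are assumptions made for illustration.

```python
import re
from typing import Dict, List, Set

# GO identifiers are written as "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")

# Hypothetical set of fields an annotation record must fill in.
REQUIRED_FIELDS = ("gene_product", "go_id", "evidence_code", "reference")


def sanity_check(annotation: Dict[str, str], obsolete_ids: Set[str]) -> List[str]:
    """Return a list of problems found in one annotation record (empty if none)."""
    problems: List[str] = []
    for name in REQUIRED_FIELDS:
        if not annotation.get(name):
            problems.append(f"missing field: {name}")
    go_id = annotation.get("go_id") or ""
    if go_id:
        if not GO_ID_PATTERN.match(go_id):
            problems.append(f"malformed GO ID: {go_id}")
        elif go_id in obsolete_ids:
            problems.append(f"obsolete GO ID: {go_id}")
    return problems


# Placeholder obsolete ID and record, both invented for this example.
OBSOLETE = {"GO:9999999"}
record = {
    "gene_product": "example_protein",
    "go_id": "GO:9999999",
    "evidence_code": "IEA",
    "reference": "",
}
issues = sanity_check(record, OBSOLETE)
```

Running the same function at each hand-off in the release chain would let a record with a non-empty `issues` list be held back before it reaches downstream databases.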
  • 30. Quality improvement Microarray annotations
  • 31. IITA Crops. Cowpea: "reduced representation" sequencing underway; soybean: preliminary assembly; banana: sequencing in progress; yam: genome sequencing for Dioscorea alata, EST development (IITA & VSU); cassava: genome sequencing in progress; maize: genome sequencing completed, other subspecies being sequenced.
  • 32. Cowpea: 54,123 genome sequences; 187,483 ESTs; annotated via homology to Arabidopsis & other plants; GO annotation via homology: availability?
  • 33. Soybean. NCBI: 1,459,639 ESTs, 34,946 proteins, 2,882 genes. UniProt: 12,837 proteins (EBI GOA automatic GO annotation). UniGene assemblies available. Multiple microarrays available.
  • 34. Banana: 7,102 genome sequences; 14,864 ESTs; 1,399 NCBI proteins; 680 UniProt. Musa acuminata (sweet banana): 3,898 GO annotations to 491 proteins. Musa acuminata AAA Group (Cavendish banana): 579 annotations to 96 proteins.
  • 35. Plantain. Musa ABB Group (taxon:214693), cooking banana or plantain: 11,070 ESTs, 112 proteins; 173 GO annotations to 53 proteins. Functional genomics based on banana?
  • 36. Yams. 55577 Dioscorea rotundata (white yam); 55571 Dioscorea alata (water yam); 29710 Dioscorea cayenensis (yellow yam). Dioscorea (taxon:4672) & subspecies. NCBI: 31 ESTs, 623 proteins. Genome sequencing for Dioscorea alata; EST development (IITA & VSU). 183 GO annotations to 25 proteins.
  • 37. Cassava. ESTs: 80,631. NCBI proteins: 568; UniProt: 253. 2,251 GO annotations assigned to 218 proteins. 2 Euphorbia esula (leafy spurge)/cassava arrays.
  • 38. Maize. Zea mays (taxon:4577). Genome sequencing completed by Washington University; other subspecies being sequenced. Active GO annotation project: 131,925 GO annotations to 20,288 proteins.
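Slides 34 to 38 quote GO annotation totals alongside the number of annotated proteins for each crop. As a rough way to compare annotation depth, the sketch below divides one by the other; the counts are copied from the slides, while the function name and the use of this ratio as a metric are assumptions rather than anything the presentation proposes.

```python
from typing import Dict, Tuple

# (GO annotations, annotated proteins) as quoted on slides 34 to 38.
GO_COUNTS: Dict[str, Tuple[int, int]] = {
    "Musa acuminata (sweet banana)": (3_898, 491),
    "Musa acuminata AAA Group (Cavendish banana)": (579, 96),
    "Musa ABB Group (plantain)": (173, 53),
    "Dioscorea (yams)": (183, 25),
    "Cassava": (2_251, 218),
    "Zea mays (maize)": (131_925, 20_288),
}


def annotations_per_protein(annotations: int, proteins: int) -> float:
    """Mean number of GO annotations per annotated protein."""
    return annotations / proteins


for crop, (n_annotations, n_proteins) in GO_COUNTS.items():
    ratio = annotations_per_protein(n_annotations, n_proteins)
    print(f"{crop}: {ratio:.1f} annotations per protein")
```

The ratio says nothing about coverage: a crop can have many annotations per protein while only a small fraction of its proteins are annotated at all.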
  • 39. AgBase Collaborative Model. How can we help you? Can make GO annotations public via the GO Consortium. Have computational pipelines to do rapid, first-pass GO annotation (including transcript/EST sequences). Provide bioinformatics support for collaborators. Developing new tools. Training/support for modeling data.
  • 40. Dr Teresia Buza, Dr Susan Bridges, Cathy Grisham, Divya Pedinti, Lakshmi Pillai, Philippe Chouvarine, Seval Ozkan, Hui Wang