bioinformatics enabling knowledge generation from agricultural omics data

AgBase:
bioinformatics enabling
knowledge generation from
agricultural omics data
Fiona McCarthy

Summary
 'omics' technologies: the 'data deluge'
 organising data: bioinformatics and
biocuration
 data sharing and analysis: bio-ontologies
 from data to knowledge
 making sense of agricultural data

Databases and Biological Data
 The number of databases has increased
 Sequence repositories: NCBI, EMBL, DDBJ
 Model Organism Databases (MODs)
 Specialist biological databases or 'knowledge
databases' (e.g. InterPro, interaction
databases, gene expression data)
 Need to connect information in different
databases
 Databases are increasing in size and
complexity

[Figure: two growth charts. Y axes: "No." (0 to 25,000) and "No. x 10^6" (0 to 18); x axes: years 1970 to 2005 and 2000 to 2009.]

Generating Biological Data
 Amount of biological data is increasing
exponentially
 Completed and ongoing genome
sequencing projects
 High throughput “omics” technologies
 New sequencing technologies
 Existing microarrays
 Proteomics

Biocomputing
 New technologies are moving 'omics' work
from large databases and consortia
into individual laboratories
 Managing this data:
 acquire
 store
 access
 analyze
 visualize
 share

NIH WORKING DEFINITION OF BIOINFORMATICS AND
COMPUTATIONAL BIOLOGY

Bioinformatics: Research, development, or application of
computational tools and approaches for expanding the use
of biological, medical, behavioral or health data, including
those to acquire, store, organize, archive, analyze, or
visualize such data.

Computational Biology: The development and application of
data-analytical and theoretical methods, mathematical
modeling and computational simulation techniques to the
study of biological, behavioral, and social systems.

Bioinformatics
 Managing data
 different file formats
 linking between different databases
 Adding value
 multiple levels of information from one 'omics'
data set
 re-analysis
 linking data sets
 Organizing
 annotating data
 biocuration - annotation

Annotation
 ANNOTATE: to denote or demarcate
 Genome annotation is the process of
attaching biological information to
genomic sequences. It consists of two
main steps:
1. identifying functional elements in the
genome: “structural annotation”
2. attaching biological information to these
elements: “functional annotation”

Community Annotation
 Researchers are the domain experts – but
relatively few contribute to annotation
 time
 'reward' & 'employer/funding agency recognition'
 training – easy to use tools, clear instructions
 Required submission
 Community annotation
 Groups with special interest do focused
annotation or ontology development
 As part of a meeting/conference or distributed
(e.g. wikis)
 Students!

Biocuration
 biocurators are biologists who are trained
to annotate biological data (using
database structures, bio-ontologies, etc).
 databases use biocuration to enhance
value of biological data
 “knowledge databases”
 but how to ensure data consistency
between databases?

What Are Ontologies?
“An ontology is a controlled vocabulary of well defined terms
with specified relationships between those terms, capable of
interpretation by both humans and computers.”
 Bio-ontologies are used to capture biological
information in a way that can be read by both
humans and computers
 annotate data in a consistent way
 allows data sharing across databases
 allows computational analysis of high-throughput
“omics” datasets
 Objects in an ontology (e.g. genes, cell types, tissue
types, stages of development) are well defined.

 The ontology shows how the objects relate to each
other
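
A minimal sketch of how one ontology term could be held in code, assuming Python dataclasses. The class name `Term`, its field names and the helper `parent_names` are illustrative only and are not part of any GO Consortium tool; the definitions are paraphrased placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str      # digital identifier, read by computers
    name: str         # short human-readable label
    definition: str   # description, read by humans
    is_a: list = field(default_factory=list)  # IDs of parent terms


# Two GO terms for illustration; definitions are paraphrased, not official text.
ontology = {
    "GO:0008150": Term(
        "GO:0008150",
        "biological_process",
        "Any process carried out by an organism (paraphrased).",
    ),
    "GO:0008152": Term(
        "GO:0008152",
        "metabolic process",
        "Chemical reactions and pathways in an organism (paraphrased).",
        is_a=["GO:0008150"],
    ),
}


def parent_names(term_id):
    """Follow the is_a relationships one step up and return the parent labels."""
    return [ontology[p].name for p in ontology[term_id].is_a]


print(parent_names("GO:0008152"))
```

The identifier is what databases exchange; the name and definition are what a biocurator reads; the `is_a` list is what lets software reason about how terms relate.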

Ontologies
[Figure: an ontology term, showing]
 relationships between terms
 digital identifier (for computers)
 description (for humans)

Gene Ontology version 1.1348 (27/07/2010):

32,091 terms, 99.3% defined

19,169 biological process
2,745 cellular component
8,736 molecular function

1,441 obsolete terms (not included in figures above)

Relationships: the True Path Rule
 Why are relationships between terms
important?
 TRUE PATH RULE: all attributes of
children must hold for all parents
 so if a protein is annotated to a term, it
must also be true for all the parent
terms
 this enables us to move up the ontology
structure from a granular term to a
broader term
Premise of many GO analysis tools
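
The True Path Rule can be sketched as annotation propagation: a gene product annotated to a granular term is also counted under every ancestor of that term. The toy ontology, term names and protein names below are hypothetical and chosen only to show the mechanics.

```python
# Hypothetical toy ontology: each term maps to its list of parent terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "lipid metabolism": ["metabolism"],
    "transport": ["root"],
    "lipid transport": ["transport"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand direct annotations so each gene product also carries all ancestor terms."""
    expanded = {}
    for gene_product, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene_product] = full
    return expanded


direct = {"proteinA": {"lipid metabolism"}, "proteinB": {"lipid transport"}}
expanded = propagate(direct)
print(sorted(expanded["proteinA"]))
```

After propagation, proteinA can be grouped under the broader term "metabolism" even though it was only annotated to "lipid metabolism".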

Genomic Annotation
Structural Annotation:
 Open reading frames (ORFs) predicted during
genome assembly
 predicted ORFs require experimental confirmation

Functional Annotation:
 annotation of gene products = Gene Ontology (GO)
annotation
 initially, predicted ORFs have no functional literature
and GO annotation relies on computational methods
(rapid)
 functional literature exists for many genes/proteins
prior to genome sequencing
Gene Ontology annotation does not rely on a
completed genome sequence

Genomic Annotation

 Structural annotation, including Sequence Ontology
 Functional annotation using Gene Ontology
 Nomenclature (species' genome nomenclature committees)
 Other annotations using other bio-ontologies, e.g. Anatomy Ontology

http://obo.sourceforge.net/

Gene Ontology
Plant Ontology
Sequence Ontology
Trait Ontology
Expression/Tissue Ontologies
Infectious Disease Ontology
Cell Ontology

Bio-ontology requirements
 bio-ontologies (Open Biomedical Ontologies)
 computational pipelines ('breadth')
 for computational annotations
 useful for gene products without published information
 manual biocuration ('depth')
 requires trained biocurators
 community annotation efforts
 each species has its own body of literature
 biocuration co-ordination
 MODs? Consortium? Community?
 biocuration prioritization
 co-ordination with existing databases, annotation and nomenclature
initiatives
 data updates

Gene Ontology (GO)
 de facto method for functional annotation
 Assigns functions based upon Biological
Process, Molecular Function, Cellular
Component
 Widely used for functional genomics (high
throughput)
 Many tools available for gene expression
analysis using GO

http://www.geneontology.org

Plant Ontology (PO)
 describes plant structures and growth and
developmental stages
 Currently used for Arabidopsis, maize, rice – more
being added (soybean, tomato, cotton, etc)
 Plant Structure: describes morphological and
anatomical structures representing organ, tissue and
cell types
 Growth and developmental stages: describes (i)
whole plant growth stages and (ii) plant structure
developmental stages

http://www.plantontology.org/

Use GO for...
1. Determining which classes of gene products
are over-represented or under-represented.
2. Grouping gene products.
3. Relating a protein's location to its function.
4. Focusing on particular biological pathways
and functions (hypothesis-testing).
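
Over-representation (point 1) is often assessed with a one-sided hypergeometric test. The slides do not name a specific test, so this is a sketch under that assumption; the function name `enrichment_p` and the example counts are hypothetical.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """P(X >= k) when n gene products are drawn from a background of N,
    of which K are annotated to the GO term of interest."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical example: 10,000 gene products on an array, 200 annotated to the
# term, 100 differentially expressed, 8 of those carrying the term.
p = enrichment_p(k=8, n=100, K=200, N=10000)
print(f"p = {p:.3g}")
```

In practice many GO terms are tested at once, so the resulting p-values would need a multiple-testing correction before any term is called over-represented.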

Ontologies               Pathways & Networks
GO Cellular Component    Pathway Studio 5.0
GO Biological Process    Ingenuity Pathway Analysis
GO Molecular Function    Cytoscape
BRENDA                   Interactome Databases

Functional Understanding

http://www.agbase.msstate.edu/

1. Provides structural annotation for
agriculturally important genomes
2. Provides functional annotation (GO)
3. Provides tools for functional modeling
4. Provides bioinformatics & modeling
support for research community

GO & PO: literature annotation for rice,
computational annotation for rice,
maize, sorghum, Brachypodium

1. Literature annotation for Agrobacterium
tumefaciens, Dickeya dadantii,
Magnaporthe grisea, Oomycetes
2. Computational annotation for
Pseudomonas syringae pv tomato,
Phytophthora spp and the nematode
Meloidogyne hapla.

Literature annotation for chicken,
cow, maize, cotton;
Computational annotation for
agricultural species & pathogens.

literature annotation for human;
computational annotation for
UniProtKB entries (237,201 taxa).

Comparing AgBase & EBI-GOA Annotations
[Figure: stacked bar chart. Y axis: gene products annotated (0 to 14,000). X axis: project (AgBase chick, EBI-GOA chick, AgBase cow, EBI-GOA cow). Series: computational; manual - sequence; manual - literature.]
AgBase is complementary to EBI-GOA: it covers GenBank proteins not represented in UniProt and EST sequences on arrays.

Contribution to GO Literature Biocuration
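[Figure: pie charts of contributions to GO literature biocuration for chicken and cow. Contributors shown: AgBase, EBI GOA, EBI-IntAct, Roslin, HGNC, UCL-Heart project, MGI, Reactome. Labelled values: chicken 97.82% and < 0.50%; cow 88.78% and < 1.50%.]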

AgBase Quality Checks & Releases
[Figure: flow diagram. Elements shown: AgBase biocurators; AgBase biocuration interface; AgBase database; EBI GOA Project; UniProt db; QuickGO browser; GO Consortium database; public databases; AmiGO browser; GO analysis tools; microarray developers. 'Sanity' checks and GOC QC sit between the stages; feedback arrows are labelled "Quality improvement" and "Microarray annotations".]
'Sanity' check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.
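
A 'sanity' check of this kind could look roughly like the sketch below. It is not the AgBase implementation: the required field names, the function `sanity_check` and the GO IDs are all hypothetical placeholders used to show the two checks named on the slide (information captured, no obsolete GO:IDs).

```python
# Hypothetical placeholders: these are not real GO identifiers.
OBSOLETE_IDS = {"GO:9999998"}

# Assumed set of fields an annotation record must carry.
REQUIRED_FIELDS = ("gene_product", "go_id", "evidence", "reference")


def sanity_check(record):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = []
    for field_name in REQUIRED_FIELDS:
        if not record.get(field_name):
            problems.append(f"missing {field_name}")
    go_id = record.get("go_id")
    if go_id in OBSOLETE_IDS:
        problems.append(f"obsolete term {go_id}")
    return problems


records = [
    {"gene_product": "P1", "go_id": "GO:9999999", "evidence": "IDA", "reference": "PMID:0"},
    {"gene_product": "P2", "go_id": "GO:9999998", "evidence": "IDA", "reference": "PMID:0"},
    {"gene_product": "P3", "go_id": "GO:9999999", "evidence": "", "reference": "PMID:0"},
]
for record in records:
    print(record["gene_product"], sanity_check(record))
```

Running the same check at each hand-off (before the AgBase database, before EBI GOA, before the GO Consortium) keeps a record with an obsolete term from reaching public browsers and analysis tools.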

IITA Crops
 cowpea – “reduced representation” sequencing
underway
 soybean - preliminary assembly
 banana - sequencing in progress
 yam - genome sequencing for Dioscorea alata
– EST development (IITA & VSU)
 cassava - genome sequencing in progress
 maize - genome sequencing completed; other
subspecies being sequenced

Cowpea
 54,123 genome sequences
 187,483 ESTs
 Annotated via homology to Arabidopsis &
other plants
 GO annotation via homology – availability?

Soybean
 NCBI: 1,459,639 ESTs, 34,946 proteins,
2,882 genes
 UniProt: 12,837 proteins (EBI GOA
automatic GO annotation)
 UniGene assemblies available
 multiple microarrays available

Banana

 7,102 genome sequences
 14,864 ESTs
 1,399 NCBI proteins; 680 UniProt
 Musa acuminata (sweet banana): 3,898
GO annotations to 491 proteins
 Musa acuminata AAA Group (Cavendish
banana): 579 annotations to 96 proteins

Plantain
 Musa ABB Group (taxon:214693) -
cooking banana or plantain
 11,070 ESTs, 112 proteins
 173 GO annotations to 53 proteins
 functional genomics based on banana?

Yams
Taxon ID   Species                 Common name
55577      Dioscorea rotundata     white yam
55571      Dioscorea alata         water yam
29710      Dioscorea cayenensis    yellow yam

 Dioscorea (taxon:4672) & subspecies
 NCBI: 31 ESTs, 623 proteins
 Genome sequencing for Dioscorea alata – EST
development (IITA & VSU)
 183 GO annotations to 25 proteins

Cassava
 ESTs: 80,631
 NCBI proteins: 568, UniProt: 253
 2,251 GO annotations assigned to 218 proteins
 2 Euphorbia esula (leafy spurge) /cassava arrays

Maize
 Zea mays (taxon:4577)
 Genome sequencing completed by
Washington University – other subspecies
being sequenced
 Active GO annotation project - 131,925
GO annotations to 20,288 proteins
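
The crop figures quoted on the preceding slides can be compared with a quick calculation of GO annotations per annotated protein. The counts are taken from the slides; the ratio is just one possible summary and the dictionary labels are illustrative.

```python
# (GO annotations, annotated proteins) as quoted on the crop slides.
counts = {
    "banana (Musa acuminata)": (3898, 491),
    "Cavendish banana": (579, 96),
    "plantain": (173, 53),
    "yam": (183, 25),
    "cassava": (2251, 218),
    "maize": (131925, 20288),
}

per_protein = {crop: ann / prot for crop, (ann, prot) in counts.items()}

# Print crops from most to fewest annotations per annotated protein.
for crop, ratio in sorted(per_protein.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{crop}: {ratio:.2f} GO annotations per annotated protein")
```

The ratio says nothing about coverage: maize has a similar ratio to the other crops but far more annotated proteins, so the number of annotated proteins should be read alongside it.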

AgBase Collaborative Model
 How can we help you?
 Can make GO annotations public via the
GO Consortium
 Have computational pipelines to do rapid,
first pass GO annotation (including
transcript/EST sequences)
 Provide bioinformatics support for
collaborators
 Developing new tools
 Training/support for modeling data

Dr Teresia Buza

Dr Susan Bridges Cathy Grisham

Divya Pedinti Lakshmi Pillai

Philippe Chouvarine

Seval Ozkan Hui Wang
