Database talk for Bits & Bites meeting

Jill Wegrzyn
Department of Plant Sciences
University of California at Davis

Forest Genomics (Conifers)
• Phylogenetic Representation –
– None currently exists. The conifers (gymnosperms) are the oldest of the major
plant clades, arising some 300 million years ago. They are key to our
understanding of the origins of genetic diversity in higher plants.
• Ecological Representation –
– Conifers are of immense ecological importance, comprising the dominant life
forms in most of the temperate and boreal ecosystems in the Northern
Hemisphere.
• Fundamental Genetic Information –
– Reference sequences are the fundamental data necessary to understand
conifer biology and aid in guiding management of genetic resources.
• Development of Genomic Technologies –
– The analytical and computational challenge of building a reference sequence
for such large genomes will drive development of tools, strategies, and human
resources throughout the genomics community.

Existing and Planned Angiosperm Tree Genome Sequences

Species                   Common Name        Genome Size (1)   Number of Genes (2)   Status (3)

In Progress With Draft Assemblies
Populus trichocarpa       Black Cottonwood   500 Mbp           ~40,000               2.0 / 2.2
Eucalyptus grandis        Rose Gum           691 Mbp           ~36,000               1.0 / 1.1
Malus domestica           Apple              881 Mbp           ~26,000               1.0 / 1.0
Prunus persica            Peach              227 Mbp           ~28,000               1.0 / 1.0
Citrus sinensis           Sweet Orange       319 Mbp           ~25,000               1.0 / 1.0
Carica papaya             Papaya             372 Mbp           -
Amborella trichopoda      Amborella          870 Mbp           -

In Progress Or Planned – No Published Assemblies
Castanea mollissima       Chinese Chestnut   800 Mbp           -
Salix purpurea            Purple Willow      327 Mbp           -
Quercus robur             Pedunculate Oak    740 Mbp           -
Populus spp and ecotypes  Various            various           -
Azadirachta indica        Neem               384 Mbp           -

1) Genome size: Approximate total size, not completely assembled.
2) Number of Genes: Approximate number of loci containing protein coding sequence.
3) Status: Assembly / Annotation versions; http://www.phytozome.net/ ; http://asgpb.mhpcc.hawaii.edu/papaya/ ; http://www.amborella.org ;
purple willow: http://www.poplar.ca/pdf/edomonton11smart.pdf ; neem: http://www.strandls.com/viewnews.php?param=5&param1=68

Plant Genome Size Comparisons
[Figure: bar chart of 1C DNA content (Mb); tick labels run from 0 to 40,000, with a second scale from 0 to 3,000. Species shown: Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pinus lambertiana, Pinus pinaster, Pinus taeda, Picea abies, Picea glauca, Pseudotsuga menziesii, Taxodium distichum.]

What can be discovered about a gene by
a database search?
• Best to have specific informational goals:
– Evolutionary information: homologous genes, taxonomic
distributions, allele frequencies, synteny, etc.
– Genomic information: chromosomal location, introns, UTRs,
regulatory regions, shared domains, etc.
– Structural information: associated protein structures, fold types,
structural domains
– Expression information: expression specific to particular tissues,
developmental stages, phenotypes, diseases, etc.
– Functional information: enzymatic/molecular function,
pathway/cellular role, localization, role in diseases

Using a database
• How to get information out of a database:
– Summaries: how many entries, average or extreme
values; rates of change, most recent entries, etc.
– Browsing: getting a sense of the kind and quality of
information available, e.g. checking familiar records
– Search: looking for specific, predefined information
• “Key” to searching a database:
– Must identify the element(s) of the database that are of
interest by some identifier:
• Gene name, symbol, location, or other identifying information.
• Sequences of genes, mRNAs, proteins, etc.
• A cross-reference from another database, or a database-generated ID.

NCBI and Entrez
• One of the most useful and comprehensive database
collections is the NCBI, part of the National Library of
Medicine.
– Home to GenBank, PubMed & many other familiar DBs.
• NCBI provides interesting summaries, browsers, and
search tools
• Entrez is their database search interface
http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, chromosomal location,
diseases, articles, keywords...
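As a rough illustration of how such a search can be scripted, the sketch below only builds a query URL for NCBI's E-utilities `esearch` endpoint and does not send it. The gene symbol, organism, and `retmax` value are hypothetical choices made for this example, not something taken from the talk.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base address of the NCBI E-utilities "esearch" endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for one Entrez database and one query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query: a gene symbol restricted to a single organism.
url = build_esearch_url("gene", "PAL1[Gene Name] AND Pinus taeda[Organism]")

# Decode the query string again to confirm the term was encoded intact.
query = parse_qs(urlparse(url).query)
print(url)
print(query["term"][0])
```

Fetching the URL (for example with `urllib.request.urlopen`) would return matching record IDs; that step is left out so the sketch runs without network access.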

Types of Databases

• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank (nr and nt), SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, Plant Protein, RefSNP, UniGene, NCBI Protein,
Structure, Conserved Domain

NCBI is not all there is...
• Links to non-NCBI databases (see also “Link Out”)
– Reactome for pathways (also KEGG)
– HGNC for nomenclature
– HPRD protein information
– Regulatory / binding site DBs (e.g. CREB; some not linked)
– IHOP (information hyperlinked over proteins)
• Other important gene/protein resources:
– UniProt (most carefully annotated)
– PDB (main macromolecular structure repository)
– UCSC (best genome viewer & many useful 'tracks')
– DIP / MINT (protein-protein interactions)
– More: InterPro, MetaCyc, Enzyme, etc.
– Species Databases: TAIR, Gramene, MGI, WormBase, FlyBase,
GDR, TreeGenes
• Alternatives
– SRA versus DNAnexus

Flat Files
Characteristics:
• Data is stored as records in regular files
• Records usually have a simple structure and fixed
number of fields
• For fast access may support indexing of fields in the
records
• No mechanisms for relating data between files
• One needs special programs in order to access and
manipulate the data
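A minimal sketch of the kind of special-purpose program a flat file needs, assuming a hypothetical tab-delimited layout with three fields per record (accession, name, genome size). The in-memory file and the field names are assumptions for illustration only.

```python
import io

# Hypothetical tab-delimited flat file: one record per line, fixed fields.
FLAT_FILE = io.StringIO(
    "NC_000913\tEscherichia coli K12\t4640000\n"
    "NC_003098\tStreptococcus pneumoniae R6\t2040000\n"
)
FIELDS = ("accession", "name", "size")


def read_records(handle):
    """Parse each line into a dict keyed by field name."""
    for line in handle:
        values = line.rstrip("\n").split("\t")
        yield dict(zip(FIELDS, values))


def build_index(records, field):
    """Index records by one field so a lookup avoids scanning every line."""
    return {record[field]: record for record in records}


index = build_index(read_records(FLAT_FILE), "accession")
print(index["NC_003098"]["name"])
```

The index has to be rebuilt by the program whenever the file changes, and nothing here relates these records to records in any other file.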

Limitations of Flat Files
• Most applications require that specific
information can be quickly and efficiently
retrieved
• Often critical that performance does not
degrade as more entities are added
• Flat text files don’t always fulfill these
requirements, especially when there are many
entities and/or relationships

Relational Database
Characteristics:
• Data is organized into tables: rows & columns
• Each row represents an instance of an entity
• Each column represents an attribute of an entity
• Metadata describes each table column
• Relationships between entities are represented by
values stored in the columns of the corresponding
tables (keys)
• Accessible through Structured Query Language (SQL)

Metadata & Data Table

Organism (column metadata)
Name       Type          Max Length  Description
Name       Alphanumeric  100         Organism name
Size       Integer       10          Genome length (bases)
Gc         Float         5           Percent GC
Accession  Alphanumeric  10          Accession number
Release    Date          8           Release date
Center     Alphanumeric  100         Genome center name
Sequence   Alphanumeric  Variable    Sequence

Organism (data)
Name                         Size       Gc  Accession  Release     Center                 Sequence
Escherichia coli K12         4,640,000  50  NC_000913  09/05/1997  Univ. Wisconsin        AGCTTTTCATT…
Streptococcus pneumoniae R6  2,040,000  40  NC_003098  09/07/2001  Eli Lilly and Company  TTGAAAGAAAA…
…
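The Organism table can be sketched with Python's built-in `sqlite3` module. This is one possible mapping, not the schema used in the talk: the SQLite column types are assumptions, the dates are stored as plain text, and the sequences are only the truncated prefixes shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names follow the slide; the SQLite types are an assumed mapping.
conn.execute(
    """
    CREATE TABLE Organism (
        Name      TEXT,
        Size      INTEGER,
        Gc        REAL,
        Accession TEXT PRIMARY KEY,
        "Release" TEXT,
        Center    TEXT,
        Sequence  TEXT
    )
    """
)
rows = [
    ("Escherichia coli K12", 4640000, 50.0, "NC_000913",
     "09/05/1997", "Univ. Wisconsin", "AGCTTTTCATT"),
    ("Streptococcus pneumoniae R6", 2040000, 40.0, "NC_003098",
     "09/07/2001", "Eli Lilly and Company", "TTGAAAGAAAA"),
]
conn.executemany("INSERT INTO Organism VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# The DBMS stores the column metadata and can report it back.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Organism)")]
print(columns)

# A query over the data: the organism with the largest genome.
largest = conn.execute(
    "SELECT Name FROM Organism ORDER BY Size DESC LIMIT 1"
).fetchone()[0]
print(largest)
```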

Relationships
• Used to connect tables
• Field(s) that have the same value in the related tables
• Organism.Accession=Gene.OAccession
• Organism.Accession
– Unique
– Primary key
• Gene.OAccession
– Not unique
– Foreign key
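The relationship Organism.Accession = Gene.OAccession can be sketched as a join in SQLite. The Gene table's other column and the gene names are hypothetical, added only so the join has something to count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the reference
conn.executescript(
    """
    CREATE TABLE Organism (
        Accession TEXT PRIMARY KEY,   -- unique: one row per organism
        Name      TEXT
    );
    CREATE TABLE Gene (
        GeneName   TEXT,              -- hypothetical column for this sketch
        OAccession TEXT REFERENCES Organism(Accession)  -- not unique
    );
    INSERT INTO Organism VALUES ('NC_000913', 'Escherichia coli K12');
    INSERT INTO Organism VALUES ('NC_003098', 'Streptococcus pneumoniae R6');
    INSERT INTO Gene VALUES ('geneA', 'NC_000913');
    INSERT INTO Gene VALUES ('geneB', 'NC_000913');
    INSERT INTO Gene VALUES ('geneC', 'NC_003098');
    """
)

# Join the two tables on the shared key value and count genes per organism.
query = """
    SELECT Organism.Name, COUNT(*)
    FROM Organism JOIN Gene ON Organism.Accession = Gene.OAccession
    GROUP BY Organism.Name
"""
genes_per_organism = dict(conn.execute(query).fetchall())
print(genes_per_organism)
```

Several Gene rows may carry the same OAccession value, while each Accession value appears exactly once in Organism.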

Schema: Representation of Table
Organization

SQL
• ANSI (American National Standards Institute)
standard computer language for accessing and
manipulating database systems.
• SQL statements are used to retrieve and
update data in a database.
• Includes:
– Data Manipulation Language (DML)
– Data Definition Language (DDL)

DBMS Advantages
• Program-data independence
• Minimal data redundancy
• Improved data consistency & quality
– Access control
– Transaction control
• Improved accessibility & data sharing
• Increased productivity of application development
• Enforced standards
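Transaction control is one of these advantages that is easy to demonstrate. The sketch below uses SQLite and a hypothetical Sample table: two changes are grouped into one transaction, the second fails, and the database undoes the first so the data stays consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sample (Id TEXT PRIMARY KEY, Status TEXT)")
conn.execute("INSERT INTO Sample VALUES ('S1', 'collected')")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises an exception.
    with conn:
        conn.execute("UPDATE Sample SET Status = 'extracted' WHERE Id = 'S1'")
        # This insert reuses an existing primary key, so it fails.
        conn.execute("INSERT INTO Sample VALUES ('S1', 'duplicate')")
except sqlite3.IntegrityError:
    pass

# The earlier UPDATE was rolled back together with the failed INSERT.
status = conn.execute("SELECT Status FROM Sample WHERE Id = 'S1'").fetchone()[0]
print(status)
```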

DBMS
• Software package for defining and managing a
database.
• Examples:
– Proprietary: MS Access, MS SQL Server, DB2,
Oracle, Sybase
– Open source: MySQL, PostgreSQL

TreeGenes Database
Encompasses Dendrome Resources, DendromePlone, TreeGenes Database & DiversiTree

• Nine modules to store and interrelate data for query and analysis in PostgreSQL
• Direct resource for nearly 2,500 forest geneticists representing 800 organizations
worldwide. Over 6,000 unique visitors in December 2011.
• Forest Geneticists Colleague module
• Literature module
• Transcriptome annotation pipeline and module
• Comparative map module
• Species module
• Sequencing module
• Primers module
• Genotype/EST module
• Phenotype/Expression module
• Sample tracking module

Genomic Resources
678 Species Representing 77 Genera

Generic Model Organism Database

CMAP: Obtaining TreeGenes (TG) Accession Number

(optional) Add additional map files

Obtain TG Accession number!

Add literature data and (first) map file

Individual features and their locations on map

List of features on map

GMOD Genome Browser

Search and
Select data source

Tracks can be reordered or hidden as necessary

Douglas-fir
Transcriptome Resources in TreeGenes

Gene Ontology
• Gene annotation system

• Controlled vocabulary that can be applied
to all organisms (protein/RNA)

• Used to describe gene products

[Figure: three images, each labeled "bud initiation", for Metazoa, Saccharomyces, and Viridiplantae]

What’s in a name?
• The same name can be used to describe
different concepts

What's in a name?
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism
• Gluconeogenesis

• All refer to the process of making glucose from
simpler components

How does GO work?
What information might we want to
capture about a gene product?

• What does the gene product do?
• Why does it perform these activities?
• Where does it act?

The 3 Gene Ontologies
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate
binding and ATPase activity

• Biological Process = biological goal or objective
– broad biological goals, such as mitosis or purine metabolism, that
are accomplished by ordered assemblies of molecular functions

• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes;
examples include nucleus, telomere, and RNA polymerase II
holoenzyme

Ontology Structure
Ontologies can be represented as graphs,
where the nodes are connected by edges

• Nodes = concepts in the ontology
• Edges = relationships between the concepts

[Figure: three nodes connected by edges]

Ontology Structure
• The Gene Ontology is structured as a
hierarchical directed acyclic graph (DAG)

• Terms can have more than one parent and
zero, one or more children

• Terms are linked by two relationships
– is-a
– part-of

True Path Rule

• The path from a child term all the way up to its
top-level parent(s) must always be true

cell
  cytoplasm
  chromosome
    nuclear chromosome
    cytoplasmic chromosome
      mitochondrial chromosome
  nucleus
    nuclear chromosome
(each link in the tree is either is-a or part-of)
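One practical consequence of the true path rule is that an annotation to a term also holds for every ancestor of that term. The sketch below encodes the cell example as a small DAG and propagates an annotation upward. The relationship type on each edge and the gene name are assumptions made for illustration, since the slide's legend does not survive in text form.

```python
# Each term maps to its (relationship, parent) pairs; a term may have
# more than one parent. Relationship types are assumed for this sketch.
PARENTS = {
    "cell": [],
    "cytoplasm": [("part-of", "cell")],
    "chromosome": [("part-of", "cell")],
    "nucleus": [("part-of", "cell")],
    "nuclear chromosome": [("is-a", "chromosome"), ("part-of", "nucleus")],
    "cytoplasmic chromosome": [("is-a", "chromosome")],
    "mitochondrial chromosome": [("is-a", "cytoplasmic chromosome")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent edges upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(annotations, parents=PARENTS):
    """Extend each gene's terms with all ancestors of those terms."""
    implied = {}
    for gene, terms in annotations.items():
        implied[gene] = set(terms)
        for term in terms:
            implied[gene] |= ancestors(term, parents)
    return implied


# Hypothetical annotation of one gene product to one term.
result = propagate({"geneX": {"nuclear chromosome"}})
print(sorted(result["geneX"]))
```

If any of the implied ancestor annotations would be false for a gene product, the ontology structure (not the annotation) is what needs fixing.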

What's in a GO term?
term: gluconeogenesis

id: GO:0006094

definition: The formation of glucose from
noncarbohydrate precursors, such as
pyruvate, amino acids and glycerol.

Source of Ontology Assignments
IEA  Inferred from Electronic Annotation
ISS  Inferred from Sequence Similarity
IEP  Inferred from Expression Pattern
IMP  Inferred from Mutant Phenotype
IGI  Inferred from Genetic Interaction
IPI  Inferred from Physical Interaction
IDA  Inferred from Direct Assay
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
NAS  Non-traceable Author Statement
IC   Inferred by Curator
ND   No biological Data available
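Evidence codes are often used to filter annotations, for example to keep only those backed by experiments. The sketch below assumes a simple (gene, GO id, evidence code) record layout; the gene names are hypothetical, and treating IDA, IMP, IGI, IPI, and IEP as the experimental group is a choice made for this example.

```python
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Codes treated as experimental in this sketch (an assumed grouping).
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP"}

# Hypothetical annotations: (gene, GO id, evidence code).
annotations = [
    ("geneA", "GO:0006094", "IDA"),
    ("geneB", "GO:0006094", "IEA"),
    ("geneC", "GO:0006094", "TAS"),
]


def with_evidence(records, allowed):
    """Keep only the records whose evidence code is in the allowed set."""
    return [record for record in records if record[2] in allowed]


experimental = with_evidence(annotations, EXPERIMENTAL)
for gene, term, code in experimental:
    print(gene, term, EVIDENCE_CODES[code])
```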

Ontology Development
Plant Ontology and Trait Ontology

• Plant Ontology
– Structure
• Needle, Cambium
– Growth stages
• Trait Ontology
– Forest Tree Specific Phenotypes
• Wood Density
• PATO
– Phenotypic Qualities

Current Ontology Listings:
OBO Foundry

Interwebs 101
• Web 1.0 – Hyperlinks
• Web 2.0 – Interactivity, information sharing, user
centered design (wikis, blogs, social media)
• Web 3.0 – Semantic Web
– Data focused
– Answer the limitations of HTML
– HTML describes documents and the links between them.
RDF, OWL, and XML, by contrast, can describe specific
things
– Machine-readable data and relationships between the
data – knowledge processing – deductive reasoning and
inference
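To make "machine-readable data and relationships" concrete, the sketch below stores facts as (subject, predicate, object) triples, the shape RDF uses, and applies one inference rule. It uses plain Python tuples rather than an RDF library, and the predicate names `type` and `subClassOf` are simplified stand-ins chosen for this example.

```python
# Each fact is a (subject, predicate, object) triple.
TRIPLES = {
    ("Pinus taeda", "type", "Conifer"),
    ("Conifer", "subClassOf", "Gymnosperm"),
    ("Gymnosperm", "subClassOf", "Plant"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }


def infer_types(triples):
    """Rule: if X type C and C subClassOf D, then X type D.

    The rule is applied repeatedly until no new triples appear.
    """
    known = set(triples)
    while True:
        new = {
            (x, "type", d)
            for (x, _p, c) in match(known, predicate="type")
            for (_s, _q, d) in match(known, subject=c, predicate="subClassOf")
        } - known
        if not new:
            return known
        known |= new


inferred = infer_types(TRIPLES)
types = sorted(o for (_s, _p, o) in match(inferred, subject="Pinus taeda", predicate="type"))
print(types)
```

Only the first triple states a type for Pinus taeda; the other two types are deduced by the rule, which is the kind of reasoning plain HTML links cannot support.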

Web Services Development
Communication within TreeGenes

• Development of Web Services in cooperation with
NSF’s iPlant Cyberinfrastructure Project
– Software system to support interoperable machine-to-
machine interaction over a network regardless of platform
incompatibilities
– Web Services Description Language (WSDL) is used to
describe the operations a service offers
• Service Oriented Architecture (SOA)
– With SOAP, the basic unit of communication is a message
• Remote Procedure Call (RPC)
– RPC Web services define a call interface in which the basic unit is
the WSDL operation
• Representational State Transfer (REST)
– REST uses HTTP by constraining the interface to standard operations
(like GET, POST, PUT, DELETE for HTTP). The focus is on interacting
with stateful resources, rather than messages or operations.
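The REST style can be sketched by building (but not sending) a few HTTP requests against a single resource. The base URL, resource id, and JSON body are hypothetical and do not describe any TreeGenes or iPlant service.

```python
from urllib.request import Request

BASE_URL = "https://example.org/api/trees"  # hypothetical endpoint


def rest_request(method, resource_id=None, body=None):
    """Build an HTTP request for one resource using a standard HTTP method."""
    url = BASE_URL if resource_id is None else f"{BASE_URL}/{resource_id}"
    data = body.encode("utf-8") if body is not None else None
    return Request(url, data=data, method=method)


# The same resource URL is reused; only the standard operation changes.
calls = [
    rest_request("GET", "42"),
    rest_request("PUT", "42", '{"species": "Pinus taeda"}'),
    rest_request("DELETE", "42"),
]
for call in calls:
    print(call.get_method(), call.full_url)
```

In an RPC design each of these actions would instead be a separately named operation, whereas here the interface is limited to the HTTP methods themselves.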

SSWAP Ontology
Creating and Contributing to Existing Servlets for Common Genomic Types

Forest Tree Genetic Stock Center

Bulk Retrieval Window Components

Data & Annotation Selection Fields
Bulk Retrieval Window

TreeGenes Sample Tracking System
• Accurately track samples through collection, DNA
extraction, and genotyping
• Provide a standard and efficient method to collect
and store phenotypic data
• Provide a public interface to readily query raw
genotype, phenotype, and association results
(DiversiTree)
• Provide interfaces and database backend to
support a DNA distribution center (UCD)

Population Genetics
Association Studies, Landscape Genomics

• Currently no other repositories to target association data with geo-referenced data
• dbGAP
• Dryad
• Starting with enforcement at the journal level: Tree Genetics and Genomes

GenSAS development with Content Management
Plone and Drupal
login/signup panel
query sequence panel

data retrieval panel

tool selection panel

task queue panel

GenSAS development
Multiple Gene Prediction Tracks

overview track
control track

sequence track
evidence tracks

custom track
function track

message box

GenSAS integration with Gbrowse
Prototyped with Peach Genome in GDR

Analysis Resources
Custom Databases

Integrating Tools into TreeGenes
Galaxy

Fluxes of CO2 and H2O: FLUXNET and AmeriFlux

Free Air CO2 Enrichment (FACE)

TRY – Global Database of Plant Traits

• Scientists compiled three million trait records for 69,000 of the world's
~300,000 plant species.
• Worldwide collaboration of scientists from 106 research institutions
• TRY is hosted at the Max Planck Institute for Biogeochemistry in Jena
(Germany)
– Jointly coordinated with:
• University of Leipzig (Germany)
• IMBIV-CONICET (Argentina)
• Macquarie University (Australia)
• CNRS and University of Paris-Sud (France)
