The document discusses developing a reference genome sequence for conifers. Key points:
1) Conifers are ecologically and evolutionarily important but no reference genome currently exists for the group.
2) Developing a conifer reference genome would provide fundamental genetic information and drive the development of genomic tools.
3) Several angiosperm tree genomes have been sequenced or are in progress, but no conifer genome assembly has been published yet. The document outlines existing and planned angiosperm tree genome projects.
1. Database talk for
Bits & Bites meeting
Jill Wegrzyn
Department of Plant Sciences
University of California at Davis
2. Forest Genomics (Conifers)
• Phylogenetic Representation –
– None currently exists. The conifers (gymnosperms) are the oldest of the major
plant clades, arising some 300 million years ago. They are key to our
understanding of the origins of genetic diversity in higher plants.
• Ecological Representation –
– Conifers are of immense ecological importance, comprising the dominant life
forms in most of the temperate and boreal ecosystems in the Northern
Hemisphere.
• Fundamental Genetic Information –
– Reference sequences are the fundamental data necessary to understand
conifer biology and aid in guiding management of genetic resources.
• Development of Genomic Technologies –
– The analytical and computational challenge of building a reference sequence
for such large genomes will drive development of tools, strategies, and human
resources throughout the genomics community.
3. Existing and Planned Angiosperm Tree Genome Sequences
Species                    Common name       Genome size (1)  Number of genes (2)  Status (3)
In Progress With Draft Assemblies
Populus trichocarpa        Black Cottonwood  500 Mbp          ~40,000              2.0 / 2.2
Eucalyptus grandis         Rose Gum          691 Mbp          ~36,000              1.0 / 1.1
Malus domestica            Apple             881 Mbp          ~26,000              1.0 / 1.0
Prunus persica             Peach             227 Mbp          ~28,000              1.0 / 1.0
Citrus sinensis            Sweet Orange      319 Mbp          ~25,000              1.0 / 1.0
Carica papaya              Papaya            372 Mbp          -
Amborella trichopoda       Amborella         870 Mbp          -
In Progress Or Planned – No Published Assemblies
Castanea mollissima        Chinese Chestnut  800 Mbp          -
Salix purpurea             Purple Willow     327 Mbp          -
Quercus robur              Pedunculate Oak   740 Mbp          -
Populus spp. and ecotypes  Various           various          -
Azadirachta indica         Neem              384 Mbp          -
1) Genome size: Approximate total size, not completely assembled.
2) Number of genes: Approximate number of loci containing protein-coding sequence.
3) Status: Assembly / Annotation versions; http://www.phytozome.net/ ; http://asgpb.mhpcc.hawaii.edu/papaya/ ; http://www.amborella.org ;
purple willow: Http://www.poplar.ca/pdf/edomonton11smart.pdf ; neem: http://www.strandls.com/viewnews.php?param=5&param1=68
5. What can be discovered about a gene by a database search?
• Best to have specific informational goals:
– Evolutionary information: homologous genes, taxonomic
distributions, allele frequencies, synteny, etc.
– Genomic information: chromosomal location, introns, UTRs,
regulatory regions, shared domains, etc.
– Structural information: associated protein structures, fold types,
structural domains
– Expression information: expression specific to particular tissues,
developmental stages, phenotypes, diseases, etc.
– Functional information: enzymatic/molecular function,
pathway/cellular role, localization, role in diseases
6. Using a database
• How to get information out of a database:
– Summaries: how many entries, average or extreme
values; rates of change, most recent entries, etc.
– Browsing: getting a sense of the kind and quality of
information available, e.g. checking familiar records
– Search: looking for specific, predefined information
• “Key” to searching a database:
– Must identify the element(s) of the database that are of
interest somehow:
• Gene name, symbol, location or other identifying information.
• Sequences of genes, mRNAs, proteins, etc.
• A cross-reference from another database or a database-generated ID.
7. NCBI and Entrez
• One of the most useful and comprehensive database
collections is the NCBI, part of the National Library of
Medicine.
– Home to GenBank, PubMed & many other familiar DBs.
• NCBI provides interesting summaries, browsers, and
search tools
• Entrez is their database search interface
http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, chromosomal location,
diseases, articles, keywords...
8. Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank (nr and nt), SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, Plant Protein, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
10. NCBI is not all there is...
• Links to non-NCBI databases (see also “Link Out”)
– Reactome for pathways (also KEGG)
– HGNC for nomenclature
– HPRD protein information
– Regulatory / binding site DBs (e.g. CREB; some not linked)
– IHOP (information hyperlinked over proteins)
• Other important gene/protein resources:
– UniProt (most carefully annotated)
– PDB (main macromolecular structure repository)
– UCSC (best genome viewer & many useful "tracks")
– DIP / MINT (protein-protein interactions)
– More: InterPro, MetaCyc, Enzyme, etc. etc.
– Species Databases: TAIR, Gramene, MGI, WormBase, FlyBase, GDR, TreeGenes
• Alternatives
– SRA versus DNAnexus
11. Flat Files
Characteristics:
• Data is stored as records in regular files
• Records usually have a simple structure and fixed
number of fields
• For fast access may support indexing of fields in the
records
• No mechanisms for relating data between files
• One needs special programs in order to access and
manipulate the data
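A minimal sketch of the "special program" a flat file needs, assuming a hypothetical tab-delimited layout (accession, name, size) that is not taken from any real database. It builds an index from accession to byte offset so that a single record can be fetched without rescanning the file; the function names are illustrative only.

```python
import io

# Assumed field order for this hypothetical flat file.
FIELDS = ("accession", "name", "size")


def parse_record(line: bytes) -> dict:
    """Split one tab-delimited line into a record with a fixed set of fields."""
    values = line.decode("utf-8").rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values))
    record["size"] = int(record["size"])
    return record


def build_index(handle) -> dict:
    """Map each accession to the byte offset where its record starts."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        index[parse_record(line)["accession"]] = offset
    return index


def fetch(handle, index: dict, accession: str) -> dict:
    """Jump straight to a record using the index instead of scanning."""
    handle.seek(index[accession])
    return parse_record(handle.readline())


# An in-memory stand-in for a file on disk.
flat_file = io.BytesIO(
    b"NC_000913\tEscherichia coli K12\t4640000\n"
    b"NC_003098\tStreptococcus pneumoniae R6\t2040000\n"
)
index = build_index(flat_file)
record = fetch(flat_file, index, "NC_003098")
print(record["name"], record["size"])
```

Note that nothing here relates this file to any other file; that bookkeeping would have to be written by hand as well.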
12. Limitations of Flat Files
• Most applications require that specific
information can be quickly and efficiently
retrieved
• Often critical that performance does not
degrade as more entities are added
• Flat text files don’t always fulfill these
requirements, especially when there are many
entities and/or relationships
13. Relational Database
Characteristics:
• Data is organized into tables: rows & columns
• Each row represents an instance of an entity
• Each column represents an attribute of an entity
• Metadata describes each table column
• Relationships between entities are represented by
values stored in the columns of the corresponding
tables (keys)
• Accessible through Structured Query Language (SQL)
14. Metadata & Data Table
Metadata for table Organism:
Name       Type          Max Length  Description
Name       Alphanumeric  100         Organism name
Size       Integer       10          Genome length (bases)
Gc         Float         5           Percent GC
Accession  Alphanumeric  10          Accession number
Release    Date          8           Release date
Center     Alphanumeric  100         Genome center name
Sequence   Alphanumeric  Variable    Sequence
Data in table Organism:
Name                         Size       Gc  Accession  Release     Center                 Sequence
Escherichia coli K12         4,640,000  50  NC_000913  09/05/1997  Univ. Wisconsin        AGCTTTTCATT…
Streptococcus pneumoniae R6  2,040,000  40  NC_003098  09/07/2001  Eli Lilly and Company  TTGAAAGAAAA…
…
15. Relationships
• Used to connect tables
• Field(s) that have the same value in the related tables
• Organism.Accession=Gene.OAccession
• Organism.Accession
– Unique
– Primary key
• Gene.OAccession
– Not unique
– Foreign key (secondary key referencing Organism.Accession)
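The relationship Organism.Accession = Gene.OAccession can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: the Gene table's GeneName column and the gene names geneA, geneB and geneC are hypothetical placeholders, and SQLite stands in for whichever DBMS is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only when asked

# Organism.Accession is unique: the primary key.
conn.execute(
    "CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT NOT NULL)"
)
# Gene.OAccession is not unique: many genes point at one organism.
conn.execute(
    """CREATE TABLE Gene (
           GeneName   TEXT NOT NULL,
           OAccession TEXT NOT NULL REFERENCES Organism(Accession)
       )"""
)

conn.executemany(
    "INSERT INTO Organism VALUES (?, ?)",
    [
        ("NC_000913", "Escherichia coli K12"),
        ("NC_003098", "Streptococcus pneumoniae R6"),
    ],
)
conn.executemany(
    "INSERT INTO Gene VALUES (?, ?)",
    [  # placeholder gene names, not real annotations
        ("geneA", "NC_000913"),
        ("geneB", "NC_000913"),
        ("geneC", "NC_003098"),
    ],
)

# Connect the two tables through the shared key value.
rows = conn.execute(
    """SELECT Organism.Name, Gene.GeneName
       FROM Organism
       JOIN Gene ON Organism.Accession = Gene.OAccession
       ORDER BY Gene.GeneName"""
).fetchall()
for name, gene in rows:
    print(name, gene)
```

Because the reference is declared in the schema, inserting a gene whose OAccession matches no organism is rejected by the database rather than by application code.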
17. SQL
• ANSI (American National Standards Institute)
standard computer language for accessing and
manipulating database systems.
• SQL statements are used to retrieve and
update data in a database.
• Includes:
– Data Manipulation Language (DML)
– Data Definition Language (DDL)
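To make the DDL/DML split concrete, here is a small sketch using Python's sqlite3 module. The table is a cut-down, assumed version of the Organism table; DDL statements define or change the structure, while DML statements read and change the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and alter the structure of the database.
conn.execute("CREATE TABLE Organism (Name TEXT, Size INTEGER)")
conn.execute("ALTER TABLE Organism ADD COLUMN Gc REAL")

# DML: insert, update, query, and delete rows.
conn.execute(
    "INSERT INTO Organism (Name, Size) VALUES (?, ?)",
    ("Escherichia coli K12", 4640000),
)
conn.execute(
    "UPDATE Organism SET Gc = ? WHERE Name = ?",
    (50.0, "Escherichia coli K12"),
)
row = conn.execute("SELECT Name, Size, Gc FROM Organism").fetchone()
print(row)

conn.execute("DELETE FROM Organism WHERE Name = ?", ("Escherichia coli K12",))
remaining = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
print(remaining)
```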
18. DBMS Advantages
• Program-data independence
• Minimal data redundancy
• Improved data consistency & quality
– Access control
– Transaction control
• Improved accessibility & data sharing
• Increased productivity of application development
• Enforced standards
19. DBMS
• Software package for defining and managing a
database.
• Examples:
– Proprietary: MS Access, MS SQL Server, DB2,
Oracle, Sybase
– Open source: MySQL, PostgreSQL
21. TreeGenes Database
Encompasses Dendrome Resources, Dendrome Plone, TreeGenes Database & DiversiTree
• Nine modules to store and interrelate data for query and analysis in PostgreSQL
• Direct resource for nearly 2,500 forest geneticists representing 800 organizations
worldwide. Over 6,000 unique visitors in December 2011.
• Forest Geneticists Colleague module
• Literature module
• Transcriptome annotation pipeline and module
• Comparative map module
• Species module
• Sequencing module
• Primers module
• Genotype/EST module
• Phenotype/Expression module
• Sample tracking module
30. What’s in a name?
• The same name can be used to describe
different concepts
31. What's in a name?
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism
• Gluconeogenesis
• All refer to the process of making glucose from
simpler components
32. How does GO work?
What information might we want to
capture about a gene product?
• What does the gene product do?
• Why does it perform these activities?
• Where does it act?
33. The 3 Gene Ontologies
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate
binding and ATPase activity
• Biological Process = biological goal or objective
– broad biological goals, such as mitosis or purine metabolism, that
are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes;
examples include nucleus, telomere, and RNA polymerase II
holoenzyme
34. Ontology Structure
Ontologies can be represented as graphs,
where the nodes are connected by edges
Nodes = concepts in the ontology
Edges = relationships between the concepts
[Figure: nodes connected by edges]
35. Ontology Structure
• The Gene Ontology is structured as a
hierarchical directed acyclic graph (DAG)
• Terms can have more than one parent and
zero, one or more children
• Terms are linked by two relationships
– is-a
– part-of
36. True Path Rule
• The path from a child term all the way up to its
top-level parent(s) must always be true
[Figure: example hierarchy linked by is-a and part-of relationships. Terms shown: cell, cytoplasm, nucleus, chromosome, nuclear chromosome, cytoplasmic chromosome, mitochondrial chromosome]
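The DAG structure and the true path rule can be sketched in a few lines of Python. This is a toy graph, not the real Gene Ontology: the term names come from the slide, but the specific is-a and part-of edges are simplified assumptions, and the gene name is a placeholder. Under the true path rule, annotating a gene product to a term implies annotation to every ancestor of that term.

```python
# Toy ontology: each term maps to its (relationship, parent) links.
# A term may have more than one parent, which makes this a DAG, not a tree.
PARENTS = {
    "cell": [],
    "nucleus": [("part-of", "cell")],
    "cytoplasm": [("part-of", "cell")],
    "chromosome": [("part-of", "cell")],
    "nuclear chromosome": [("is-a", "chromosome"), ("part-of", "nucleus")],
    "cytoplasmic chromosome": [("is-a", "chromosome"), ("part-of", "cytoplasm")],
    "mitochondrial chromosome": [("is-a", "cytoplasmic chromosome")],
}


def ancestors(term: str) -> set:
    """Collect every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in PARENTS[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def propagate(direct_terms: set) -> set:
    """Apply the true path rule: direct annotations plus all their ancestors."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term)
    return implied


# A placeholder gene product annotated directly to one term.
annotations = {"geneX": {"mitochondrial chromosome"}}
implied = propagate(annotations["geneX"])
print(sorted(implied))
```

In this toy graph "nucleus" is not implied for a mitochondrial chromosome, which is the point of the rule: if "chromosome" were made part-of "nucleus", the path from "mitochondrial chromosome" to the top would no longer be true.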
37. What's in a GO term?
term: gluconeogenesis
id: GO:0006094
definition: The formation of glucose from
noncarbohydrate precursors, such as
pyruvate, amino acids and glycerol.
38. Source of Ontology Assignments
IEA  Inferred from Electronic Annotation
ISS  Inferred from Sequence Similarity
IEP  Inferred from Expression Pattern
IMP  Inferred from Mutant Phenotype
IGI  Inferred from Genetic Interaction
IPI  Inferred from Physical Interaction
IDA  Inferred from Direct Assay
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
NAS  Non-traceable Author Statement
IC   Inferred by Curator
ND   No biological Data available
39. Ontology Development
Plant Ontology and Trait Ontology
• Plant Ontology
– Structure
• Needle, Cambium
– Growth stages
• Trait Ontology
– Forest Tree Specific Phenotypes
• Wood Density
• PATO
– Phenotypic Qualities
41. Interwebs 101
• Web 1.0 – Hyperlinks
• Web 2.0 – Interactivity, information sharing, user
centered design (wikis, blogs, social media)
• Web 3.0 – Semantic Web
– Data focused
– Addresses the limitations of HTML
– HTML describes documents and the links between them.
RDF, OWL, and XML, by contrast, can describe specific
things
– Machine-readable data and relationships between the
data – knowledge processing – deductive reasoning and
inference
42. Web Services Development
Communication within TreeGenes
• Development of Web Services in cooperation with
NSF's iPlant Cyberinfrastructure Project
– Software system to support interoperable machine-to-machine
interaction over a network regardless of platform
incompatibilities
– Web Services Description Language (WSDL) is implemented to
relate operations
• Service Oriented Architecture (SOA)
– With SOAP, the basic unit of communication is a message
• Remote Procedure Call (RPC)
– RPC Web services define a call interface in which the basic unit is
the WSDL operation.
• Representational State Transfer (REST)
– REST uses HTTP by constraining the interface to standard operations
(like GET, POST, PUT, DELETE for HTTP). The focus is on interacting
with stateful resources, rather than messages or operations.
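As a rough illustration of the REST style described above, the sketch below builds (but never sends) HTTP requests with Python's standard urllib. The base URL, the samples/42 resource path, and the JSON body are all hypothetical and do not describe any actual TreeGenes or iPlant endpoint. The point is only that the resource is named by the URL while the operation is one of a small, fixed set of HTTP methods.

```python
from urllib.request import Request

BASE_URL = "https://example.org/api"  # hypothetical service root


def rest_request(method: str, resource: str, body: str = None) -> Request:
    """Build a request: the URL names the resource, the method names the operation."""
    data = body.encode("utf-8") if body is not None else None
    return Request(f"{BASE_URL}/{resource}", data=data, method=method)


# Three standard operations on one (hypothetical) resource.
requests = [
    rest_request("GET", "samples/42"),                             # read it
    rest_request("PUT", "samples/42", '{"status": "genotyped"}'),  # replace it
    rest_request("DELETE", "samples/42"),                          # remove it
]

summary = [(r.get_method(), r.full_url) for r in requests]
for method, url in summary:
    print(method, url)
```

An RPC-style interface would instead expose one named operation per action, for example a separate call for each of the three steps, which is what a WSDL document enumerates.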
47. TreeGenes Sample Tracking System
• Accurately track samples through collection, DNA extraction, and genotyping
• Provide a standard and efficient method to collect and store phenotypic data
• Provide a public interface to readily query raw genotype, phenotype, and
association results (DiversiTree)
• Provide interfaces and database backend to support a DNA distribution
center (UCD)
48. Population Genetics
Association Studies, Landscape Genomics
• Currently no other repositories to target association data with geo-referenced data
• dbGaP
• Dryad
• Starting with enforcement at the journal level: Tree Genetics and Genomes
50. GenSAS development with Content Management
Plone and Drupal
login/signup panel
query sequence panel
data retrieval panel
tool selection panel
task queue panel
51. GenSAS development
Multiple Gene Prediction Tracks
overview track
control track
sequence track
evidence tracks
custom track
function track
message box
56. Fluxes of CO2 and H2O: FLUXNET and AmeriFlux
Free Air CO2 Enrichment (FACE)
57. TRY – Global Database of Plant Traits
• Scientists compiled three million traits for 69,000 out of the world's
~300,000 plant species.
• Worldwide collaboration of scientists from 106 research institutions
• TRY is hosted at the Max Planck Institute for Biogeochemistry in Jena
(Germany)
– Jointly coordinated with:
• University of Leipzig (Germany)
• IMBIV-CONICET (Argentina)
• Macquarie University (Australia)
• CNRS and University of Paris-Sud (France)
Editor's Notes
How large is a typical genome? There is no simple answer, of course, for organisms vary widely in genome size. Arabidopsis, the tiny model species in the mustard family, was the first plant to have a fully sequenced genome. It sports a genome of 160 million bases. Poplar, the first tree sequenced, has about 480 million bases in its genome. The corn genome has nearly 2.5 billion bases, and the human genome around 3 billion. Genome size for conifers is substantially greater. In the figure above, genome size is given for 181 gymnosperms (mostly conifers). They vary in size from 6 to well over 30 billion bases. Some members of the lily family exceed 100 billion bases in size. Explanations for why genome size varies as it does among organisms are many and often speculative.