Database talk for Bits & Bites meeting
Jill Wegrzyn, Department of Plant Sciences, University of California at Davis

  • How large is a typical genome? There is no simple answer, of course, because organisms vary widely in genome size. Arabidopsis, the tiny model species in the mustard family, was the first plant to have a fully sequenced genome. It sports a genome of 160 million bases. Poplar, the first tree sequenced, has about 480 million bases in its genome. The corn genome has nearly 2.5 billion bases, and the human genome around 3 billion. Genome size for conifers is substantially greater. In the figure above, genome size is given for 181 gymnosperms (mostly conifers). They vary in size from 6 to well over 30 billion bases. Some members of the lily family exceed 100 billion bases in size. Explanations for why genome size varies as it does among organisms are many and often speculative.

    1. Database talk for Bits & Bites meeting
       Jill Wegrzyn, Department of Plant Sciences, University of California at Davis
    2. Forest Genomics (Conifers)
       • Phylogenetic Representation
          - None currently exists. The conifers (gymnosperms) are the oldest of the major plant clades, arising some 300 million years ago. They are key to our understanding of the origins of genetic diversity in higher plants.
       • Ecological Representation
          - Conifers are of immense ecological importance, comprising the dominant life forms in most of the temperate and boreal ecosystems in the Northern Hemisphere.
       • Fundamental Genetic Information
          - Reference sequences are the fundamental data necessary to understand conifer biology and aid in guiding management of genetic resources.
       • Development of Genomic Technologies
          - The analytical and computational challenge of building a reference sequence for such large genomes will drive development of tools, strategies, and human resources throughout the genomics community.
    3. Existing and Planned Angiosperm Tree Genome Sequences

       Species                     Common name        Genome size (1)   Number of genes (2)   Status (3)
       In progress, with draft assemblies
       Populus trichocarpa         Black Cottonwood   500 Mbp           ~40,000               2.0 / 2.2
       Eucalyptus grandis          Rose Gum           691 Mbp           ~36,000               1.0 / 1.1
       Malus domestica             Apple              881 Mbp           ~26,000               1.0 / 1.0
       Prunus persica              Peach              227 Mbp           ~28,000               1.0 / 1.0
       Citrus sinensis             Sweet Orange       319 Mbp           ~25,000               1.0 / 1.0
       Carica papaya               Papaya             372 Mbp           -
       Amborella trichopoda        Amborella          870 Mbp           -
       In progress or planned, no published assemblies
       Castanea mollissima         Chinese Chestnut   800 Mbp           -
       Salix purpurea              Purple Willow      327 Mbp           -
       Quercus robur               Pedunculate Oak    740 Mbp           -
       Populus spp. and ecotypes   Various            various           -
       Azadirachta indica          Neem               384 Mbp           -

       (1) Genome size: approximate total size, not completely assembled.
       (2) Number of genes: approximate number of loci containing protein-coding sequence.
       (3) Status: assembly / annotation versions; http://www.phytozome.net/ ; http://asgpb.mhpcc.hawaii.edu/papaya/ ; http://www.amborella.org ; purple willow: http://www.poplar.ca/pdf/edomonton11smart.pdf ; neem: http://www.strandls.com/viewnews.php?param=5&param1=68
    4. Plant Genome Size Comparisons
       [Figure: 1C DNA content (Mb) by species; axis ticks run from 0 to 40,000, with a second scale from 0 to 3,000. Species labeled: Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea, Pinus lambertiana, Pinus pinaster, Pinus taeda, Picea abies, Picea glauca, Pseudotsuga menziesii, Taxodium distichum.]
    5. What can be discovered about a gene by a database search?
       • Best to have specific informational goals:
          - Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
          - Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
          - Structural information: associated protein structures, fold types, structural domains
          - Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
          - Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
    6. Using a database
       • How to get information out of a database:
          - Summaries: how many entries, average or extreme values; rates of change, most recent entries, etc.
          - Browsing: getting a sense of the kind and quality of information available, e.g. checking familiar records
          - Search: looking for specific, predefined information
       • "Key" to searching a database:
          - Must somehow identify the element(s) of the database that are of interest:
             • Gene name, symbol, location or other identifying information.
             • Sequences of genes, mRNAs, proteins, etc.
             • A cross-reference from another database or a database-generated ID.
    7. NCBI and Entrez
       • One of the most useful and comprehensive database collections is the NCBI, part of the National Library of Medicine.
          - Home to GenBank, PubMed & many other familiar DBs.
       • NCBI provides interesting summaries, browsers, and search tools
       • Entrez is their database search interface: http://www.ncbi.nlm.nih.gov/Entrez
       • Can search on gene names, chromosomal location, diseases, articles, keywords...
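
An Entrez keyword search can also be expressed as a URL, which is how scripts usually talk to NCBI. The sketch below only builds a query string for NCBI's E-utilities `esearch` service and does not contact the server. The database name `gene` and the search term are illustrative assumptions, not values taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service (esearch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an esearch URL for `term` in Entrez database `db` (no request is sent)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


# Hypothetical example: gene records restricted to one organism.
url = build_esearch_url("gene", "Populus trichocarpa[Organism]")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to inspect before anything is sent.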
    8. Types of Databases
       • Primary Databases
          - Original submissions by experimentalists
          - Content controlled by the submitter
             • Examples: GenBank (nr and nt), SNP, GEO
       • Derivative Databases
          - Built from primary data
          - Content controlled by third party (NCBI)
             • Examples: RefSeq, Plant Protein, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
    9. NCBI is not all there is...
       • Links to non-NCBI databases (see also "Link Out")
          - Reactome for pathways (also KEGG)
          - HGNC for nomenclature
          - HPRD protein information
          - Regulatory / binding site DBs (e.g. CREB; some not linked)
          - iHOP (information hyperlinked over proteins)
       • Other important gene/protein resources:
          - UniProt (most carefully annotated)
          - PDB (main macromolecular structure repository)
          - UCSC (best genome viewer & many useful "tracks")
          - DIP / MINT (protein-protein interactions)
          - More: InterPro, MetaCyc, Enzyme, etc.
          - Species Databases: TAIR, Gramene, MGI, WormBase, FlyBase, GDR, TreeGenes
       • Alternatives
          - SRA versus DNAnexus
    10. Flat Files
        Characteristics:
        • Data is stored as records in regular files
        • Records usually have a simple structure and fixed number of fields
        • For fast access may support indexing of fields in the records
        • No mechanisms for relating data between files
        • One needs special programs in order to access and manipulate the data
    11. Limitations of Flat Files
        • Most applications require that specific information can be quickly and efficiently retrieved
        • Often critical that performance does not degrade as more entities are added
        • Flat text files don’t always fulfill these requirements, especially when there are many entities and/or relationships
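
To make the flat-file trade-off concrete, here is a minimal sketch that assumes a tab-delimited file with three fields per record. The two records reuse the organisms from the Organism table slide; the field layout and the function names are assumptions made for illustration. It contrasts a linear scan, whose cost grows with file size, with a hand-built index of offsets, which is the kind of "special program" flat files require.

```python
import io

# A tiny tab-delimited flat file held in memory: one record per line,
# fixed number of fields (accession, organism name, genome size).
FLAT_FILE = io.StringIO(
    "NC_000913\tEscherichia coli K12\t4640000\n"
    "NC_003098\tStreptococcus pneumoniae R6\t2040000\n"
)


def scan_for(handle, accession):
    """Linear scan: reads records from the top until one matches."""
    handle.seek(0)
    for line in handle:
        acc, name, size = line.rstrip("\n").split("\t")
        if acc == accession:
            return name, int(size)
    return None


def build_index(handle):
    """Hand-built index: map each accession to the offset where its record starts."""
    handle.seek(0)
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        index[line.split("\t", 1)[0]] = offset
    return index


index = build_index(FLAT_FILE)

# Indexed lookup: jump straight to the record instead of scanning.
FLAT_FILE.seek(index["NC_003098"])
indexed_record = FLAT_FILE.readline().rstrip("\n").split("\t")
print(indexed_record)
print(scan_for(FLAT_FILE, "NC_000913"))
```

The index has to be rebuilt by the application whenever the file changes, and nothing here relates this file to any other file, which are the limitations the slides describe.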
    12. Relational Database
        Characteristics:
        • Data is organized into tables: rows & columns
        • Each row represents an instance of an entity
        • Each column represents an attribute of an entity
        • Metadata describes each table column
        • Relationships between entities are represented by values stored in the columns of the corresponding tables (keys)
        • Accessible through Structured Query Language (SQL)
    13. Metadata & Data: Table Organism

        Metadata
        Name        Type           Max Length   Description
        Name        Alphanumeric   100          Organism name
        Size        Integer        10           Genome length (bases)
        Gc          Float          5            Percent GC
        Accession   Alphanumeric   10           Accession number
        Release     Date           8            Release date
        Center      Alphanumeric   100          Genome center name
        Sequence    Alphanumeric   Variable     Sequence

        Data
        Name                          Size        Gc   Accession   Release      Center                  Sequence
        Escherichia coli K12          4,640,000   50   NC_000913   09/05/1997   Univ. Wisconsin         AGCTTTTCATT…
        Streptococcus pneumoniae R6   2,040,000   40   NC_003098   09/07/2001   Eli Lilly and Company   TTGAAAGAAAA…
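
As a rough illustration of how the metadata and data of the Organism table fit together, the sketch below declares the table in an in-memory SQLite database and then reads the column metadata back. The choice of SQLite, the SQL type names, and the primary-key declaration are assumptions. Dates are stored as the text shown on the slide, and the sequences are only the truncated prefixes the slide displays.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The column definitions play the role of the metadata table on the slide.
conn.execute(
    """
    CREATE TABLE Organism (
        Name      VARCHAR(100),            -- organism name
        Size      INTEGER,                 -- genome length (bases)
        Gc        REAL,                    -- percent GC
        Accession VARCHAR(10) PRIMARY KEY, -- accession number
        "Release" TEXT,                    -- release date, kept as written on the slide
        Center    VARCHAR(100),            -- genome center name
        Sequence  TEXT                     -- sequence (truncated on the slide)
    )
    """
)

# The rows are the data table on the slide.
conn.executemany(
    "INSERT INTO Organism VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("Escherichia coli K12", 4640000, 50.0, "NC_000913",
         "09/05/1997", "Univ. Wisconsin", "AGCTTTTCATT"),
        ("Streptococcus pneumoniae R6", 2040000, 40.0, "NC_003098",
         "09/07/2001", "Eli Lilly and Company", "TTGAAAGAAAA"),
    ],
)

# Metadata is itself queryable: list each column name and declared type.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Organism)")]
for name, declared_type in columns:
    print(name, declared_type)

row_count = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
print(row_count)
```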
    14. Relationships
        • Used to connect tables
        • Field(s) that have the same value in the related tables
        • Organism.Accession = Gene.OAccession
        • Organism.Accession
           - Unique
           - Primary key
        • Gene.OAccession
           - Not unique
           - Foreign key
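
The relationship `Organism.Accession = Gene.OAccession` can be exercised with a join. This is a minimal sketch in SQLite; the Gene table's other column and the gene names (`geneA`, `geneB`, `geneC`) are hypothetical placeholders, since the slides do not show the Gene table's contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to check the reference

conn.executescript(
    """
    CREATE TABLE Organism (
        Accession TEXT PRIMARY KEY,   -- unique: one row per organism
        Name      TEXT
    );
    CREATE TABLE Gene (
        GeneName   TEXT,              -- hypothetical column
        OAccession TEXT REFERENCES Organism(Accession)  -- not unique: many genes per organism
    );
    INSERT INTO Organism VALUES ('NC_000913', 'Escherichia coli K12');
    INSERT INTO Organism VALUES ('NC_003098', 'Streptococcus pneumoniae R6');
    INSERT INTO Gene VALUES ('geneA', 'NC_000913');
    INSERT INTO Gene VALUES ('geneB', 'NC_000913');
    INSERT INTO Gene VALUES ('geneC', 'NC_003098');
    """
)

# Connect the two tables through the shared accession value.
rows = conn.execute(
    """
    SELECT Organism.Name, Gene.GeneName
    FROM Organism
    JOIN Gene ON Organism.Accession = Gene.OAccession
    ORDER BY Gene.GeneName
    """
).fetchall()
for organism_name, gene_name in rows:
    print(organism_name, gene_name)
```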
    15. Schema: Representation of Table Organization
    16. SQL
        • ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems.
        • SQL statements are used to retrieve and update data in a database.
        • Includes:
           - Data Manipulation Language (DML)
           - Data Definition Language (DDL)
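
The difference between DDL and DML is easiest to see side by side. The sketch below runs a few statements of each kind against an in-memory SQLite database; the table is a cut-down, assumed version of the Organism table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: statements that define or change the structure of the database.
conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.execute("ALTER TABLE Organism ADD COLUMN Size INTEGER")

# DML: statements that retrieve or change the rows stored in that structure.
conn.execute(
    "INSERT INTO Organism VALUES (?, ?, ?)",
    ("NC_000913", "Escherichia coli K12", None),
)
conn.execute(
    "UPDATE Organism SET Size = ? WHERE Accession = ?",
    (4640000, "NC_000913"),
)
size = conn.execute(
    "SELECT Size FROM Organism WHERE Accession = ?", ("NC_000913",)
).fetchone()[0]
print(size)
```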
    17. DBMS Advantages
        • Program-data independence
        • Minimal data redundancy
        • Improved data consistency & quality
           - Access control
           - Transaction control
        • Improved accessibility & data sharing
        • Increased productivity of application development
        • Enforced standards
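
Transaction control is one of the consistency features listed above. As a small sketch, assuming a hypothetical `Sample` table, the code below groups two changes into one transaction; when the second change fails, the DBMS undoes the first as well, so the data never ends up half-updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sample (SampleId TEXT PRIMARY KEY, Status TEXT)")
conn.execute("INSERT INTO Sample VALUES ('S1', 'collected')")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if an exception escapes it.
    with conn:
        conn.execute("UPDATE Sample SET Status = 'extracted' WHERE SampleId = 'S1'")
        # This insert reuses an existing primary key, so it fails.
        conn.execute("INSERT INTO Sample VALUES ('S1', 'duplicate')")
except sqlite3.IntegrityError:
    pass

# The earlier UPDATE was rolled back together with the failed INSERT.
status = conn.execute("SELECT Status FROM Sample WHERE SampleId = 'S1'").fetchone()[0]
print(status)
```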
    18. DBMS
        • Software package for defining and managing a database.
        • Examples:
           - Proprietary: MS Access, MS SQL Server, DB2, Oracle, Sybase
           - Open source: MySQL, PostgreSQL
    19. http://dendrome.ucdavis.edu
    20. TreeGenes Database
        Encompasses Dendrome Resources, Dendrome Plone, TreeGenes Database & DiversiTree
        • Nine modules to store and interrelate data for query and analysis in PostgreSQL
        • Direct resource for nearly 2,500 forest geneticists representing 800 organizations worldwide. Over 6,000 unique visitors in December 2011.
        • Forest Geneticists Colleague module
        • Literature module
        • Transcriptome annotation pipeline and module
        • Comparative map module
        • Species module
        • Sequencing module
        • Primers module
        • Genotype/EST module
        • Phenotype/Expression module
        • Sample tracking module
    21. Genomic Resources
        678 species representing 77 genera
    22. Generic Model Organism Database
    23. CMAP: Obtaining TreeGenes (TG) Accession Number
        [Figure labels: Add literature data and (first) map file; Obtain TG Accession number!; (optional) Add additional map files]
    24. [Figure labels: Individual features and their locations on map; List of features on map]
    25. GMOD Genome Browser
        [Figure labels: Search and select data source; Tracks can be reordered or hidden as necessary]
    26. Douglas-fir Transcriptome Resources in TreeGenes
    27. Gene Ontology
        • Gene annotation system
        • Controlled vocabulary that can be applied to all organisms (protein/RNA)
        • Used to describe gene products
    28. [Figure: three panels, each labeled "= bud initiation": Metazoa, Saccharomyces, Viridiplantae]
    29. What’s in a name?
        • The same name can be used to describe different concepts
    30. What’s in a name?
        • Glucose synthesis
        • Glucose biosynthesis
        • Glucose formation
        • Glucose anabolism
        • Gluconeogenesis
        • All refer to the process of making glucose from simpler components
    31. How does GO work?
        What information might we want to capture about a gene product?
        • What does the gene product do?
        • Why does it perform these activities?
        • Where does it act?
    32. The 3 Gene Ontologies
        • Molecular Function = elemental activity/task
           - the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
        • Biological Process = biological goal or objective
           - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
        • Cellular Component = location or complex
           - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
    33. Ontology Structure
        Ontologies can be represented as graphs, where the nodes are connected by edges
        • Nodes = concepts in the ontology
        • Edges = relationships between the concepts
        [Figure: nodes joined by an edge]
    34. Ontology Structure
        • The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG)
        • Terms can have more than one parent and zero, one or more children
        • Terms are linked by two relationships
           - is-a
           - part-of
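
A directed acyclic graph with typed edges can be sketched with a plain dictionary. In the code below each term maps to its parent edges, so a term may have more than one parent. The four terms are borrowed from the cellular-component example, but the dictionary layout and the helper names are assumptions, not how GO itself is stored.

```python
# Each term maps to a list of (relationship, parent) edges.
PARENTS = {
    "cell": [],
    "nucleus": [("part_of", "cell")],
    "chromosome": [("part_of", "cell")],
    # Two parents, reached through two different relationship types.
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for _relationship, parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic(parents=PARENTS):
    """The graph is acyclic if no term is its own ancestor."""
    return all(term not in ancestors(term, parents) for term in parents)


print(sorted(ancestors("nuclear chromosome")))
print(is_acyclic())
```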
    35. True Path Rule
        • The path from a child term all the way up to its top-level parent(s) must always be true

        cell
           [part-of] cytoplasm
           [part-of] chromosome
              [is-a] nuclear chromosome
              [is-a] cytoplasmic chromosome
                 [is-a] mitochondrial chromosome
           [part-of] nucleus
              [part-of] nuclear chromosome
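
One practical consequence of the true path rule is that an annotation to a specific term implies annotation to every term above it. The sketch below walks upward from a term and collects everything that must also hold. The parent links are one reading of the hierarchy on the slide, and `geneX` is a hypothetical gene product.

```python
# Child term -> parent terms, as read from the slide's hierarchy (an assumption).
PARENT_OF = {
    "mitochondrial chromosome": ["cytoplasmic chromosome"],
    "cytoplasmic chromosome": ["chromosome"],
    "nuclear chromosome": ["chromosome", "nucleus"],
    "chromosome": ["cell"],
    "nucleus": ["cell"],
    "cytoplasm": ["cell"],
    "cell": [],
}


def implied_terms(term):
    """Return the term plus all of its ancestors; under the true path rule each must be true."""
    result = {term}
    for parent in PARENT_OF[term]:
        result |= implied_terms(parent)
    return result


# Hypothetical gene product annotated directly to one specific term.
direct = {"geneX": "mitochondrial chromosome"}
implied = {gene: implied_terms(term) for gene, term in direct.items()}
print(sorted(implied["geneX"]))
```

If any term on that upward path would be false for the gene product, either the annotation or the ontology structure needs to be revised.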
    36. What’s in a GO term?
        term: gluconeogenesis
        id: GO:0006094
        definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
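
A term record like the one above is just a set of key/value fields, which makes it easy to load into a program. The sketch below parses the three lines as shown on the slide and checks that the identifier has the usual GO shape, `GO:` followed by seven digits. The `key: value` layout mirrors the slide rather than any official file format.

```python
import re

RECORD = """\
term: gluconeogenesis
id: GO:0006094
definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
"""

# GO identifiers are the prefix "GO:" followed by seven digits.
GO_ID = re.compile(r"^GO:\d{7}$")


def parse_record(text):
    """Split each 'key: value' line at the first ': ' and return a dict of fields."""
    fields = {}
    for line in text.splitlines():
        key, _separator, value = line.partition(": ")
        fields[key] = value
    return fields


term = parse_record(RECORD)
print(term["id"], term["term"])
print(bool(GO_ID.match(term["id"])))
```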
    37. Source of Ontology Assignments
        IEA   Inferred from Electronic Annotation
        ISS   Inferred from Sequence Similarity
        IEP   Inferred from Expression Pattern
        IMP   Inferred from Mutant Phenotype
        IGI   Inferred from Genetic Interaction
        IPI   Inferred from Physical Interaction
        IDA   Inferred from Direct Assay
        RCA   Inferred from Reviewed Computational Analysis
        TAS   Traceable Author Statement
        NAS   Non-traceable Author Statement
        IC    Inferred by Curator
        ND    No biological Data available
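
Evidence codes are most useful as a filter on annotations. The sketch below keeps the code table from the slide in a dictionary and selects only annotations backed by experimental evidence. The grouping of IDA, IPI, IMP, IGI and IEP as "experimental" follows the usual GO convention, and the three annotations (`geneA` to `geneC`) are hypothetical.

```python
EVIDENCE = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Codes that rest on a laboratory experiment.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotations: (gene product, GO id, evidence code).
annotations = [
    ("geneA", "GO:0006094", "IEA"),
    ("geneB", "GO:0006094", "IDA"),
    ("geneC", "GO:0006094", "ISS"),
]

experimental_only = [a for a in annotations if a[2] in EXPERIMENTAL]
for gene, go_id, code in experimental_only:
    print(gene, go_id, EVIDENCE[code])
```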
    38. Ontology Development: Plant Ontology and Trait Ontology
        • Plant Ontology
           - Structure
              • Needle, Cambium
           - Growth stages
        • Trait Ontology
           - Forest Tree Specific Phenotypes
              • Wood Density
        • PATO
           - Phenotypic Qualities
    39. Current Ontology Listings: OBO Foundry
    40. Interwebs 101
        • Web 1.0: Hyperlinks
        • Web 2.0: Interactivity, information sharing, user-centered design (wikis, blogs, social media)
        • Web 3.0: Semantic Web
           - Data focused
           - Answer the limitations of HTML
           - HTML describes documents and the links between them. RDF, OWL, and XML, by contrast, can describe specific things
           - Machine-readable data and relationships between the data
           - Knowledge processing
           - Deductive reasoning and inference
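
The idea of machine-readable relationships plus inference can be shown without any Semantic Web library. In this sketch, facts are stored as (subject, predicate, object) triples in the spirit of RDF, and a small rule derives new `is_a` facts by transitivity. The triples reuse terms from the True Path Rule slide; plain Python tuples stand in for real RDF, which is an assumption made to keep the example self-contained.

```python
# Facts as (subject, predicate, object) triples.
TRIPLES = {
    ("mitochondrial chromosome", "is_a", "cytoplasmic chromosome"),
    ("cytoplasmic chromosome", "is_a", "chromosome"),
    ("nuclear chromosome", "part_of", "nucleus"),
}


def infer_transitive(triples, predicate="is_a"):
    """Add (a, p, c) whenever (a, p, b) and (b, p, c) are known, until nothing new appears."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(known):
            for b2, p2, c in list(known):
                new_fact = (a, predicate, c)
                if p1 == p2 == predicate and b == b2 and new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known


inferred = infer_transitive(TRIPLES)
for fact in sorted(inferred - TRIPLES):
    print(fact)
```

The derived fact was never stated explicitly; it follows from the stated ones, which is the kind of deductive step the slide attributes to Web 3.0.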
    41. Web Services Development: Communication within TreeGenes
        • Development of Web Services in cooperation with NSF’s iPlant Cyberinfrastructure Project
           - Software system to support interoperable machine-to-machine interaction over a network regardless of platform incompatibilities
           - Web Services Description Language (WSDL) is implemented to relate operations
        • Service Oriented Architecture (SOA): with SOAP, the basic unit of communication is a message.
        • Remote Procedure Call (RPC): RPC Web services define a call interface in which the basic unit is the WSDL operation.
        • Representational State Transfer (REST): REST uses HTTP by constraining the interface to standard operations (like GET, POST, PUT, DELETE for HTTP). The focus is on interacting with stateful resources, rather than messages or operations.
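
To illustrate the REST column, the sketch below prepares HTTP requests in which the resource is named by the URL and the action is one of the standard methods. Nothing is sent over the network. The host `example.org` and the `species/42` resource path are placeholders, not a real TreeGenes or iPlant endpoint.

```python
from urllib.request import Request

BASE = "https://example.org/api"  # placeholder host, not a real service


def rest_request(method, resource, body=None):
    """Prepare (but do not send) an HTTP request for a resource."""
    data = body.encode("utf-8") if body is not None else None
    return Request(f"{BASE}/{resource}", data=data, method=method)


# The same resource URL, acted on with different standard operations.
read = rest_request("GET", "species/42")
update = rest_request("PUT", "species/42", body='{"name": "Pinus taeda"}')
remove = rest_request("DELETE", "species/42")

for req in (read, update, remove):
    print(req.get_method(), req.full_url)
```

In an RPC-style service the same three actions would typically be three differently named operations; here only the HTTP method changes.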
    42. SSWAP Ontology
        Creating and Contributing to Existing Servlets for Common Genomic Types
    43. Forest Tree Genetic Stock Center
    44. Bulk Retrieval Window Components
        [Figure labels: Data & Annotation Selection Fields; Bulk Retrieval Window]
    45. TreeGenes Sample Tracking System
        • Accurately track samples through collection, DNA extraction, and genotyping
        • Provide a standard and efficient method to collect and store phenotypic data
        • Provide a public interface to readily query raw genotype, phenotype, and association results (DiversiTree)
        • Provide interfaces and database backend to support a DNA distribution center (UCD)
    46. Population Genetics: Association Studies, Landscape Genomics
        • Currently no other repositories to target association data with geo-referenced data
           - dbGaP
           - Dryad
        • Starting with enforcement at the journal level: Tree Genetics and Genomes
    47. GenSAS development with Content Management: Plone and Drupal
        [Figure labels: login/signup panel, query sequence panel, data retrieval panel, tool selection panel, task queue panel]
    48. GenSAS development: Multiple Gene Prediction Tracks
        [Figure labels: overview track, control track, sequence track, evidence tracks, custom track, function track, message box]
    49. GenSAS integration with GBrowse
        Prototyped with Peach Genome in GDR
    50. Analysis Resources: Custom Databases
    51. Integrating Tools into TreeGenes: Galaxy
    52. Genomic resources
    53. Fluxes of CO2 and H2O: FLUXNET and AmeriFlux
        Free Air CO2 Enrichment (FACE)
    54. TRY: Global Database of Plant Traits
        • Scientists compiled three million traits for 69,000 out of the world's ~300,000 plant species.
        • Worldwide collaboration of scientists from 106 research institutions
        • TRY is hosted at the Max Planck Institute for Biogeochemistry in Jena (Germany)
           - Jointly coordinated with:
              • University of Leipzig (Germany)
              • IMBIV-CONICET (Argentina)
              • Macquarie University (Australia)
              • CNRS and University of Paris-Sud (France)