Databases_CSS2.pptx

  1. 1. Databases • A data structure that stores organized information. Most databases contain multiple tables, each of which may include several different fields. • A database-management system (DBMS) is a computer-software application that interacts with end-users, other applications, and the database itself to capture and analyze data. A general-purpose DBMS allows the definition, creation, querying, update, and administration of databases.
  2. 2. Biological databases • Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures. • Biological databases can be broadly classified into sequence, structure and functional databases.
  3. 3. Biological databases • Contain files or tables, each holding numerous records and fields • Simplest form (flat file): either a single large text file or a collection of text files • Commonest type (relational): stores the data within a number of tables (with records and fields). Tables are linked to one another by a shared field called a key
  4. 4. Bibliography
  5. 5. Flat file and relational database models • The operators are written in query-specific languages based on relational algebra • Structured Query Language (SQL) is commonly used
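
To make the idea of tables linked by a shared key concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (`gene`, `protein`), the column names, and the rows are all hypothetical and exist only for illustration; they are not taken from any real biological database.

```python
import sqlite3

# In-memory database; the schema and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (
        gene_id INTEGER PRIMARY KEY,
        symbol  TEXT NOT NULL
    );
    CREATE TABLE protein (
        protein_id INTEGER PRIMARY KEY,
        gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),  -- shared key
        length_aa  INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)",
                 [(1, "geneA"), (2, "geneB")])
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                 [(10, 1, 281), (11, 2, 120), (12, 1, 95)])


def proteins_for(symbol):
    """Return (protein_id, length_aa) rows for a gene symbol via a key join."""
    query = """
        SELECT p.protein_id, p.length_aa
        FROM protein AS p
        JOIN gene AS g ON g.gene_id = p.gene_id
        WHERE g.symbol = ?
        ORDER BY p.protein_id
    """
    return conn.execute(query, (symbol,)).fetchall()


print(proteins_for("geneA"))
```

The `JOIN ... ON g.gene_id = p.gene_id` clause is the relational-algebra join expressed in SQL: the two tables never store each other's data, only the key that links them.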
  6. 6. • XML (eXtensible Markup Language) is now a general tool for storage of data and information. XHTML is an XML-based reformulation of HTML. • The key feature is the use of identifiers called tags • <title> Understanding Bioinformatics </title> • A <publisher> tag can be defined and used to identify book publishers • Extraction from an XML file is similar to database querying.
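
The claim that extraction from XML resembles database querying can be sketched with Python's standard `xml.etree.ElementTree` module. The `<library>` and `<book>` wrapper elements, the second book, and both publisher names are assumptions made up for this example; only the `<title>` and `<publisher>` tags come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only <title> and <publisher> appear on the slide.
XML_TEXT = """
<library>
  <book>
    <title>Understanding Bioinformatics</title>
    <publisher>Example Press</publisher>
  </book>
  <book>
    <title>Another Book</title>
    <publisher>Other House</publisher>
  </book>
</library>
"""

root = ET.fromstring(XML_TEXT.strip())


def titles_by_publisher(tree_root, publisher):
    """Select book titles whose <publisher> text equals the given name."""
    return [
        book.findtext("title")
        for book in tree_root.findall("book")
        if book.findtext("publisher") == publisher
    ]


print(titles_by_publisher(root, "Example Press"))
```

Filtering on `<publisher>` and returning `<title>` plays the same role as a `WHERE` clause and a `SELECT` column list in SQL.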
  7. 7. Databases • Layers: Information system, Query system, Storage system, Data • Data (examples): GenBank flat file, PDB file, interaction record, title of a book, book
  8. 8. Databases • Layers: Information system, Query system, Storage system, Data • Storage system (examples): boxes, Oracle, MySQL, PC binary files, Unix text files, bookshelves
  9. 9. Databases • Layers: Information system, Query system, Storage system, Data • Query system (examples): a list you look at, a catalogue, indexed files, SQL, grep
  10. 10. Databases • Layers: Information system, Query system, Storage system, Data • Information system (examples): the UBC library, Google, Entrez, SRS
  11. 11. Bioinformatics Information Space (July 17, 1999) • Nucleotide sequences: 4,456,822 • Protein sequences: 706,862 • 3D structures: 9,780 • Human Unigene Clusters: 75,832 • Maps and Complete Genomes: 10,870 • Different species nodes: 52,889 • dbSNP: 6,377 • RefGenes: 515 • Human contigs > 250 kb: 341 (4.9MB) • PubMed records: 10,372,886 • OMIM records: 10,695
  12. 12. The challenge of the information space (Feb 10, 2004) • Nucleotide records: 36,653,899 • Protein sequences: 4,436,362 • 3D structures: 19,640 • Interactions & complexes: 52,385 • Human Unigene Clusters: 118,517 • Maps and Complete Genomes: 6,948 • Different taxonomy nodes: 283,121 • Human dbSNP: 13,179,601 • Human RefSeq records: 22,079 • bp in Human Contigs > 5,000 kb (116): 2,487,920,000 • PubMed records: 12,570,540 • OMIM records: 15,138
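
The two snapshots (July 17, 1999 and Feb 10, 2004) can be compared with a little arithmetic. The sketch below computes fold changes for the categories that appear in both lists; it assumes that "Nucleotide sequences" (1999) and "Nucleotide records" (2004) count the same kind of entry, which the slides do not state explicitly.

```python
# Counts copied from the two snapshot slides (1999 and 2004).
counts_1999 = {
    "Nucleotide": 4_456_822,
    "Protein": 706_862,
    "3D structures": 9_780,
    "PubMed": 10_372_886,
    "OMIM": 10_695,
}
counts_2004 = {
    "Nucleotide": 36_653_899,
    "Protein": 4_436_362,
    "3D structures": 19_640,
    "PubMed": 12_570_540,
    "OMIM": 15_138,
}


def fold_change(before, after):
    """Ratio of the later count to the earlier count."""
    return after / before


growth = {
    name: fold_change(counts_1999[name], counts_2004[name])
    for name in counts_1999
}

# Print categories from fastest to slowest growing.
for name, ratio in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} x{ratio:.2f}")
```

Sequence collections grew several-fold over the period while the literature and OMIM grew far more slowly, which is the "challenge" the slide title points at.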
  13. 13. Databases • Primary (archival) – GenBank/EMBL/DDBJ – UniProt – PDB – Medline (PubMed) – BIND • Secondary (curated) – RefSeq – Taxon – UniProt – OMIM – SGD
  14. 14. http://nar.oupjournals.org/content/vol31/issue1/
  15. 15. Tools of trade for the “armchair scientist” • Databases – PubMed and other NCBI databases – Biochemical databases – Protein domain databases – Structural databases – Genome comparison databases • Tools – CDD / COGs – VAST / FSSP
  16. 16. Distribution of the type of databases as classified at the NAR database web site
  17. 17. Types of databases • Archival or Primary Data – Text: PubMed – DNA Sequence: GenBank – Protein Sequence: Entrez Proteins, TREMBL – Protein Structures: PDB • Curated or Processed Data – DNA sequences: RefSeq, LocusLink, OMIM – Protein Sequences: SWISS-PROT, PIR – Protein Structures: SCOP, CATH, MMDB – Genomes: Entrez Genomes, COGs • Nucleic Acids Research: Database Issue each January 1, with articles on ~100 different databases
  18. 18. The National Center for Biotechnology Information (NCBI) • Created as a part of the National Library of Medicine, National Institutes of Health in 1988 – Establish public databases – Research in computational biology – Develop software tools for sequence analysis – Disseminate biomedical information • Tools: BLAST (1990), Entrez (1992) • GenBank (1992) • Free MEDLINE (PubMed, 1997) • Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq
  19. 19. What is GenBank? • Archival nucleotide sequence database • Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served” • Data are shared nightly among three collaborating databases: • GenBank at NCBI - Bethesda, Maryland, USA • DNA Database of Japan (DDBJ) at NIG - Mishima, Japan • European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK
  20. 20. Some guiding principles of working with GenBank • GenBank is a nucleotide-centric view of the information space • GenBank is a repository of all publicly available sequences • In GenBank, records are grouped for various reasons • Data in GenBank are only as good as what was put in
  21. 21. NCBI databases and their links (diagram) • Databases: Article abstracts (Medline), Nucleotide sequences, Protein sequences, 3-D structure (MMDB), Genomes, Taxonomy • Link types: Word weight, BLAST, VAST, Phylogeny
  22. 22. PDB • Protein Data Bank – Protein and NA 3D structures – Sequence present – YAFFF
  23. 23. PDB • HEADER • COMPND • SOURCE • AUTHOR • DATE • JRNL • REMARK • SEQRES • ATOM COORDINATES
      Example record (entry 1DGC):
      HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
      COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
      COMPND   2 ATF/CREB SITE DNA
      SOURCE    GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC
      AUTHOR    T.J.RICHMOND
      REVDAT   1   22-JUN-94 1DGC    0
      JRNL        AUTH   P.KONIG,T.J.RICHMOND
      JRNL        TITL   THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO
      JRNL        TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA
      JRNL        TITL 3 FLEXIBILITY
      JRNL        REF    J.MOL.BIOL.                   V. 233   139 1993
      JRNL        REFN   ASTM JMOBAK  UK ISSN 0022-2836                  0070
      REMARK   1
      REMARK   2
      REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
      REMARK   3
      REMARK   3 REFINEMENT.
      REMARK   3   PROGRAM                    X-PLOR
      REMARK   3   AUTHORS                    BRUNGER
      REMARK   3   R VALUE                    0.216
      REMARK   3   RMSD BOND DISTANCES        0.020 ANGSTROMS
      REMARK   3   RMSD BOND ANGLES           3.86  DEGREES
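
Because each line of a PDB file begins with a record name, the header can be read with very little code. The following is a minimal sketch that groups lines by the record name in their first six columns and pulls the resolution out of the `REMARK 2` lines. It uses a shortened copy of the 1DGC header, and the helper names (`group_records`, `resolution`) are invented here. A production parser would honor the fixed-column layout of each record type rather than this loose approach.

```python
import re
from collections import defaultdict

# A few lines from the 1DGC header, shortened for the example.
SAMPLE = """\
HEADER    LEUCINE ZIPPER                          15-JUL-93   1DGC
COMPND    GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC
COMPND   2 ATF/CREB SITE DNA
AUTHOR    T.J.RICHMOND
REMARK   2 RESOLUTION. 3.0  ANGSTROMS.
"""


def group_records(text):
    """Map each record name (columns 1-6) to the list of its line bodies."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        records[line[:6].strip()].append(line[6:].strip())
    return dict(records)


def resolution(records):
    """Return the resolution in angstroms from REMARK lines, or None."""
    for body in records.get("REMARK", []):
        match = re.search(r"RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS", body)
        if match:
            return float(match.group(1))
    return None


records = group_records(SAMPLE)
id_code = records["HEADER"][0].split()[-1]  # last token on the HEADER line
print(id_code, resolution(records))
```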
  24. 24. Accessing information on molecular sequences Page 26
  25. 25. GenBank Record (annotated figure) • Labeled fields: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence • [rest of protein sequence deleted for brevity] • [rest of nucleotide sequence deleted for brevity]
  26. 26. LOCUS, Accession, NID and protein_id • LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier. • ACCESSION: A unique identifier for the record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publications. • VERSION: New system where the accession and version together play the same role as the accession and gi number. • Nucleotide gi: GenInfo identifier (gi), a unique integer which changes every time the sequence changes. • PID: Protein Identifier: g, e or d prefix to gi number. Can have one or two on one CDS. • Protein gi: GenInfo identifier (gi), a unique integer which changes every time the sequence changes. • Protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.
  27. 27. Protein sequence motif is a descriptor of a protein family • Glutamine amidotransferase class I: [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA] [C is the active site residue] • Glutamine amidotransferase class II: <x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG] [C is the active site residue]
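
The Accession.version scheme separates a stable record name from a counter that increases when the sequence changes. The sketch below splits such an identifier into its two parts. The pattern (upper-case letters, an optional underscore, digits, a dot, then an integer version) is an assumption about the general shape of these identifiers, and the example identifiers are made up for illustration.

```python
import re

# Assumed shape: letters, optional underscore, digits, ".", integer version.
ACCESSION_VERSION = re.compile(r"^([A-Z]+_?\d+)\.(\d+)$")


def split_accession_version(identifier):
    """Split 'ACCESSION.version' into (accession, version as int)."""
    match = ACCESSION_VERSION.match(identifier)
    if match is None:
        raise ValueError(f"not an Accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


def same_record(id_a, id_b):
    """True when two identifiers name the same record, whatever the version."""
    return split_accession_version(id_a)[0] == split_accession_version(id_b)[0]


# Made-up identifiers, used only to exercise the functions.
print(split_accession_version("AB000001.3"))
print(same_record("AB000001.1", "AB000001.3"))
```

Citing the accession alone points at the record; citing accession plus version pins down one exact sequence.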
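
Patterns written in this PROSITE-style notation map almost directly onto regular expressions. The converter below is a minimal sketch covering only the syntax used on this slide plus `{...}` exclusions: `x` (any residue), `[...]` (allowed residues), `(n)` or `(n,m)` repeats, and the `<` and `>` anchors. The function name and the test peptides are invented for the example.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    regex = ""
    if pattern.startswith("<"):      # "<" anchors the match at the N-terminus
        regex += "^"
        pattern = pattern[1:]
    anchor_end = pattern.endswith(">")  # ">" anchors it at the C-terminus
    if anchor_end:
        pattern = pattern[:-1]
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            regex += "."                         # any residue
        elif core.startswith("{"):
            regex += "[^" + core[1:-1] + "]"     # any residue except these
        else:
            regex += core                        # literal or [allowed set]
        if repeat:
            regex += "{" + repeat + "}"
    return regex + ("$" if anchor_end else "")


CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

class_i_re = re.compile(prosite_to_regex(CLASS_I))
class_ii_re = re.compile(prosite_to_regex(CLASS_II))

# Made-up peptides, not real protein sequences.
print(bool(class_i_re.search("MKPLLGLCLGAQALK")))
print(bool(class_ii_re.search("MCGIVG")))
```

The `<x(0,11)` prefix in the class II pattern requires the active-site cysteine to lie within the first twelve residues, which becomes `^.{0,11}` in the regular expression.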
  28. 28. Searching MMDB
  29. 29. Principles of structural alignment • Dali: http://www.ebi.ac.uk/dali/ Looks for minimal RMSD between Cα atoms. Calculates Cα–Cα distance matrices, then identifies the longest alignable segments • VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/ Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
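
The two quantities named for Dali, the Cα–Cα distance matrix and the RMSD, are easy to compute for toy data. The sketch below is not Dali or VAST: it performs no alignment or superposition and only illustrates the primitives, using three made-up coordinates. It needs Python 3.8 or later for `math.dist`.

```python
import math


def distance_matrix(coords):
    """All-against-all distances within one structure."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired atoms, as given.

    No superposition is done, so the value depends on the frame of reference.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Made-up "C-alpha" positions and the same points shifted by 1 along x.
structure = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in structure]

dm_a = distance_matrix(structure)
dm_b = distance_matrix(shifted)
max_matrix_diff = max(
    abs(p - q) for row_a, row_b in zip(dm_a, dm_b) for p, q in zip(row_a, row_b)
)

print(rmsd(structure, shifted), max_matrix_diff)
```

The shifted copy has a nonzero RMSD yet an essentially identical distance matrix, which shows why comparing internal distance matrices avoids having to superpose the two structures first.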
  30. 30. Dali alignment of Tyr phosphatase
  31. 31. VAST Structure Neighbors
  32. 32. Structure Summary Cn3D viewer VAST neighbors BLAST neighbors
  33. 33. Cn3D : Displaying Structures Chloroquine
  34. 34. Structure Neighbors
  35. 35. Use of structural alignments Chloroquine NADH
  36. 36. PDB • Protein Data Bank – Protein and NA 3D structures – Sequence present – YAFFF
  37. 37. UniProt • New protein sequence database that is the result of a merger of SWISS-PROT and PIR. It will be the annotated, curated protein sequence database. • Data in UniProt are primarily derived from coding sequence annotations in EMBL (GenBank/DDBJ) nucleic acid sequence data. • UniProt is a flat-file database just like EMBL and GenBank • The flat-file format is SwissProt-like, or EMBL-like
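
In EMBL-like and SwissProt-like flat files, each line starts with a two-letter line code, sequence lines are indented, and `//` ends a record. The parser below is a minimal sketch under those assumptions. The sample record is entirely hypothetical (its ID, accession, description, and sequence are invented), and real entries carry many more line types than are handled here.

```python
# Entirely hypothetical record in a SwissProt-like layout.
SAMPLE = """\
ID   EXAMPLE_PROT   Reviewed;   12 AA.
AC   X00001;
DE   Hypothetical example protein.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQRQI
//
"""


def parse_flat_file(text):
    """Return one dict per record: line code -> list of values, plus 'sequence'."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):          # record terminator
            if current:
                records.append(current)
            current = {}
            continue
        code, value = line[:2], line[5:].strip()
        if not code.strip():               # indented line: sequence data
            current["sequence"] = current.get("sequence", "") + value.replace(" ", "")
        else:
            current.setdefault(code, []).append(value)
    return records


entries = parse_flat_file(SAMPLE)
print(entries[0]["AC"], entries[0]["sequence"])
```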
  38. 38. Swiss-Prot
  39. 39. Swiss-Prot • SWISS-PROT incorporates: • Function of the protein • Post-translational modification • Domains and sites • Secondary structure • Quaternary structure • Similarities to other proteins • Diseases associated with deficiencies in the protein • Sequence conflicts, variants, etc.
