Biological Database Systems Denis Shestakov,  University of Turku/Tampere
Course Information Course structure: Lectures: approx. 12 (plus today’s intro and a review lecture at the end of the course) Project work: details will be given next time Exam: easy to pass if the project is done URL:
Course Information Dates: Period 2: 27.11, 4.12, 11.12 Period 3: 10 meetings on Mondays/Wednesdays Contact info: Email:  ICT, B6019: at 15-18 on Tuesdays
Course Information: Literature Slides References at the end of the slides Books: Bioinformatics: Managing Scientific Data by Lacroix & Critchlow, Morgan Kaufmann, 2003, ISBN-10: 155860829X Database System Concepts, 5th edition by Silberschatz, Korth & Sudarshan, McGraw-Hill, 2005, ISBN-10: 0072958863 Articles: Biological database design and implementation by Birney & Clamp (the Ensembl project), Briefings in Bioinformatics, 5(1):31-38, 2004
Biological Database Systems 1.1.  Course Content 1.2.  Course Objectives 1.3.  Database and DBMS 1.4.  Biological Databases
Course content: main topics Database concepts, database design process Relational data model Introduction to SQL XML and XML-based databases Data structures for biological data: storage and querying Model organism databases
Course content: main topics LIMS, BioPostgres Analysis workflows, web services Integration of biological data Integration of biological data, example of integration system Research issues in scientific databases * Project discussion, exam preparation
Course focus Database issues: Biology-specific Representation of biological data Design of biological databases NOT about: Usage of existing databases Accessing/retrieving data from bio-databases
Course goal Give basic knowledge of biological* database design (* for molecular biology)
Do you need to know that? Work in “wet” laboratory: One bioinformatician and many biologists Likely to be IT guru for others Expect to answer IT-related questions Work in bioinformatics lab: Many bioinformaticians Group may maintain several dbs Basics are helpful Create/maintain biological databases Start learning! Ask for more information
Database? From Merriam-Webster dictionary: (http://www.merriam-webster.com/dictionary/database)
Database? A collection of data: structured, searchable (i.e., indexable), updated, cross-referenced Objective: Transform “meaningless” raw data into useful information that can be accessed and analysed efficiently Database Management System (DBMS): software designed for the purpose of managing databases (access, insert, delete, update, etc.)
DBMS A set of tools that store, extract, and modify data. Diagram: USERS <-> DBMS (store, extract, modify) <-> Database
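The store, extract, and modify operations can be sketched with Python's built-in sqlite3 module, used here only as a convenient stand-in for a DBMS. The two-table schema, the index name, and the accession values are hypothetical and chosen purely for illustration; they do not come from the course material.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE
    );
    CREATE TABLE sequence (
        accession   TEXT PRIMARY KEY,              -- structured: typed columns
        organism_id INTEGER NOT NULL
                    REFERENCES organism(organism_id),  -- cross-referenced
        residues    TEXT NOT NULL
    );
    -- searchable: an index speeds up lookups by organism
    CREATE INDEX idx_sequence_organism ON sequence(organism_id);
""")

# Store: insert an organism and two made-up sequence records.
conn.execute("INSERT INTO organism VALUES (1, 'Escherichia coli')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("SEQ0001", 1, "ATGGCC"), ("SEQ0002", 1, "ATGAAATAG")],
)

# Modify: update one record and delete the other.
conn.execute(
    "UPDATE sequence SET residues = ? WHERE accession = ?",
    ("ATGGCCTAA", "SEQ0001"),
)
conn.execute("DELETE FROM sequence WHERE accession = ?", ("SEQ0002",))

# Extract: join the two tables to follow the cross-reference.
rows = conn.execute("""
    SELECT s.accession, o.name, LENGTH(s.residues)
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    ORDER BY s.accession
""").fetchall()
print(rows)
```

The users never touch the stored file directly; every operation goes through the DBMS interface, which is the point of the diagram.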
Biological Databases? Explosive growth in biological data E.g., tremendous increase in nucleotide sequences (first increase in data due to the polymerase chain reaction (PCR) technique development in 1983) 1980: 80 genes fully sequenced …
Biological Databases? EMBL Database Growth: Total nucleotides (Nov 07:  188,490,792,445 ) Number of entries (Nov 07:  106,144,026 )
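As a quick sanity check on these two figures, the average entry length can be worked out directly. This is only arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# EMBL figures for November 2007, as quoted on the slide.
total_nucleotides = 188_490_792_445
number_of_entries = 106_144_026

# Mean number of nucleotides per database entry.
mean_length = total_nucleotides / number_of_entries
print(f"{mean_length:.1f} nucleotides per entry on average")
```

The result is a little under 1,800 nucleotides per entry, so the bulk of the database consists of many relatively short records rather than a few complete genomes.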
Biological Databases? Data (genomic sequences, 3D structures, 2D gel analysis, microarrays, etc.) are submitted directly to databases Databases are essential tools for biological research, as important as reading the relevant literature
Biological Databases: History 1965: Margaret Dayhoff et al. publish “Atlas of Protein Sequence and Structure” 1982: EMBL initiates DNA sequence databases, followed within a year by GenBank and in 1984 by the DNA Database of Japan (DDBJ) 1988: EMBL/GenBank/DDBJ agree on a common format for data elements
Biological Databases: some statistics More than 1000 different databases 968 databases reported in The Molecular Biology Database Collection: 2007 update by Galperin, Nucleic Acids Research, 2007, Vol. 35, Database issue, D3-D4 Metabase: database of biological databases, http://biodatabase.org/index.php/Main_Page Database sizes: <100kB to >100GB (EMBL >500GB) DNA: >100GB Protein: 1GB 3D structure: 5GB Update frequency: daily to annually Freely accessible (as a rule)
Some databases in the field of molecular biology AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. Find more at http://biodatabase.org
Categories of Biological Databases Nucleotide sequences Genomics Mutation/polymorphism Protein sequences Protein domain/family Proteomics (2D gel, MS)
Categories of Biological Databases Microarray Organism-specific 3D structure Metabolism Bibliography Others
Biological Databases: special features Autonomous: many independent maintainers Heterogeneous data formats: e.g., various data formats for the same data elements Dynamic: frequent and continuous changes in data content (and, more importantly, in data schema) Broad domain knowledge Workflow-oriented: databases + rich set of analysis tools Information integration is essential: aggregate data from several databases
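The heterogeneity and integration points can be illustrated with a minimal sketch, assuming two hypothetical sources that deliver the same kind of sequence record in different formats: one as FASTA-style text, the other as tab-delimited rows. The record type, function names, and sample data are all made up for illustration and do not describe any particular real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceRecord:
    """Common (hypothetical) representation shared by all sources."""
    accession: str
    description: str
    residues: str


def _make_record(header, chunks):
    # Header layout assumed here: "<accession> <free-text description>"
    accession, _, description = header.partition(" ")
    return SequenceRecord(accession, description, "".join(chunks).upper())


def parse_fasta(text):
    """Parse FASTA-style text: a '>' header line followed by sequence lines."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def parse_tabular(text):
    """Parse rows of the assumed form: accession<TAB>description<TAB>residues."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        accession, description, residues = line.split("\t")
        records.append(SequenceRecord(accession, description, residues.upper()))
    return records


# Two made-up sources holding the same kind of data in different formats.
source_a = ">SEQ0001 hypothetical protein\nATGGCC\nTAA\n"
source_b = "SEQ0002\tanother hypothetical protein\tatgaaatag\n"

# Integration step: normalise both sources, then key records by accession.
integrated = {
    record.accession: record
    for record in parse_fasta(source_a) + parse_tabular(source_b)
}
print(integrated)
```

Each new source needs only its own parser that emits the common record type. Real integration systems must also cope with schema changes and conflicting entries, which this sketch ignores.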
Biological Databases: integration Figure is taken from  Bioinformatics: Managing Scientific Data  by Lacroix & Critchlow, p.20
