BIOLOGICAL DATABASE
By
KAUSHAL KUMAR SAHU
Assistant Professor (Ad Hoc)
Department of Biotechnology
Govt. Digvijay Autonomous P. G. College
Raj-Nandgaon ( C. G. )
SYNOPSIS
INTRODUCTION
HISTORY
WHAT ARE THE DATABASES?
WHY DATABASES?
THE “PERFECT” DATABASE
IDENTIFIERS AND ACCESSION NUMBERS
TECHNICAL DESIGN
MAINTENANCE OF BIOLOGICAL DATABASES
GENERAL FEATURES
SOURCES OF BIOLOGICAL DATA
DIFFERENT TYPES OF BIOLOGICAL DATABASE
FUNCTION
DATA ENTRY AND QUALITY CONTROL
AVAILABILITY
APPLICATION
DATA RECORDS IN THE YEAR 2004
CONCLUSION
REFERENCES
INTRODUCTION
Biological databases are libraries of life sciences information,
collected from scientific experiments, published literature,
high-throughput experiment technology, and computational analyses.
They contain information from research areas including genomics,
proteomics, metabolomics, and microarray gene expression.
HISTORY
• Margaret Dayhoff developed the first protein sequence database,
the Atlas of Protein Sequence and Structure, in 1965.
• The first protein structure prediction algorithm was
developed by Chou and Fasman in 1974.
• The 1980s saw the establishment of GenBank and the development
of fast database searching algorithms such as FASTA by William
Pearson, followed in 1990 by BLAST by Stephen Altschul and coworkers.
WHAT ARE THE DATABASES?
• Protein structures
  - Experiments
  - Models (homologues)
• Literature information
• Original DNA sequences (genomes)
• Protein sequences
  - Inferred
  - Direct sequencing
• Expressed DNA sequences (= mRNA sequences = cDNA sequences)
• Expressed Sequence Tags (ESTs)
WHY DATABASES?
THE “PERFECT” DATABASE
1. Comprehensive, but easy to search
2. Annotated, but not “too annotated”
3. A simple, easy to understand structure
4. Cross-referenced
5. Minimum redundancy
6. Easy retrieval of data
IDENTIFIERS AND ACCESSION NUMBERS
• Identifier: a string of letters and digits that is generally
“understandable” (human-readable).
• Example: TPIS_CHICK (triose phosphate isomerase from chicken,
Gallus gallus) in Swiss-Prot.
• Accession code: a string of letters and digits that
uniquely identifies an entry in its database.
• The accession number for TPIS_CHICK in Swiss-Prot is P00940.
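The difference between the two kinds of label can be sketched in Python. This is a minimal illustration, not UniProt's own tooling: the names `classify` and `split_entry_name` are hypothetical, the accession pattern follows the format UniProt documents for its accession numbers, and the identifier pattern is a simplified assumption based on Swiss-Prot entry names such as TPIS_CHICK.

```python
import re

# Accession pattern as documented by UniProt for its accession numbers.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Simplified assumption: protein mnemonic, underscore, species mnemonic.
ENTRY_NAME_RE = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")


def classify(token):
    """Return 'accession', 'identifier', or 'unknown' for a label."""
    if ACCESSION_RE.fullmatch(token):
        return "accession"
    if ENTRY_NAME_RE.fullmatch(token):
        return "identifier"
    return "unknown"


def split_entry_name(entry_name):
    """Split an identifier into (protein mnemonic, species mnemonic)."""
    protein, species = entry_name.split("_", 1)
    return protein, species


if __name__ == "__main__":
    for label in ("TPIS_CHICK", "P00940"):
        print(label, "->", classify(label))
```

The identifier carries meaning a person can read (protein and species), while the accession is an opaque key that is meant to stay stable as a lookup handle.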
TECHNICAL DESIGN
• Flat files
• Relational database (SQL)
• Object-oriented database
1. Flat files
All names and information are present in a single file:
name, subject name, subject number, etc.
STUDENT  NAME     STATE         STUDENT  SUBJECT      SUBJECT     SUBJECT NAME
1        NEERAJ   SHIMLA        1        GENETICH777  GENETIC777  GENETIC ENGINEERING777
2        ADITYA   CHHATTISGARH  2        MOLBIO654    MOLBIO654   MOLECULAR BIOLOGY654
3        AMIT     KASHMIR       3        MICRO615     MICRO615    MICROBIOLOGY615
4        BHARTI   BILASHPUR     4        BIOCHE575    BIOCHE575   BIOCHEMISTRY575
5        RUCHI    MAHASAMUND    5        INSTRU551    INSTRU551   INSTRUMENT551
6        SUNAINA  RAIGARH       6        BIOSTA544    BIOSTA544   BIOSTATISTICS544
7        ARCHANA  JAGADALPUR    7        ENVIR541     ENVIR541    ENVIRONMENTAL 541
2. Relational database (SQL)
Relation = table. It consists of a heading (a fixed set of
attributes) and a body of tuples.
Attribute = column; tuple = row.
Primary key = unique identifier: an attribute or combination
of attributes that uniquely identifies each tuple.
3. Object-oriented database (hierarchical relationships
between data items)
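The relational terms above can be tried out with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the table name `student`, the column names, and the helper functions are hypothetical, and the rows are the first three student records from the flat-file slide.

```python
import sqlite3

# First three rows of the student table from the slide: (number, name, state).
STUDENTS = [
    (1, "NEERAJ", "SHIMLA"),
    (2, "ADITYA", "CHHATTISGARH"),
    (3, "AMIT", "KASHMIR"),
]


def build_db():
    """Create an in-memory relation whose primary key is student_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student ("
        " student_id INTEGER PRIMARY KEY,"  # uniquely identifies each tuple
        " name TEXT NOT NULL,"
        " state TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)", STUDENTS)
    return conn


def lookup_state(conn, student_id):
    """Fetch one attribute of the tuple identified by the primary key."""
    row = conn.execute(
        "SELECT state FROM student WHERE student_id = ?", (student_id,)
    ).fetchone()
    return row[0] if row else None


def insert_is_rejected(conn, row):
    """Return True if the database refuses a row with a duplicate key."""
    try:
        conn.execute("INSERT INTO student VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    db = build_db()
    print(lookup_state(db, 2))
    print(insert_is_rejected(db, (1, "DUPLICATE", "NOWHERE")))
```

Unlike a flat file, the database itself enforces that no two tuples share a primary key, so a second row numbered 1 is refused rather than silently stored.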
MAINTENANCE OF BIOLOGICAL DATABASES
• Large public institutions funded by government
(EMBL, NCBI).
• Quasi-academic institutes (Swiss Institute of
Bioinformatics, TIGR).
• Academic groups or individual scientists.
• Commercial companies.
• TCAG (2017)
Biological databases are an important tool
in assisting scientists to understand and
explain a host of biological phenomena.
Biological knowledge is distributed among
different general and specialized databases.
SOURCES OF BIOLOGICAL DATA
[Figure: sources of sequence data at NCBI]
• GenBank: sequences (e.g. TATAGCCG, AGCTCCGATA, CCGATGACAA)
deposited by sequencing centers and labs; updated ONLY by submitters.
• RefSeq, UniGene, Genome Assembly: derived by curators and
algorithms; updated continually by NCBI.
DIFFERENT TYPES OF BIOLOGICAL DATABASE
• Nucleotide sequence databases
• Protein sequence databases
• Genome databases
• Protein structure databases
• Protein structure classification databases
• Microarray and gene expression databases
• Immunological databases
• Metabolic pathway databases
Nucleotide sequences: 10,378,022; 11,302,156,937
Immunological databases: IMGT, MHCPEP
FUNCTION
• Make biological data available to scientists.
• Consolidate data (gather data from different sources).
• Provide access to large datasets that cannot be published
explicitly (genome, …).
• Make biological data available in a computer-readable format.
• Make data accessible for automated analysis.
DATA ENTRY AND QUALITY CONTROL
• Scientists (teams) deposit data directly.
• Appointed curators add and update data.
• Are erroneous data removed or marked?
• Type and degree of error checking.
• Consistency, redundancy, conflicts, updates.
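Two of the checks listed above, redundancy and basic error checking, can be sketched in Python. This is an illustration only, not how any particular database performs curation: the function names, the record labels REC1 to REC4, and the allowed alphabet are all assumptions, and the sequences are borrowed from the sources slide.

```python
from collections import defaultdict


def find_redundant(records):
    """Group accessions whose sequences are identical after normalization.

    records: iterable of (accession, sequence) pairs.
    Returns a list of accession groups that share one sequence.
    """
    by_sequence = defaultdict(list)
    for accession, sequence in records:
        by_sequence[sequence.strip().upper()].append(accession)
    return [group for group in by_sequence.values() if len(group) > 1]


def find_invalid(records, alphabet="ACGTN"):
    """Return accessions whose sequence is empty or has unexpected letters."""
    allowed = set(alphabet)
    invalid = []
    for accession, sequence in records:
        normalized = sequence.strip().upper()
        if not normalized or set(normalized) - allowed:
            invalid.append(accession)
    return invalid


if __name__ == "__main__":
    # Hypothetical record labels; sequences taken from the sources slide.
    records = [
        ("REC1", "TATAGCCG"),
        ("REC2", "tatagccg"),
        ("REC3", "AGCTCCGATA"),
        ("REC4", "CCGATGAXAA"),
    ]
    print(find_redundant(records))
    print(find_invalid(records))
```

A check like this only marks suspect entries; whether they are removed or flagged is the policy question the slide raises.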
AVAILABILITY
• Publicly available, no restrictions.
• Available, but with copyright.
• Accessible, but not downloadable.
• Academic, but not freely available.
• Proprietary, commercial; possibly free for academics.
APPLICATION
• Sequence comparison
• Evolutionary relationships between genes
• Gene expression comparison
• Primer design
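Two of these applications, sequence comparison and primer design, can be illustrated with a small Python sketch. It is deliberately naive and is not how FASTA or BLAST work: `percent_identity` compares equal-length sequences position by position without alignment, and `wallace_tm` uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a rough estimate for short oligonucleotides. All function names are assumptions; the sequences come from the sources slide.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def percent_identity(a, b):
    """Percentage of matching positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence, as needed for a reverse primer."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def wallace_tm(primer):
    """Rough melting temperature in degrees C using the Wallace rule."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    print(percent_identity("AGCTCCGATA", "CCGATGACAA"))
    print(reverse_complement("TATAGCCG"))
    print(wallace_tm("TATAGCCG"))
```

Real database searches align sequences and score gaps and substitutions, so this position-by-position comparison should be read only as the simplest possible starting point.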
DATA RECORDS IN THE YEAR 2004
Nucleotide records                          36,653,899
Protein sequences                            4,436,362
3D structures                                   19,640
Interactions & complexes                        52,385
Human UniGene clusters                         118,517
Maps and complete genomes                        6,948
Different taxonomy nodes                       283,121
Human dbSNP                                 13,179,601
Human RefSeq records                            22,079
bp in human contigs > 5,000 kb (116)     2,487,920,000
PubMed records                              12,570,540
OMIM records                                    15,138
~11,000/day from different sources
REFERENCES
• Jin Xiong, Essential Bioinformatics.
• Dr. Jayaram Reddy, Centre for Molecular and Computational
Biology, St. Joseph’s College, Bangalore (pdf).
• Qi Sun, Computational Biology Service Unit, Cornell University (pdf).
• francis@bioinformatics.ubc.ca
• Wikipedia
• www.ncbi.nlm.nih.gov/
