Database in bioinformatics

Database
 A computerized archive used to store and organize data in such
a way that information can be retrieved easily.
 A database is a repository of information with a specific
structure that enables the entry and extraction of data.
 In general, this structure consists of files or tables,
each containing numerous records and fields.

Continued
 A database system (DBS) is an integrated collection of related files
together with the details of their definition, interpretation,
manipulation and maintenance.
 A database system is built around its data and is run by software
called a DBMS (Database Management System).
 A database system protects the data from unauthorized access.
 A database management system (DBMS) is a collection of programs
that enables users to create and maintain a database.

Database management systems
 Database management systems provide several functions in
addition to simple file management:
 control security
 maintain data integrity
 provide for backup and recovery
 control redundancy
 allow data independence
 provide non-procedural query language
 perform automatic query optimization
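
Three of the functions listed above (data integrity, a non-procedural query language, and query optimization) can be seen in a small sketch. It assumes SQLite, through Python's built-in sqlite3 module, as a stand-in DBMS; the table, column names and DEMO accessions are hypothetical and do not come from any real biological database.

```python
import sqlite3

# In-memory SQLite database; table, columns and accessions are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sequence (
        accession  TEXT PRIMARY KEY,              -- uniqueness enforced by the DBMS
        organism   TEXT NOT NULL,
        seq_length INTEGER CHECK (seq_length > 0)
    )
    """
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("DEMO0001", "Escherichia coli", 1200), ("DEMO0002", "Homo sapiens", 850)],
)

# Data integrity: the DBMS itself rejects a duplicate accession.
try:
    conn.execute("INSERT INTO sequence VALUES ('DEMO0001', 'Mus musculus', 300)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Non-procedural query: the statement says WHAT is wanted, not HOW to scan the table.
long_entries = conn.execute(
    "SELECT accession FROM sequence WHERE seq_length > ? ORDER BY accession",
    (1000,),
).fetchall()

# Query optimization: ask SQLite which access plan it chose for a lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sequence WHERE accession = 'DEMO0001'"
).fetchall()

print(duplicate_rejected, long_entries)
for row in plan:
    print(row[-1])  # human-readable description of the plan step
```

The exact wording of the plan depends on the SQLite version, so it is printed rather than compared against a fixed string.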

Organisation
 Biological data are commonly organised as:
 flat files
 relational databases
Flat-file databases
 The simplest form of a database,
 in which collections of data, such as nucleotide and amino
acid sequences, are stored in a single large text file.
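
A flat-file database can be read with nothing more than line-by-line text processing. The sketch below assumes the FASTA layout as one common flat-file format (the slides do not name a specific format); the identifiers and sequences are made up for illustration.

```python
from io import StringIO

# A tiny made-up flat file in FASTA layout (identifiers and sequences are illustrative).
FLAT_FILE = """>demo_seq_1 hypothetical nucleotide entry
ATGGCGTACG
TTAGC
>demo_seq_2 hypothetical protein entry
MKTAYIAKQR
"""


def read_fasta(handle):
    """Return {identifier: sequence} from a FASTA-formatted text stream."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first word after '>' is taken as the identifier.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: may be one of several lines for the same entry.
            records[current_id].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


records = read_fasta(StringIO(FLAT_FILE))
print(records)
```

Every lookup in such a file requires scanning the text, which is one reason larger collections tend to move to relational storage.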

Continued
Relational databases
 A relational database treats all of its data as a collection of
relations.
 It stores the data in a number of tables.
 Each table consists of records and fields (rows and
columns).
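
As a minimal sketch of the relational model, the example below stores organisms and sequences in two related tables and joins them in a query. It assumes SQLite via Python's sqlite3 module; the schema, accessions and sequences are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each table holds records (rows) made of fields (columns).
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE sequence (
        accession   TEXT PRIMARY KEY,
        organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
        residues    TEXT NOT NULL
    );
    INSERT INTO organism VALUES (1, 'Escherichia coli'), (2, 'Homo sapiens');
    INSERT INTO sequence VALUES
        ('DEMO0001', 1, 'ATGGCGTACG'),
        ('DEMO0002', 2, 'ATGAAATTTG'),
        ('DEMO0003', 1, 'ATGCCC');
    """
)

# The relation between the two tables is followed with a JOIN on organism_id.
rows = conn.execute(
    """
    SELECT o.name, s.accession, LENGTH(s.residues)
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    WHERE o.name = ?
    ORDER BY s.accession
    """,
    ("Escherichia coli",),
).fetchall()
print(rows)
```

Storing the organism name once and referring to it by key is what lets a relational design avoid repeating the same text in every sequence record.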

Types of Database
 The databases can be classified into three
categories on the basis of the information
stored.
 They are Primary, Secondary and
Composite databases.
 Primary databases contain data that is
derived experimentally.
 They usually store information related to
the sequences or structures of biological
components
 They can be further divided into protein or
nucleotide databases

Primary Database
 These databases contain the raw nucleic acid sequence data
that are produced and submitted by researchers worldwide.
 NCBI (National Center for Biotechnology Information)
 GenBank
 DDBJ (DNA Data Bank of Japan)
 Swiss-Prot
 PIR (Protein Information Resource)
 PDB (Protein Data Bank)
 TrEMBL (Translated EMBL)
Protein databases: PIR, MIPS, Swiss-Prot, TrEMBL

Secondary Databases
 contain information derived from primary databases.
 store information such as conserved sequences, active
site residues, and signature sequences. Classifications derived
from Protein Data Bank structures are also stored in secondary databases.
Examples include:
 Class, Architecture, Topology, Homologous superfamily (CATH),
 Kyoto Encyclopedia of Genes and Genomes (KEGG),
 Protein Families (Pfam)
 and Structural Classification of Proteins (SCOP)

Composite Databases
 are collections of several primary database resources.
 provide users with various tools and software for data analysis.
 NCBI, being a composite database, stores a large number of
nucleotide and protein sequences on its servers and therefore suffers
from high redundancy in the deposited data.
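
Redundancy arises when the same sequence reaches a composite database from more than one source. A minimal sketch of collapsing exact duplicates is shown below; the source names, identifiers and sequences are hypothetical, and real non-redundant collections use more elaborate criteria than exact matching.

```python
from collections import defaultdict

# Hypothetical entries pooled from several sources: (source, identifier, sequence).
pooled = [
    ("source_A", "A001", "ATGGCGTACG"),
    ("source_B", "B417", "atggcgtacg"),  # same sequence, different letter case
    ("source_A", "A002", "ATGAAATTTG"),
    ("source_C", "C090", "ATGGCGTACG"),
]


def collapse_redundant(entries):
    """Group identifiers that share an identical (case-insensitive) sequence."""
    groups = defaultdict(list)
    for source, identifier, sequence in entries:
        groups[sequence.upper()].append(f"{source}:{identifier}")
    return dict(groups)


non_redundant = collapse_redundant(pooled)
print(len(pooled), "entries ->", len(non_redundant), "unique sequences")
```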

Biological databases
 Biological databases can be broadly classified into
 sequence databases,
 structure databases,
 and pathway databases.
 Sequence databases cover both nucleic acid sequences
and protein sequences, whereas structure databases deal
mainly with proteins.

Sequence databases
 Nucleotide and protein sequence databases are the most
widely used and among the best-established biological
databases.
 They serve as repositories for wet-lab results and as the primary
source of experimental results.
 The major public nucleotide data banks are
 GenBank in the USA,
 EMBL (European Molecular Biology Laboratory) in Europe,
 and DDBJ (DNA Data Bank of Japan) in Japan.

Continued
 Protein databases include
 ExPASy
 UniProt
 PIR
 PDB
 Swiss-Prot
 TrEMBL

NATIONAL CENTER FOR BIOTECHNOLOGY
INFORMATION (NCBI)
 Established in 1988 at the National Institutes of Health (NIH).
 Part of the National Library of Medicine (NLM) at the National
Institutes of Health.
 Provides access to a large amount of biomedical and genomic
information (www.ncbi.nlm.nih.gov/home/about/mission.shtml).
 It maintains a large number of databases and bioinformatics
tools as well as services.
 One of its most popular databases is GenBank.

Continued
Mission or role
 To develop new techniques and methodologies for dealing
with huge and complex data
 and to provide better access to analytical and computational
tools.
 To maintain biological databases, whether primary or
secondary,
 including GenBank.
 NCBI provides data retrieval systems such as Entrez.
 It provides computational resources for the analysis of GenBank
data and other biological data.

Continued
Resources
 The resources available on this site can be divided
into two major categories:
 1) databases
 2) tools

 The major databases maintained at NCBI are
 GenBank and PubMed (bibliographic database for biomedical literature).
 Other databases include
 Gene,
 Genome,
 Epigenomics,
 Gene Expression,
 RefSeq,
 Structure,
 Database of Short Genetic Variation (dbSNP),
 Taxonomy, etc.

TOOLS at NCBI
 NCBI also provides a variety of tools for database searching.
 Entrez is the search engine of NCBI.
 Other tools include
 Genomes Browser,
 BLAST,
 CDTree,
 Genetic Codes,
 Open Reading Frame Finder (ORF Finder),
 SNP Database Specialized Search Tools.
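
Entrez can also be queried programmatically through NCBI's E-utilities web interface. The sketch below only builds the request URLs for a search (ESearch) and a record download (EFetch) and does not contact the server. The helper function names, the query term and the accession are illustrative choices, not part of the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (the programmatic interface to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """URL for an ESearch request: returns identifiers matching a query term."""
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def efetch_url(db, ids, rettype="gb", retmode="text"):
    """URL for an EFetch request: returns the records for the given identifiers."""
    return f"{EUTILS_BASE}/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )


# Illustrative query and accession; any valid Entrez term or identifier could be used.
search = esearch_url("nucleotide", "Escherichia coli[Organism] AND lacZ[Gene]")
fetch = efetch_url("nucleotide", ["U49845"])
print(search)
print(fetch)
```

Passing either URL to an HTTP client would perform the actual request; NCBI's usage guidelines on request rates apply in that case.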

GenBank
 GenBank (Genetic Sequence Databank)
 GenBank® is the genetic sequence database at the National Center for
Biotechnology Information (NCBI).
 It was established in 1982 and is now maintained by the
National Center for Biotechnology Information (NCBI).
 It contains publicly available nucleotide sequences.
 DNA sequences can be submitted to GenBank by several different
methods:
 BankIt: a web-based form for submitting a small number of
sequences
 Sequin: more appropriate for complicated submissions containing
many sequences

Structure of GenBank
 A nucleotide sequence record in this database has the
following structure:
 1. Locus: A title given by GenBank itself to name the
sequence entry. It includes the following:
 a. Locus Name: Similar to the accession number for the sequence.
 b. Sequence Length: The number of bases in the sequence.

Continued
 c. Molecule Type: Identifies the type of nucleic acid sequence.
The various types are mRNA (which is presented as cDNA), rRNA,
snRNA, and DNA.
 d. GB Division: The GenBank division to which the record
belongs, according to GenBank's classification criteria.
 e. Modification Date: The date on which the record was last modified.

 2. Definition: The name (a brief description) of the
nucleotide sequence.
 3. Accession: This covers the accession number,
accession version, and GI number.
 The accession number is the unique identifier associated with each
nucleotide sequence record in the database.
 4. Version: An identification number assigned
to a single, specific sequence in the database, in the format
“accession.version”.
 5. GI: Also a sequence identification number. Whenever a sequence
is changed, the version number is increased and a new GI is assigned.

 6. Keywords: Defined words used to index the entries.
 7. Source: The organism from which the sequence was obtained.
 8. Organism: The scientific name (usually genus and species) and
phylogenetic lineage.
 9. Reference: Citations of publications by the sequence authors,
including the journal in which the sequence was reported.

 10. Features: Information derived from the sequence, such as
 biological source,
 exon,
 intron,
 promoters,
 CDS,
 alternate splice.
 The Base Count and Origin fields follow the features.
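
The fields described above can be pulled out of a GenBank-style flat file with simple text processing. The sketch below uses a made-up record (DEMO0001 is not a real accession) and a deliberately simplified parser: it ignores multi-line fields and the feature table, so it only illustrates the layout.

```python
# A made-up record in GenBank flat-file layout (not a real database entry).
SAMPLE = """\
LOCUS       DEMO0001                  60 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical demonstration sequence.
ACCESSION   DEMO0001
VERSION     DEMO0001.2
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atggcgtacg ttagcatgcc cgggtaaatg gcgtacgtta gcatgcccgg gtaaatggcg
//
"""


def parse_genbank(text):
    """Extract the LOCUS details, version and sequence from one simple record."""
    fields, sequence_parts, in_origin = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        parts = line.split(None, 1)
        if not parts:
            continue
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            if len(parts) > 1:
                sequence_parts.append(parts[1].replace(" ", ""))
            continue
        if parts[0] == "ORIGIN":
            in_origin = True
        else:
            fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""

    tokens = fields["LOCUS"].split()
    accession, _, version = fields["VERSION"].partition(".")
    return {
        "locus_name": tokens[0],
        "length": int(tokens[1]),
        "molecule_type": tokens[3],
        "division": tokens[-2],
        "modification_date": tokens[-1],
        "accession": accession,
        "version": int(version),
        "organism": fields["ORGANISM"],
        "sequence": "".join(sequence_parts).upper(),
    }


record = parse_genbank(SAMPLE)
print(record["locus_name"], record["length"], record["division"], record["version"])
```

For real records, an established parser such as the one in Biopython would be a safer choice than hand-written splitting.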

European Molecular Biology Laboratory
(EMBL)
 The EMBL Nucleotide Sequence Database is maintained by the
European Bioinformatics Institute (EBI), UK.
 EMBL itself was founded in 1974.
 EBI develops and maintains a large number of databases, and
scientists can access the data free of cost.
 This database serves as the primary source of nucleotide
sequences for Europe.
 Nucleotide sequence data generated by large-scale
genome-sequencing projects, as well as sequences from the
European Patent Office, can be submitted to this database.

Continued
 Data collection is done in collaboration with GenBank
(USA) and the DNA Data Bank of Japan (DDBJ).
 Other genomic databases held at EBI are
 Ensembl (a database of genome annotation)
 Genome Reviews.
 Daily releases of the database contain new
submissions and updated sequence data,
 while the entire database is released every 3 months.

DDBJ
 DDBJ (DNA Data Bank of Japan) is a biological database
that collects DNA sequences submitted by researchers.
 It is run by the National Institute of Genetics, Japan.
DDBJ Flat File Format
 Data submitted to DDBJ are managed and retrieved
according to the DDBJ format (flat file).
 The flat file includes the sequence together with information on
who submitted the data, references, source organisms,
features, etc.

Ensembl Genome Database
 Ensembl is one of several well-known genome browsers for the
retrieval of genomic information from many organisms,
including humans, plants, bacteria and animals.
 Created and maintained by the EBI and the Sanger Centre (UK).

Databases for green plants
 Three comparative genomic databases
for green plants are
 GreenPhylDB,
 PLAZA,
 Phytozome.
 These databases aim to support genomic studies
of plant evolution and
 to provide comparative data on genomes and gene
families, along with tools for their analysis.

Continued
 They provide information on
 the genomic context of plant genes,
 gene homologues and paralogues,
 RNA transcripts from the given genes,
 peptide sequences, and
 functions of gene families.
 They allow access to the complete genome sequences available in the
databases.

Protein Databases
Swiss-Prot
 Swiss-Prot is a protein sequence and knowledge database.
 It is well known for its high-quality annotation, use of
standardized nomenclature, and links to specialized databases.
 Its repository contains the amino acid sequence, the protein
name and description, taxonomic data, and citation information.
Pfam
 A database of protein families, Pfam contains annotations as
well as multiple sequence alignments generated using hidden
Markov models.

Continued
 TrEMBL: The European Bioinformatics Institute, in collaboration with
Swiss-Prot, introduced another database, TrEMBL (translation of the EMBL
nucleotide sequence database).
 This database consists of computer-annotated entries obtained from
the translation of all coding sequences in the nucleotide databases.
 PIR: The Protein Information Resource (PIR) is an integrated public
bioinformatics resource that supports genomic and proteomic
research and scientific studies.
 PIR serves the scientific community through online access and
by performing offline sequence identification services for researchers.
 It is a database of freely accessible protein sequences that contains
high-quality data and functional information for the proteins.
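
TrEMBL entries, as noted above, come from translating coding sequences found in the nucleotide databases. The core of that step can be sketched as a codon-by-codon translation with the standard genetic code. The input sequence is made up, and the real TrEMBL pipeline involves far more than this (annotation, alternative genetic codes, quality checks).

```python
from itertools import product

# Standard genetic code, written compactly: codons enumerated with bases in the
# order T, C, A, G at each position; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up coding sequence: ATG GCC AAA TGG TAA
print(translate_cds("ATGGCCAAATGGTAA"))  # prints MAKW
```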

Structure databases
Several structure databases exist, including the
Protein Data Bank (PDB)
 Important in solving real problems in molecular biology.
 Established in 1971 at Brookhaven National
Laboratory (BNL).
 It contains structural information on macromolecules
determined by X-ray crystallography and NMR methods.
 PDB is maintained by the Research Collaboratory for
Structural Bioinformatics (RCSB).

Continued
 PROSITE is a database of protein domains and families.
 It contains biologically significant sites, patterns
and profiles that help to reliably identify the known
protein family to which a new sequence belongs.
 CATH: The CATH database (Class, Architecture, Topology,
Homologous superfamily) is a hierarchical classification of
protein domain structures, which clusters proteins at four
major structural levels.
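
A PROSITE pattern is a compact signature that can be matched against a new sequence. The sketch below converts a pattern written in PROSITE syntax into a Python regular expression and searches a made-up sequence. The pattern is used purely as an illustration, and the converter covers only the basic syntax elements (x, [..], {..}, repetition counts, and the < and > anchors).

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            core = "."                      # x: any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..}: any residue except these
        # [..] groups and single letters are already valid regex syntax.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


# Illustrative pattern in PROSITE syntax and a made-up sequence that contains it.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
sequence = "MK" + "CAAC" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
hit = re.search(regex, sequence)
print(regex)
print(hit.start() if hit else None)
```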

Pathway databases
 A pathway database is a database that describes
biochemical pathways, reactions, and enzymes.
 Some examples of pathway databases are
 KEGG (Kyoto Encyclopedia of Genes and Genomes),
 BRENDA,
 BioCyc.

Continued
 KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the
primary resource of the Japanese GenomeNet service.
 It is a collection of online databases dealing with genomes, enzymatic
pathways, and biological chemicals.
 KEGG contains three databases: PATHWAY, GENES, and LIGAND.
 The PATHWAY database stores computerized knowledge on molecular
interaction networks.
 The GENES database contains data concerning sequences of genes and
proteins generated by the genome projects.
 The LIGAND database holds information about the chemical compounds and
chemical reactions that are relevant to cellular processes.

 BioCyc: The BioCyc Database Collection is a compilation of
pathway and genome information for different organisms.
 It includes two other databases:
 EcoCyc, which describes Escherichia coli K-12;
 MetaCyc, which describes pathways for more than 300
organisms.
