DATABASES...............................pptx

DATABASE
 Information related to a particular topic or subject is called data.
 A database is a computerized archive used to store and organize data in such a way that information can be
retrieved easily via a variety of search criteria.
 Computerized databases offer many facilities and utilities:
 It is easy to search and obtain required information.
 Redundancy of data can be reduced. This also avoids inconsistencies in the data, since any change to the data
need not be carried out at several places in the database.
 The data can be shared more easily because a database may be accessed by several users simultaneously.
 The data can be authenticated and standards can be enforced more easily.
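
The point about reduced redundancy can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the table names, column names and accession values are invented for illustration only. The organism name is stored once and referenced by id, so a correction is made in a single place.

```python
import sqlite3

# Hypothetical schema: each organism name is stored once and referenced by id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sequence (
        accession   TEXT PRIMARY KEY,
        organism_id INTEGER NOT NULL REFERENCES organism(id)
    );
    INSERT INTO organism VALUES (1, 'E. coli');
    INSERT INTO sequence VALUES ('SEQ001', 1), ('SEQ002', 1), ('SEQ003', 1);
""")

# A single UPDATE corrects the name for every sequence that refers to it.
conn.execute("UPDATE organism SET name = 'Escherichia coli' WHERE id = 1")

rows = conn.execute("""
    SELECT s.accession, o.name
    FROM sequence AS s
    JOIN organism AS o ON o.id = s.organism_id
    ORDER BY s.accession
""").fetchall()

for accession, name in rows:
    print(accession, name)
```

Had the organism name been copied into every sequence row, the same correction would need one change per row, which is where inconsistencies tend to creep in.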

BIOLOGICAL DATABASE
A collection of biological data arranged in computer readable form that enhances the speed of search and retrieval
and is convenient to use is called a biological database. A range of information collected from scientific
experiments, published literature, information regarding biological sequences, structures, binding sites, metabolic
interactions, functional relationships, protein families, motifs (a short conserved region in a DNA sequence or
protein) and homologs (biological molecules related to one another by divergent evolution from a common
ancestor) etc., can be retrieved from these databases. They link knowledge obtained from various fields of biology
and medicine.
 Biological databases are of the following types:
1. Primary database
2. Secondary database
3. Composite database

PRIMARY DATABASES
Primary databases store raw experimental data and
contain only sequence or structure information. The
different types of primary databases are:

1. Primary nucleic acid databases
 They hold the experimentally determined nucleotide sequence information, together with the protein
sequence inferred from the conceptual translation of these nucleotide sequences.
 These are sequences submitted directly by scientists and genome sequencing groups, and sequences
taken from literature and patents.
 The three primary nucleotide sequence databases are the Nucleotide Sequence Database maintained
by EMBL, GenBank and DDBJ. Together these three make up the International Nucleotide Sequence
Database Collaboration.
 Database entries are exchanged daily between these three primary nucleotide databases, so
the three function as a virtually unified database called INSD (International Nucleotide Sequence
Database).
 These databases can be used without any legal restrictions.
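
The conceptual translation mentioned above, in which a protein sequence is inferred from a nucleotide sequence, can be sketched as follows. This is a minimal illustration that assumes the standard genetic code, a single forward reading frame starting at the first base, and termination at the first stop codon; the function name `translate` is chosen here for illustration.

```python
# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1 and stop at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as X.
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Real annotation pipelines also handle other reading frames, alternative genetic codes and partial coding regions, which this sketch leaves out.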

a) GenBank
 Is a public database of all known nucleotide and protein sequences with supporting bibliographic and
biological annotation.
 Is built and maintained by NCBI.
 Besides sequence data, GenBank files contain information such as accession numbers, gene names,
phylogenetic classification and references to published literature.
 Data may be submitted using BankIt (a web-based submission tool), Sequin (NCBI’s stand-alone
submission software) or the Barcode Submission Tool (another web-based submission tool).
 Retrieval of data is through the Entrez system, a database retrieval system that helps access the database entries.
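
Entrez can also be queried programmatically through NCBI's E-utilities web interface. The sketch below only builds the request URL for one GenBank record and does not contact the server. The helper name `genbank_fetch_url` is hypothetical, and the choice of the `nuccore` database with plain-text GenBank output is an assumption made for this example.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "efetch" service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession, rettype="gb"):
    """Build an efetch URL that requests one nucleotide record as plain text."""
    params = {
        "db": "nuccore",      # nucleotide collection
        "id": accession,      # accession number of the record
        "rettype": rettype,   # "gb" for GenBank flat file, "fasta" for FASTA
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


print(genbank_fetch_url("U49845"))
```

Opening the printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the record as text, subject to NCBI's usage guidelines.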

b) EMBL (European Molecular Biology Laboratory)
 Constitutes Europe’s primary nucleotide sequence resource.
 The data originates from a combination of large-scale genome sequencing projects, direct submissions
from individual scientists and the European Patent Office.
 There is a quarterly release of the whole database while new and updated records are distributed daily.
 EMBL database entries are grouped into divisions based mainly on taxonomy, with a few exceptions such as the
new HTG (High-Throughput Genome Sequences) and GSS (Genome Survey Sequences) divisions, for
which grouping is based on the specific nature of the underlying data. The divisions thus provide subsets of
the database that reflect the areas of interest of many users. The EMBL database currently consists of 17
divisions, with each entry belonging to exactly one division.
 The database can be accessed, and sequences retrieved, via the EBI SRS (Sequence
Retrieval System) server, the FTP server, or Dbfetch (database fetch), a tool for simple sequence
retrieval via HTTP.

c) DDBJ (DNA Data Bank of Japan)
 Is the only nucleotide sequence databank in Asia certified to collect nucleotide sequences from
researchers and to issue the internationally recognized accession number to data submitters.
 It collects sequence data mainly from Japanese researchers.
 The principal purpose of DDBJ operations is to improve the quality of INSD, i.e. when researchers
make their data open to the public through INSD, scientists at DDBJ try to describe
the data as richly as possible, according to the unified rules of INSD.
 To submit their data, Japanese genome teams use the Mass Submission Tool (MST).

2. Primary protein sequence databases
They contain entries which describe protein domains, families and
functional sites. They also contain associated patterns and profiles to
identify protein domains and families.
Swiss-Prot, TrEMBL (translated EMBL) and PIR (Protein
Information Resource) are the primary protein databases and are
different from the nucleotide databases. These databases are curated,
i.e., they are created and maintained by groups of scientists.

Swiss-Prot
 Swiss-Prot tries to provide a high level of annotation (such as the description of the function of a
protein, its domain structure, post-translational modifications, variants, etc.) and a minimum level of
redundancy. It has a high level of integration with other databases.
 A Swiss-Prot entry contains a large number of annotations. Each line begins with a two-letter code, many of
which are self-explanatory, e.g. ID (identification), AC (accession number), DT (date), DE (description), GN
(gene name), CC (comment), etc.
 Swiss-Prot not only presents a fairly comprehensive description of the protein and its functions but also
provides cross-references to the relevant entries in secondary databases such as PROSITE, PRINTS
and Pfam.
 The Swiss-Prot database has some legal restrictions. The entries themselves are copyrighted, but freely
accessible and usable by academic researchers. Commercial companies must pay a license fee to use
Swiss-Prot.
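
The two-letter line codes described above make the flat-file format easy to process. The sketch below groups the lines of an entry by their code. The entry text is invented and heavily shortened for illustration, it is not a real record, and the function name `group_by_line_code` is hypothetical.

```python
from collections import defaultdict

# Invented, shortened entry used only to show the layout: a two-letter code,
# three spaces, then the content. "//" marks the end of the entry.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;         120 AA.
AC   P00000;
DE   RecName: Full=Example protein;
GN   Name=EXA1;
CC   -!- FUNCTION: Hypothetical function text.
//
"""


def group_by_line_code(entry_text):
    """Return a dict that maps each two-letter line code to its content lines."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue              # ignore blank lines
        if line.startswith("//"):
            break                 # end-of-entry marker
        code, content = line[:2], line[5:].rstrip()
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(SAMPLE_ENTRY)
print(fields["AC"], fields["GN"])
```

For real work, an established parser is usually a safer choice than hand-written slicing, since actual entries contain many more line types and continuation rules.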

TrEMBL
 TrEMBL is a computer-annotated supplement of Swiss-Prot and contains all the
translations of the EMBL sequence entries that are not yet integrated in Swiss-Prot.
The annotation of an entry in TrEMBL has not reached the standards required for
inclusion into Swiss-Prot. As further data ensure the reliability of annotations,
TrEMBL entries are moved to Swiss-Prot.
 Swiss-Prot and TrEMBL are developed by the Swiss-Prot groups at the Swiss
Institute of Bioinformatics (SIB) and at the European Bioinformatics Institute (EBI).

PIR
 PIR is a database of functionally annotated protein sequences. It tries to be
comprehensive, well organised, accurate and consistently annotated, but its entries are not annotated
as completely as those in Swiss-Prot.
 It is a division of the NBRF (National Biomedical Research Foundation) in the US.
 It has collaborated with EBI and SIB to establish UniProt (Universal Protein Resource), which provides a
single, centralised, authoritative resource for protein sequences and functional information.
 PIR also produces NRL-3D, a database of sequences extracted from the 3D structures in the PDB.
The NRL-3D database makes the sequence information in the PDB available for similarity searches and
retrieval and provides cross-reference information for use with other PIR protein sequence databases.
 Swiss-Prot and PIR overlap extensively, but there are still many sequences that can be found in only
one of them.

3. Primary structure databases
 These pertain to macromolecular structure and store data on
protein and nucleic acid structure. The primary resource for
protein structure data is the Protein Data Bank (PDB). It is
the worldwide archive of structural data maintained by the
Research Collaboratory for Structural Bioinformatics
(RCSB), at Rutgers University. The associated Nucleic Acid
Database (NDB) is also maintained here.

 The PDB is the main primary database for 3D structures of biological macromolecules.
 Data from X-ray crystallography and NMR spectroscopic studies are deposited in the PDB
(using a web-based interface called the AutoDep Input Tool). The data are extensively checked and
verified by human curators before acceptance.
 It also accepts the experimental data used to determine the structures, as well as homology models.
 PDB entries contain atomic coordinates and some structural parameters connected with the atoms.
 PDB entries are annotated, but the annotation is not as comprehensive as in Swiss-Prot.
 There are no legal restrictions on the use of PDB.
 It was established in 1971 at Brookhaven National Laboratory, New York, US. It is maintained by RCSB
(Research Collaboratory for Structural Bioinformatics).
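
The atomic coordinates mentioned above are stored in fixed-width ATOM records in the legacy PDB file format. The sketch below slices one such record by column position. The sample record is made up and is built with explicit field widths so that the columns line up; the function name `parse_atom_record` is hypothetical.

```python
def parse_atom_record(line):
    """Slice the fixed-width columns of a legacy PDB ATOM/HETATM record."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "atom_name": line[12:16].strip(),   # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],               # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
    }


# Made-up record: name, serial, atom, altLoc, residue, chain, number, iCode,
# x, y, z, occupancy and temperature factor, each padded to its column width.
sample_line = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
).format("ATOM", 1, " N", "", "MET", "A", 1, "", 27.340, 24.430, 2.614, 1.00, 9.67)

atom = parse_atom_record(sample_line)
print(atom)
```

Newer depositions are also distributed in the mmCIF format, which is not fixed-width, so this slicing approach applies only to the legacy layout.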

 Secondary databases
contain information derived from the data in the primary databases. They
consolidate, summarise, standardise, classify, index and comment on primary databases. These
are very important for inferring protein function. Examples are PROSITE, PRINTS and BLOCKS.
 Composite databases
amalgamate the information held in two or more of the primary databases. This means that
only one database needs to be searched, instead of running multiple searches on individual primary
databases.
E.g. OWL: Swiss-Prot, PIR, GenPept and NRL-3D
NRDB: Swiss-Prot and TrEMBL.
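
The idea behind a composite database can be sketched as a merge of several sources in which identical sequences are stored only once. This is a toy illustration: the source names, entry ids and sequences below are invented, and real composite databases apply more elaborate redundancy criteria than exact sequence identity.

```python
def build_composite(sources):
    """Merge {source: {entry_id: sequence}} into one non-redundant list.

    Identical sequences are kept once, together with every source id
    under which they were found.
    """
    composite = {}
    for source_name, entries in sources.items():
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key not in composite:
                composite[key] = {"sequence": key, "ids": []}
            composite[key]["ids"].append(f"{source_name}:{entry_id}")
    return list(composite.values())


# Invented toy data: the first sequence appears in both sources.
toy_sources = {
    "SourceA": {"A1": "MKTAYIAK", "A2": "MSLLTEVE"},
    "SourceB": {"B7": "mktayiak"},
}

merged = build_composite(toy_sources)
for record in merged:
    print(record["sequence"], record["ids"])
```

A search against `merged` then covers both sources in one pass, which is the practical benefit the slide describes.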

 Organism specific databases
Contain information, links and resources dedicated to particular species. They contain information
on sequence data, gene expression, mutant phenotypes, genome maps, genome sequencing projects
and relevant scientific literature and provide links to resources for obtaining clones, mutants as well
as for contacting researchers.
E.g. EcoGene (database for E. coli), Mouse Genome Database (MGD) for mouse, OMIM (Online
Mendelian Inheritance in Man)
 Specialised sequence databases
These databases have particular types of nucleic acid or protein sequences deposited in them. For
example, there are databases specifically for rRNA and tRNA sequences.

 Commercial databases
Unlike public databases which can be accessed freely by anyone using the WWW, commercial databases
require subscription as they are the result of a single company’s research and investment.
E.g. Incyte, UniGene, etc.
 Literature databases
A literature database contains the abstracts and in some cases, the full text and figures of published articles.
Such databases can be searched using text strings to find words in the title, abstract, keywords, or by author
or author’s institution. Medline was one of the earliest comprehensive online library resources. It has now
been incorporated into a large resource called PubMed maintained by the NCBI. Other examples are the
Web of Science and BioMedNet.
