Sequence and Structural Databases of DNA and Protein, and Their Significance in Scientific Research

S.BITUILA
II MSC.

DNA Databases:
 Sequence Databases
 Structural Databases

DNA Sequence Databases:
 NCBI
 EMBL
 DDBJ
 Ensembl
 GenBank
 EBI
 UniGene

NCBI (National Center for
Biotechnology Information)
 Established in 1988.
 It aims to create public databases, develop
software tools for sequence analysis and
disseminate biomedical information, mainly to
aid research in computational biology.
 Roles:
- Maintains several biological databases,
e.g. GenBank, the nucleic acid sequence
database.
- Provides a data retrieval system (e.g. Entrez).
- Provides computational resources for the
analysis of GenBank data and a variety of other
biological databases.
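
As a minimal sketch of how a retrieval system such as Entrez can be queried from a program, the snippet below builds a request URL for the `efetch` endpoint of NCBI's E-utilities. The helper name `build_efetch_url` and the example accession are illustrative assumptions; the snippet only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Return an efetch URL for one record (hypothetical helper)."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession number or UID of the record
        "rettype": rettype,  # record format, e.g. "fasta" or "gb"
        "retmode": "text",   # plain-text output
    }
    return EFETCH_BASE + "?" + urlencode(params)


# Example accession chosen only for illustration.
url = build_efetch_url("NM_000546")
print(url)
```

Opening the resulting URL in a browser, or with any HTTP client, would be expected to return the record in FASTA format.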

Tools available in NCBI:
 Entrez and BLAST, including standard BLAST,
MegaBLAST, PSI-BLAST and RPS-BLAST
 Types of Databases :
-Nucleotide database
-Literature database
-Protein database
-Gene expression
-Structural database
-Chemical and others.

EMBL (European Molecular Biology
Laboratory)
 Established in 1974; first proposed by Leo Szilard,
James Watson and John Kendrew.
 Roles:
- Incorporates, organizes and distributes
nucleotide sequences from public sources.
- Performs basic research in molecular
biology and medicine, and trains scientists,
students and visitors.
 Tools:
- PPSearch, GeneQuiz, FASTA, DALI, BLAST-2,
Radar, DaliLite, etc.

DDBJ (DNA Data Bank of Japan)
 Established in 1986.
 Roles:
- Collects nucleotide sequence data and
makes it freely available.
- Provides a supercomputer system to
support research activities in the life sciences.
 Tools:
- getentry, SRS, TXSearch, LIBRA, GIB.

Ensembl:
 Launched in 1999 in response to the imminent
completion of the Human Genome Project.
 Joint project between the European Bioinformatics
Institute and the Wellcome Trust Sanger Institute.
 It aims to provide a centralized resource for geneticists,
molecular biologists and other researchers studying the
genomes of our own species, other vertebrates and
model organisms.
 Provides genome databases for vertebrates and other
eukaryotic species.
 It is one of the well-known genome browsers for the
retrieval of genomic information.
 Plays a major role in the ENCODE (Encyclopedia of DNA
Elements) Consortium project.
 Tools: BLAST, Data Slicer, Variant Effect Predictor,
Assembly Converter, etc.

GenBank:
 Started in 1982 by Walter Goad at Los
Alamos National Laboratory.
 Produced and maintained by the National Center for
Biotechnology Information (NCBI) as part of the
International Nucleotide Sequence Database
Collaboration (INSDC).
 Roles:
- Open-access, annotated collection of all publicly
available nucleotide sequences and their protein
translations.
- Provides and encourages access within the scientific
community.
 Tools: BarSTool, Sequin, BLAST.

EBI (European Bioinformatics Institute):
 Its origins date to 1980, when the EMBL Nucleotide
Sequence Data Library was established.
 EMBL-EBI is a centre for research and services in
bioinformatics, and is part of the European Molecular Biology
Laboratory (EMBL).
 It hosts a number of open, free-to-use life science
resources, including biomedical databases, analysis tools
and bio-ontologies, such as:
- ArrayExpress: an archive of gene expression experiments.
- BioModels: a database of computational models relevant
to the life sciences.
- BioStudies: a generic data archive at EMBL-EBI for
biomolecular datasets.
- European Nucleotide Archive (ENA): a resource of
nucleotide sequencing information.

UniGene:
 It is an NCBI database of the
transcriptome and thus, despite the name,
not primarily a database of genes.
 It provides information on protein
similarities, gene expression, cDNA
clones and genomic location.

RNase P Database:
 Compilation of RNase P sequences,
sequence alignments, secondary
structures, three-dimensional models
and accessory information.
 Also contains secondary structures of
bacterial and archaeal RNAs, including
specially annotated 'reference'
secondary structures of the E. coli and
Bacillus subtilis RNase P RNAs, a
minimum phylogenetic consensus
structure, and coordinates for models
of three-dimensional structure.

Protein Databases:
 Protein Sequence Databases
 Protein Structural Databases

Protein Sequence Databases:
 PIR
 Swiss-Prot
 TrEMBL
 iProClass
 Pfam

PIR (Protein Information Resource):
 Established in 1984 by the National Biomedical Research
Foundation (NBRF).
 Roles:
- Source of annotated protein
databases and analysis tools for
researchers.
- Provides an introduction to a range of
biological databases.
- Highlights the distinction between different
data types and indicates where the most
important resources are maintained.
- Supports genomic and
proteomic research and scientific discovery.

PIR is split into four sections:
 PIR1: contains fully classified and annotated entries.
 PIR2: includes preliminary entries, which have not
been thoroughly reviewed and may contain
redundancy.
 PIR3: contains unverified entries, which have not been
reviewed.
 PIR4: entries fall into one of four categories:
- conceptual translations of artefactual
sequences
- conceptual translations of sequences that are
not transcribed or translated
- protein sequences or conceptual translations
that are extensively genetically engineered
- sequences that are not genetically encoded and
not produced on ribosomes.

Swiss-Prot:
 Founded in 1986 by Amos
Bairoch; developed by the Swiss
Institute of Bioinformatics and
subsequently by Rolf
Apweiler at EBI.
 Provides high-level annotation,
including descriptions of the function of
the protein, the structure of its domains, its
post-translational modifications, variants,
etc.
 Minimal redundancy and integration with
other databases.

TrEMBL (Translated EMBL)
 Founded in 1996 as a
computer-annotated supplement
to Swiss-Prot.
 Contains translations of all coding
sequences present in the EMBL,
GenBank and DDBJ nucleotide
sequence databases, and also
proteins extracted from the
literature or submitted to
Swiss-Prot.
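
A rough illustration of what "translation of a coding sequence" means in practice: the sketch below translates a short nucleotide coding sequence into a protein sequence with the standard genetic code. It is not TrEMBL's actual pipeline; the function name `translate` and the example sequence are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varies fastest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" for unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # prints MAK
```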

iProClass (Integrated Protein
Knowledgebase)
 First released in 2000.
 Provides a comprehensive description of protein
family, function and structure for UniProt protein
sequences.
 It contains value-added descriptions of proteins,
including family relationships at global and local
levels.
 Serves as a framework for data integration in a
distributed networking environment.
 It can also be used to support protein sequence
annotation and genomic/proteomic research, and to
obtain comprehensive, up-to-date information on
proteins.

Uses:
 iProClass provides two types of protein
sequence report. The first type covers
information on genetics, gene, family, structure,
function, taxonomy and literature, with cross-
references to molecular databases. The second
type presents PIR superfamily membership
information with length, taxonomy and
keyword statistics.
 It also provides links to various molecular
biology databases.

Pfam
 Founded in 1995 by Erik Sonnhammer, Sean Eddy and
Richard Durbin as a collection of commonly
occurring protein domains that could be used to
annotate the protein-coding genes of
multicellular animals.
 It is a database of protein families.
 Includes annotations and multiple sequence
alignments of protein families, generated using
hidden Markov models.
 The general purpose of the Pfam database is to
provide a complete and accurate classification of
protein families.
 This classification has been widely adopted by
biologists because of its wide coverage of
proteins and sensible naming conventions.

Uses:
 It is used by experimental biologists
researching specific proteins, by structural
biologists to identify new targets for
structure determination, by computational
biologists to organize sequences, and by
evolutionary biologists to trace the
origins of proteins.
 It also allows users to submit protein or
DNA sequences to search for matches to
families in the database.

Structural Databases of Protein:
 PDB
 CATH
 SCOP
 Gene3D
 DBAli
 E-MSD

PDB (Protein Data Bank):
 Established in 1971 at Brookhaven National Laboratory, New
York.
 It is a database of three-dimensional structural
data for large biological molecules, such as proteins
and nucleic acids.
 Roles:
- It is a key resource in areas of structural
biology, such as structural genomics.
- Provides protein structures to many other
databases, e.g. SCOP and CATH.
 Tools:
- ADIT (AutoDep Input Tool), pdb_extract,
OOSTAR, OpenRasMol, CIFTr, MAXIT, Biopython,
mmLib, XML2PDB.

CATH (Class, Architecture, Topology
and Homology)
 Created in the mid-1990s by Professor Christine Orengo and colleagues,
including Janet Thornton and David Jones, at University College London.
- It is a protein structure classification database and shares many
broad features with the SCOP resource.
- It provides information on the evolutionary relationships of protein
domains.
 Levels of classification:
- Class: at this level, domains are assigned according to their
secondary structure content.
- Architecture: at this level, information on the arrangement of
secondary structures in three-dimensional space is used for assignment. It
describes the gross secondary structure content and packing.
- Topology: encompasses both the overall shape and the connectivity of
secondary structures.
- Homology: groups domains that are thought to share a common ancestor,
for example those sharing more than 35% sequence identity.
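
To make the 35% sequence identity figure concrete, here is a minimal sketch that computes percent identity between two sequences that are already aligned. It assumes identity is counted over columns where neither sequence has a gap (other definitions exist), and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the non-gap columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two made-up aligned fragments; "-" marks a gap.
pid = percent_identity("MKT-AYIAK", "MKTQAYLAR")
print(pid)       # prints 75.0
print(pid > 35)  # prints True: above the 35% threshold
```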

The four levels of the CATH hierarchy:
#   Level                    Description
1.  Class                    The overall secondary structure content of the domain.
2.  Architecture             High structural similarity but no evidence of homology.
3.  Topology                 A large-scale grouping of topologies which share particular structural features.
4.  Homologous superfamily   Indicative of a demonstrable evolutionary relationship.
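
CATH identifies each superfamily with a four-part numeric code, one number per level of the hierarchy. The sketch below splits such a code into its levels; the helper names are hypothetical, the example code is used only for illustration, and the class names follow CATH's usual labels for classes 1 to 4.

```python
from typing import NamedTuple

# Names of the main CATH classes (first number of the code).
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a code such as '3.40.50.720' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-level CATH code: {code!r}")
    return CathCode(*(int(p) for p in parts))


def level_prefix(code, depth):
    """Return the code truncated to the first `depth` levels (1 to 4)."""
    return ".".join(str(n) for n in parse_cath_code(code)[:depth])


parsed = parse_cath_code("3.40.50.720")
print(parsed)
print(CLASS_NAMES.get(parsed.class_, "Other"))  # prints Alpha Beta
print(level_prefix("3.40.50.720", 2))           # prints 3.40
```

Two domains share an architecture when their codes agree on the first two numbers, and a homologous superfamily when all four agree.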

SCOP (Structural Classification of
Proteins)
 Created in 1994 at the Centre for Protein Engineering and the
Laboratory of Molecular Biology.
 Roles:
- Describes structural and evolutionary relationships
between proteins of known structure.
- Provides a broad survey of all known protein folds,
detailed information about the close relatives of any
particular protein, and a framework for future research and
classification.

E-MSD
 Established in 1996.
 Provides clean macromolecular structure
data.
 Accepts and processes depositions to the PDB.
 Transforms the PDB flat-file archive into a
relational database system.
 Manages and distributes data on
molecular structures in close collaboration
with the PDB.
 Tools: AutoDep and EMDep.
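
The idea of turning a flat-file archive into a relational database can be sketched as follows: ATOM records are read from PDB-format text by their fixed column positions and loaded into an SQLite table. This is only an illustration, not E-MSD's actual schema or software; the table layout, helper name and sample coordinates are assumptions.

```python
import sqlite3

# A few made-up lines in PDB flat-file format (fixed-width columns).
PDB_TEXT = """\
HEADER    EXAMPLE
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  N   GLN A   2      26.850  29.021   3.898  1.00  9.07           N
END
"""


def parse_atom_records(text):
    """Yield one tuple per ATOM line, using the PDB column positions."""
    for line in text.splitlines():
        if not line.startswith("ATOM"):
            continue  # ignore HEADER, END and other record types
        yield (
            int(line[6:11]),       # atom serial number
            line[12:16].strip(),   # atom name
            line[17:20].strip(),   # residue name
            line[21],              # chain identifier
            int(line[22:26]),      # residue sequence number
            float(line[30:38]),    # x coordinate
            float(line[38:46]),    # y coordinate
            float(line[46:54]),    # z coordinate
        )


rows = list(parse_atom_records(PDB_TEXT))

# Load the parsed records into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atom (serial INTEGER PRIMARY KEY, name TEXT, "
    "res_name TEXT, chain TEXT, res_seq INTEGER, x REAL, y REAL, z REAL)"
)
conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Once relational, the data can be queried with ordinary SQL.
met_atoms = conn.execute(
    "SELECT COUNT(*) FROM atom WHERE res_name = 'MET'"
).fetchone()[0]
print(met_atoms)
```

Once the records sit in a table, questions such as "which atoms belong to a given residue or chain" become single SQL queries rather than scans of the flat file.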

Gene3D:
 Provides structural annotation for proteins
in the CATH sequence database.
 It uses the information in CATH to predict
the locations of structural domains on
millions of protein sequences available in
public databases.
 Provides comprehensive structural and
functional annotation of most available
protein sequences, including the UniProt,
RefSeq and Integr8 resources.

