2. With the availability of over 165
completed genome sequences from both
eukaryotic and prokaryotic organisms,
efforts are now being focused on the
identification and functional analysis of
the proteins encoded by these genomes.
The large-scale analysis of these proteins
has started to generate huge amounts of
data due to the new information provided
by the genome projects and to a range of
new technologies in protein science.
INTRODUCTION
3. For example, mass spectrometry approaches are
being used in protein identification and in
determining the nature of post-translational
modifications. These and other methods make it
possible to quickly identify large numbers of
proteins, to map their interactions, to
determine their location within the cell and to
analyze their biological activities.
Protein sequence databases play a vital role as
a central resource for storing the data
generated by these and more conventional
efforts, and making them available to the
scientific community.
4. Universal protein databases cover proteins
from all species, whereas specialized data
collections contain information about a
particular protein family or group of
proteins, or about a specific organism.
Universal protein sequence databases can be
further subdivided into two categories:
sequence repositories (depositories), in
which data are stored with little or no
manual intervention in the creation of the
records, and expertly curated databases, in
which the original data are enhanced by the
addition of further information.
TYPES
5. Several protein sequence databases act as
repositories of protein sequences. These
databases add little or no additional information
to the sequence records they contain.
Examples: GenPept, NCBI’s Entrez Protein and
Reference Sequence (RefSeq).
SEQUENCE REPOSITORIES
6. Although repositories are an essential
means of providing the user with
sequences as quickly as possible, it is
clear that, when additional information is
added to a sequence, this greatly
increases the value of the resource for
users.
The curated databases enrich the sequence
data with additional information, which is
validated by expert biologists before being
added, so that the data in these collections
can be considered highly reliable.
UNIVERSAL CURATED DATABASES
7. SWISS-PROT is a universal protein sequence
database established in 1986 and
maintained collaboratively, since 1987, by
the Department of Medical Biochemistry of
the University of Geneva and the EMBL
Data Library.
The leading universal curated protein
sequence database is Swiss-Prot, which
contained 140 000 curated sequence
entries from over 8300 different species as
of November 2003.
SWISS-PROT
8. The database is non-redundant, which
means that all reports for a given protein
are merged into a single entry, and it is
highly integrated with other databases. Each
entry in Swiss-Prot is thoroughly analyzed
and annotated by biologists to ensure that
the database is of a high quality.
The SWISS-PROT database distinguishes
itself from other protein sequence
databases by three distinct criteria:
a high level of annotation, a minimal
level of redundancy and a high level of
integration with other databases.
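The idea of non-redundancy described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that two reports describe the same protein when their sequence strings are identical; real curation relies on expert judgement rather than a simple key like this. The field names (`sequence`, `source`, `annotations`) and the sample records are hypothetical.

```python
from collections import defaultdict


def merge_reports(reports):
    """Merge all reports that share a sequence into a single entry.

    Assumption: identical sequence strings mean "same protein".
    """
    merged = defaultdict(lambda: {"sources": [], "annotations": set()})
    for report in reports:
        entry = merged[report["sequence"]]
        entry["sources"].append(report["source"])
        entry["annotations"].update(report["annotations"])
    return dict(merged)


# Hypothetical reports; two of them describe the same sequence.
reports = [
    {"sequence": "MKTAYIAK", "source": "report_A", "annotations": {"kinase"}},
    {"sequence": "MKTAYIAK", "source": "report_B", "annotations": {"membrane"}},
    {"sequence": "MSLLTEVE", "source": "report_C", "annotations": set()},
]

entries = merge_reports(reports)
print(len(entries))  # three reports collapse into two entries
```

Each merged entry keeps the list of contributing reports, so the origin of every annotation remains traceable after merging.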
9. The Protein Information Resource (PIR) was
established in 1984 by the National
Biomedical Research Foundation (NBRF) as
a resource to assist in the identification
and understanding of protein sequence
information.
The PIR database evolved from the original
NBRF Protein Sequence Database,
developed over a 20-year period by the
late Margaret O. Dayhoff and published as
the ‘Atlas of Protein Sequence and
Structure’.
THE PROTEIN INFORMATION
RESOURCE (PIR)
10. The database is partitioned into four
sections: PIR1, PIR2, PIR3 and PIR4.
These differ in terms of the quality of their data.
Currently PIR1 and PIR2 account for ∼99% of all
entries. Entries in PIR1 are fully classified, fully
merged and extensively annotated.
THE PROTEIN INFORMATION
RESOURCE (PIR)
11. SCOP: A Structural Classification of
Proteins database.
CATH: Class, Architecture, Topology,
Homologous superfamily database.
PROTEIN STRUCTURE DATABASES
12. This database provides a detailed and
comprehensive description of the structural
and evolutionary relationships of the
proteins of known structure.
A fundamental unit of classification in SCOP
is the protein domain. The first release of
SCOP in 1995 comprised 3179 domains, 498
families, 366 superfamilies and 279 folds.
SCOP: A STRUCTURAL
CLASSIFICATION OF PROTEINS
DATABASE
13. SCOP classifies proteins on four
hierarchical levels:
Family
Superfamily
Common fold
Class
SCOP
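The hierarchy listed above can be sketched as a small Python data structure. This is only an illustration, not SCOP's own data model: the `Domain` class, the helper function and all labels such as `fold_1` are hypothetical. The levels are ordered from broadest (class) to most specific (family).

```python
from dataclasses import dataclass

# Levels ordered from broadest to most specific.
LEVELS = ("class", "fold", "superfamily", "family")


@dataclass(frozen=True)
class Domain:
    name: str
    classification: tuple  # one label per level, in LEVELS order


def deepest_shared_level(a, b):
    """Return the most specific level at which two domains are still
    grouped together, or None if they differ already at class level."""
    shared = None
    for level, x, y in zip(LEVELS, a.classification, b.classification):
        if x != y:
            break
        shared = level
    return shared


# Hypothetical domains with made-up labels.
d1 = Domain("dom1", ("class_a", "fold_1", "sf_1", "fam_1"))
d2 = Domain("dom2", ("class_a", "fold_1", "sf_1", "fam_2"))
d3 = Domain("dom3", ("class_b", "fold_9", "sf_4", "fam_7"))

print(deepest_shared_level(d1, d2))  # superfamily
print(deepest_shared_level(d1, d3))  # None
```

Two domains in different families can still share a superfamily, which is how the hierarchy expresses more distant relationships.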
14. The CATH database is a classification of
protein domains based not only on
sequence information, but also on
structural and functional properties.
The first CATH release from 1997
contained only 8,078 domains.
In addition to the four main levels, CATH
comprises five more layers, called S, O, L,
I and D. The first four layers group
domains according to increasing
sequence overlap and similarity whereas
the D-level assigns a unique identifier to
every domain.
CATH
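The nine levels described above (C, A, T, H followed by S, O, L, I and D) can be sketched in Python as follows. The sketch assumes, for illustration only, that a domain's position is written as nine dot-separated integers; the function names and the example codes are hypothetical and not taken from the CATH distribution.

```python
CATH_LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def parse_code(code):
    """Split a nine-part dotted code into a {level: number} mapping."""
    parts = code.split(".")
    if len(parts) != len(CATH_LEVELS):
        raise ValueError(f"expected {len(CATH_LEVELS)} parts, got {len(parts)}")
    return dict(zip(CATH_LEVELS, (int(p) for p in parts)))


def same_group(code_a, code_b, level):
    """True if both domains agree on every level up to and including `level`."""
    depth = CATH_LEVELS.index(level) + 1
    return code_a.split(".")[:depth] == code_b.split(".")[:depth]


# Hypothetical codes that differ from the I level onward.
a = "1.10.8.10.1.1.1.1.1"
b = "1.10.8.10.1.1.1.2.3"

print(parse_code(a)["H"])      # 10
print(same_group(a, b, "L"))   # True
print(same_group(a, b, "I"))   # False
```

Because the D level is a unique identifier, two distinct domains should never agree on all nine levels.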