PROTEIN DATABASE

NAVEED UL MUSHTAQ
DEPT OF BIORESOURCES KU

 With the availability of over 165
completed genome sequences from both
eukaryotic and prokaryotic organisms,
efforts are now being focused on the
identification and functional analysis of
the proteins encoded by these genomes.
 The large-scale analysis of these proteins
has started to generate huge amounts of
data due to the new information provided
by the genome projects and to a range of
new technologies in protein science.

 For example, mass spectrometry approaches are
being used in protein identification and in
determining the nature of post-translational
modifications. These and other methods make it
possible to quickly identify large numbers of
proteins, to map their interactions, to determine
their location within the cell and to analyze
their biological activities.
 Protein sequence databases play a vital role as a
central resource for storing the data generated
by these and more conventional efforts, and
making them available to the scientific
community.

 Universal protein databases cover proteins
from all species,
whereas specialized data collections contain
information about a particular protein family
or group of proteins, or about a specific
organism.
 Universal protein sequence databases can be
further subdivided into two categories:
 sequence repositories (depositories), in
which data are stored with little or no
manual intervention in the creation of the
records;

 and expertly curated databases, in which
the original data are enhanced by the
addition of further information.

 Several protein sequence databases act as
repositories of protein sequences. These
databases add little or no additional
information to the sequence records they
contain.
 Examples: GenPept, NCBI’s Entrez Protein and
Reference Sequence (RefSeq).

 Although repositories are an essential means
of providing the user with sequences as
quickly as possible, it is clear that, when
additional information is added to a
sequence, this greatly increases the value of
the resource for users.
 The curated databases enrich the sequence
data with further information, which is
validated by expert biologists before
being added to the databases, so that
the data in these collections can be
considered highly reliable.

 Swiss-Prot is a universal protein sequence
database established in 1986 and maintained
collaboratively, since 1987, by the
Department of Medical Biochemistry of the
University of Geneva and the EMBL Data
Library.
 The leading universal curated protein
sequence database is Swiss-Prot, which
contained 140 000 curated sequence entries
from over 8300 different species as of
November 2003.

 The database is non-redundant, which means
that all reports for a given protein are
merged into a single entry, and is highly
integrated with other databases. Each entry
in Swiss-Prot is thoroughly analyzed and
annotated by biologists to ensure that the
database is of a high quality.
 The Swiss-Prot database distinguishes itself
from other protein sequence databases by
three criteria: a high level of
annotation, a minimal level of redundancy
and a high level of integration with other
databases.
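
 The idea of merging all reports for a protein into a single entry can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each report carries an organism, a gene name, a sequence and a source label, and that two reports describe the same protein when organism and gene name match. The field names, the toy sequences and the merge_reports helper are hypothetical; real Swiss-Prot merging relies on expert curation, not on a simple key match.

```python
from collections import OrderedDict


def merge_reports(reports):
    """Merge reports that describe the same protein into one entry each.

    Assumption: the pair (organism, gene) identifies a protein.
    """
    entries = OrderedDict()
    for report in reports:
        key = (report["organism"], report["gene"])
        entry = entries.setdefault(
            key,
            {
                "organism": report["organism"],
                "gene": report["gene"],
                "sequences": [],
                "sources": [],
            },
        )
        # Store each distinct sequence once; differing sequences would be
        # candidates for variant or conflict annotation in a curated entry.
        if report["sequence"] not in entry["sequences"]:
            entry["sequences"].append(report["sequence"])
        # Every contributing report stays traceable through its source label.
        entry["sources"].append(report["source"])
    return list(entries.values())


# Toy input: two reports for one protein, one report for another.
reports = [
    {"organism": "Homo sapiens", "gene": "HBB", "sequence": "MVHLTPEEK", "source": "report-1"},
    {"organism": "Homo sapiens", "gene": "HBB", "sequence": "MVHLTPEEK", "source": "report-2"},
    {"organism": "Mus musculus", "gene": "Hbb", "sequence": "MVHLTDAEK", "source": "report-3"},
]
merged = merge_reports(reports)
print(len(merged))  # 2 entries remain from 3 reports
```

 Keeping the list of sources on the merged entry mirrors the point that a non-redundant entry still records where each report came from.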

 The Protein Information Resource (PIR) was
established in 1984 by the National
Biomedical Research Foundation (NBRF) as a
resource to assist in the identification and
understanding of protein sequence
information.
 The PIR database evolved from the original
NBRF Protein Sequence Database, developed
over a 20-year period by the late Margaret O.
Dayhoff and published as the ‘Atlas of
Protein Sequence and Structure’.

 The database is partitioned into four
sections: PIR1, PIR2, PIR3 and PIR4.
 These differ in terms of quality of data.
Currently PIR1 and PIR2 account for ∼99% of
all entries. Entries in PIR1 are fully
classified, fully merged and extensively
annotated.

 SCOP: the Structural Classification of Proteins
database
 CATH: Class, Architecture, Topology, Homologous
superfamily

 The SCOP database provides a detailed and
comprehensive description of the structural
and evolutionary relationships of the proteins
of known structure.
 A fundamental unit of classification in SCOP is
the protein domain. The first release of SCOP
in 1995 comprised 3179 domains, 498
families, 366 superfamilies and 279 folds.

 SCOP classifies proteins on four
hierarchical levels:
 Family
 Superfamily
 Common fold
 Class
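
 The four levels listed above form a simple nested hierarchy, which can be sketched as a small Python data structure. This is an illustrative sketch only: the ScopPlacement class, the share_level helper and the placeholder labels ("fold A", "family A1a" and so on) are hypothetical and are not taken from SCOP itself. The code lists the levels from broadest (class) to most specific (family), the reverse of the order on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopPlacement:
    """Position of one protein domain in the four-level hierarchy."""
    scop_class: str
    fold: str
    superfamily: str
    family: str

    def lineage(self):
        # Broadest level first, most specific level last.
        return [self.scop_class, self.fold, self.superfamily, self.family]


def share_level(a, b):
    """Return the most specific level at which two domains agree, or None."""
    names = ["class", "fold", "superfamily", "family"]
    deepest = None
    for name, x, y in zip(names, a.lineage(), b.lineage()):
        if x != y:
            break
        deepest = name
    return deepest


# Placeholder labels, not real SCOP names.
dom1 = ScopPlacement("all alpha", "fold A", "superfamily A1", "family A1a")
dom2 = ScopPlacement("all alpha", "fold A", "superfamily A1", "family A1b")
dom3 = ScopPlacement("all beta", "fold B", "superfamily B1", "family B1a")

print(share_level(dom1, dom2))  # superfamily
print(share_level(dom1, dom3))  # None
```

 Two domains that agree down to the superfamily level but differ in family, as dom1 and dom2 do here, sit in the same branch of the hierarchy until the last step.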

 The CATH database is a classification of
protein domains based not only on sequence
information, but also on structural and
functional properties.
 The first CATH release, from 1997, contained
only 8,078 domains.
 In addition to the four main levels (Class,
Architecture, Topology and Homologous
superfamily), CATH comprises five more layers,
called S, O, L, I and D. The first four layers
group domains according to increasing sequence
overlap and similarity, whereas the D-level
assigns a unique identifier to every domain.
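
 The nine levels (C, A, T, H followed by S, O, L, I, D) can be pictured as the fields of a dotted identifier. The Python sketch below assumes, for illustration, that a domain is labelled by nine dot-separated integers in that order; the sample identifier is made up, and parse_cath_id and classification_prefix are hypothetical helpers, not part of any CATH software.

```python
# Level codes in order: four main levels, then the five additional layers.
CATH_LEVELS = ["C", "A", "T", "H", "S", "O", "L", "I", "D"]


def parse_cath_id(identifier):
    """Split a nine-part dotted identifier into a level -> number mapping."""
    parts = identifier.split(".")
    if len(parts) != len(CATH_LEVELS):
        raise ValueError(
            "expected %d dot-separated fields, got %d"
            % (len(CATH_LEVELS), len(parts))
        )
    return dict(zip(CATH_LEVELS, (int(p) for p in parts)))


def classification_prefix(identifier, level):
    """Return the identifier truncated at the given level, e.g. 'H'."""
    depth = CATH_LEVELS.index(level) + 1
    return ".".join(identifier.split(".")[:depth])


# Made-up identifier used only to exercise the helpers.
example = "1.10.8.10.1.1.1.1.1"
levels = parse_cath_id(example)
print(levels["H"])                          # 10
print(classification_prefix(example, "H"))  # 1.10.8.10
```

 Two domains with the same prefix up to the H level would share all four main levels, while the D field alone distinguishes individual domains.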
