2. 1/23/2024 Computational Structural Biology (BIO455) - CC 87
Biological databases
Biological data are complex, exception-ridden, vast, and incomplete. A collection of biological data arranged
in computer-readable form, so that it is fast to search and retrieve and convenient to use, is called a
biological database.
The main purpose of a biological database is to store and manage biological data and information in
computer-readable form.
Biological databases can be used to retrieve a range of information, including:
biological sequences
structures
binding sites
metabolic interactions
molecular action
functional relationships
protein families, motifs, and homologs
Primary databases
A primary database is populated with experimentally derived data such as genome sequences and macromolecular
structures. It can also be called an archival database, since it archives the experimental results submitted
by scientists.
The data entered here remain uncurated (no modifications are made to the data). A primary database contains
unique data obtained from the laboratory, and these data are made accessible to ordinary users without any
change.
Each record is given an accession number when it is entered into the database, and the same record can later
be retrieved using that number. An accession number uniquely identifies a record and never changes.
Examples:
Nucleic acid databases: GenBank and DDBJ
Protein databases: PDB, Swiss-Prot, PIR, TrEMBL, MetaCyc, etc.
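The accession-number behaviour described for primary databases can be sketched with a toy in-memory archive. This is only an illustration, not how GenBank or the PDB are implemented: the class name `PrimaryArchive`, the `DEMO` prefix, and the record fields are all hypothetical.

```python
from types import MappingProxyType


class PrimaryArchive:
    """Toy archive: records are stored as submitted, keyed by a stable accession."""

    def __init__(self, prefix="DEMO"):
        self._prefix = prefix      # hypothetical accession prefix
        self._records = {}
        self._next_id = 1

    def deposit(self, data):
        """Store a submitted record unchanged and return its new accession number."""
        accession = f"{self._prefix}{self._next_id:06d}"
        self._next_id += 1
        # Keep a read-only copy: archived data are not modified after submission.
        self._records[accession] = MappingProxyType(dict(data))
        return accession

    def fetch(self, accession):
        """Retrieve a record by the accession number assigned at deposit time."""
        return self._records[accession]


archive = PrimaryArchive()
acc = archive.deposit({"kind": "nucleotide", "sequence": "ATGGCC"})
record = archive.fetch(acc)
```

The accession is assigned once at deposit time and is the only key used for retrieval, which mirrors the "unique and never changes" property stated above.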
Secondary databases
The data stored in these databases are the analyzed results of primary database data.
Computational algorithms are applied to the primary data, and the resulting meaningful, informative data are
stored in the secondary database.
The data here are highly curated (the data are processed before they are presented in the database).
Because of this curation, a secondary database typically carries more interpreted, value-added knowledge than
the primary database it draws on.
Examples:
InterPro (protein families, motifs, and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Derived databases
The data entered in these databases are first compared and then filtered according to the desired criteria.
The initial data are taken from primary databases and then merged based on certain conditions.
Derived databases contain non-redundant data, which helps in searching sequences rapidly.
Examples:
SCOP, CATH, KEGG
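The compare, filter, and merge steps described for derived databases can be sketched as follows. This is a minimal illustration under simple assumptions: the filter criterion (a minimum sequence length), the merge condition (exact sequence identity), and the accessions `P1` to `P4` are hypothetical and not taken from any real database.

```python
def build_nonredundant(records, min_length=1):
    """Filter primary records, then merge entries that share an identical sequence.

    records: iterable of (accession, sequence) pairs.
    Returns a dict mapping each unique sequence to the accessions it came from.
    """
    merged = {}
    for accession, sequence in records:
        seq = sequence.upper()            # compare sequences case-insensitively
        if len(seq) < min_length:         # filter step: drop very short entries
            continue
        merged.setdefault(seq, []).append(accession)   # merge step
    return merged


primary_records = [
    ("P1", "MKTAYIAK"),
    ("P2", "mktayiak"),   # same sequence as P1, submitted in lower case
    ("P3", "MK"),         # too short, removed by the filter
    ("P4", "GSHMLEDP"),
]
nonredundant = build_nonredundant(primary_records, min_length=5)
```

Each unique sequence is stored once while the list of source accessions is kept, so a search scans fewer entries without losing the link back to the primary records.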
Protein Sequence Databases
PIR (https://proteininformationresource.org/)
PIR (Protein Information Resource) is a popular protein sequence database that provides information on
functionally annotated protein sequences.
PIR maintains three databases: the Protein Sequence Database (PSD), the Non-redundant Reference (NREF)
sequence database, and the integrated Protein Classification (iProClass) database. Together, these contain
annotated protein sequences, classification information, and protein family, function, and structure
information.
Swiss-Prot (integrated with UniProt)
Swiss-Prot is a protein sequence database that provides a high level of annotation, including
information on the protein's function, domain structure, post-translational modifications, and variants.
Swiss-Prot is jointly managed by the SIB (Swiss Institute of Bioinformatics) and the EBI (European
Bioinformatics Institute).
The database distinguishes itself from other protein sequence databases by three criteria:
(i) annotations, which cover a broad range of information,
(ii) minimal redundancy, which ensures that each sequence is represented only once, and
(iii) integration with other databases, which enables cross-referencing and retrieval of information from
related databases.
TrEMBL
TrEMBL is a computer-annotated supplement of Swiss-Prot. TrEMBL entries follow the Swiss-Prot format.
It contains all the translations of EMBL (European Molecular Biology Laboratory) nucleotide sequence entries
that have not yet been integrated into Swiss-Prot.
Protein Structure Databases: Protein Data Bank (PDB)
Protein structure databases are collections of information related to the
three-dimensional structure and secondary structure of proteins.
The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological
molecules, such as proteins and nucleic acids.
The data are typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron
microscopy, and are submitted by biologists and biochemists from around the world. They are freely accessible
on the Internet via the websites of the PDB's member organisations: PDBe, PDBj, RCSB, and BMRB.
Most major scientific journals and some funding agencies now require scientists to submit their structure
data to the PDB.
Many other databases use protein structures deposited in the PDB. For example, SCOP and CATH classify
protein structures, while PDBsum provides a graphic overview of PDB entries using information from
other sources, such as Gene Ontology.
www.wwpdb.org
ebi.ac.uk
www.rcsb.org
bmrb.io
pdbj.org
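To make "three-dimensional structural data" concrete, the sketch below reads the coordinates of one atom from a single `ATOM` line in the legacy fixed-column PDB text format. The slides do not describe the file format, so the column positions used here are an assumption based on the published wwPDB format description, and the example line is built by hand for illustration. The archive also distributes entries in mmCIF format, which needs a different parser.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column ATOM/HETATM record (legacy PDB format)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38, in angstroms
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Hand-built example record; each field is padded to its column width.
EXAMPLE_LINE = "".join([
    "ATOM  ", "1".rjust(5), " ", " N  ", " ", "MET", " ", "A",
    "1".rjust(4), " ", "   ",
    "27.340".rjust(8), "24.430".rjust(8), "2.614".rjust(8),
    "1.00".rjust(6), "9.67".rjust(6), " " * 10, "N".rjust(2),
])
atom = parse_atom_line(EXAMPLE_LINE)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in this format can touch without a separating space.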
Structural Classification of Proteins (SCOP) database
http://scop.mrc-lmb.cam.ac.uk/scop/
SCOP is a manual classification of protein structural domains based on similarities of their structures and
amino acid sequences.
A motivation for this classification is to determine the evolutionary relationships between proteins.
Proteins with the same shape but little sequence or functional similarity are placed in different
superfamilies and are assumed to have only a very distant common ancestor.
Proteins with the same shape and some similarity of sequence and/or function are placed in the same family
and are assumed to have a closer common ancestor.
Levels of the SCOP hierarchy:
I. Class: Types of folds, e.g., beta sheets.
II. Fold: The different shapes of domains within a class.
III. Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant common ancestor.
IV. Family: The domains in a superfamily are grouped into families, which have a more recent common ancestor.
V. Protein domain: The domains in families are grouped into protein domains, which are essentially the same protein.
VI. Species: The domains in "protein domains" are grouped according to species.
VII. Domain: Part of a protein. For simple proteins, it can be the entire protein.
Classes
1. All alpha proteins: Domains consisting of α-helices
2. All beta proteins: Domains consisting of β-sheets
3. Alpha and beta proteins (a/b): Mainly parallel beta sheets (beta-alpha-beta units)
4. Alpha and beta proteins (a+b): Mainly antiparallel beta sheets (segregated alpha and beta regions)
5. Multi-domain proteins (alpha and beta): Folds consisting of two or more domains belonging to different classes
6. Membrane and cell surface proteins and peptides: Does not include proteins in the immune system
7. Small proteins: Usually dominated by metal ligand, cofactor, and/or disulfide bridges
Folds
Each class contains a number of distinct folds. This classification level indicates similar tertiary structure,
but not necessarily evolutionary relatedness.
For example, the "All-α proteins" class contains >280 distinct folds, including:
Globin-like (core: 6 helices; folded leaf, partly opened),
long alpha-hairpin (2 helices; antiparallel hairpin, left-handed twist) and
Type I dockerin domains (tandem repeat of two calcium-binding loop-helix motifs)
Superfamily
Domains within a fold are further classified into superfamilies.
A superfamily is the largest grouping of proteins for which structural similarity is sufficient to indicate
evolutionary relatedness, and therefore a common ancestor.
For example, the two superfamilies of the "Globin-like" fold are the Globin superfamily and the alpha-helical
ferredoxin superfamily.
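The class, fold, superfamily, and family levels can be compared programmatically. The sketch below assumes SCOP's dotted classification string (for example `a.1.1.2`, read as class.fold.superfamily.family), which is not described in the slides. The mapping of the letters a to g onto the seven classes listed above is likewise an assumption, and the example identifiers are used only for illustration.

```python
SCOP_LEVELS = ("class", "fold", "superfamily", "family")

# Assumed letter codes for the seven classes, in the order listed above.
SCOP_CLASSES = {
    "a": "All alpha proteins",
    "b": "All beta proteins",
    "c": "Alpha and beta proteins (a/b)",
    "d": "Alpha and beta proteins (a+b)",
    "e": "Multi-domain proteins (alpha and beta)",
    "f": "Membrane and cell surface proteins and peptides",
    "g": "Small proteins",
}


def split_sccs(sccs):
    """Split a string such as 'a.1.1.2' into its four level identifiers."""
    parts = sccs.split(".")
    if (len(parts) != 4 or parts[0] not in SCOP_CLASSES
            or not all(p.isdigit() for p in parts[1:])):
        raise ValueError(f"unexpected classification string: {sccs!r}")
    return parts


def deepest_shared_level(sccs_a, sccs_b):
    """Return the most specific level two domains share, or None if the classes differ."""
    shared = None
    for level, a, b in zip(SCOP_LEVELS, split_sccs(sccs_a), split_sccs(sccs_b)):
        if a != b:
            break
        shared = level
    return shared


level = deepest_shared_level("a.1.1.2", "a.1.2.1")
```

Two domains that share only a fold have similar tertiary structure without implied common ancestry, whereas a shared superfamily or family implies an evolutionary relationship, as described above.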
CATH database
cathdb.info
The CATH Protein Structure Classification database is a free, publicly available online resource that provides
information on the evolutionary relationships of protein domains.
The four main levels of the CATH hierarchy:
1. Class: The overall secondary-structure content of the domain. (Equivalent to the SCOP Class)
2. Architecture: High structural similarity but no evidence of homology.
3. Topology/fold: A large-scale grouping of topologies which share particular structural features (Equivalent
to the 'fold' level in SCOP)
4. Homologous superfamily: Indicative of a demonstrable evolutionary relationship. (Equivalent to SCOP
superfamily)
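The four levels can be read directly off a CATH superfamily identifier, which is written as four dot-separated numbers (Class.Architecture.Topology.Homologous superfamily). The identifier format is not given in the slides and is stated here as an assumption; `3.40.50.300` is used only as an example, and the function name is hypothetical.

```python
CATH_LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def expand_cath_id(cath_id):
    """Return (level name, identifier prefix) pairs for a CATH-style identifier."""
    parts = cath_id.split(".")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected CATH identifier: {cath_id!r}")
    # Each level is identified by the prefix of the identifier up to that depth.
    return [(CATH_LEVELS[i], ".".join(parts[: i + 1])) for i in range(len(parts))]


lineage = expand_cath_id("3.40.50.300")
```

Because every level is a prefix of the next, all domains in the same architecture or topology can be found with a simple prefix comparison.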
Protein-Protein Interaction Databases
Protein-protein interaction databases are collections of information on the interactions between proteins.
They capture the relationships between different proteins and their functions in biological systems.
BIND (https://bio.tools/bind)
BIND (Biomolecular Interaction Network Database) is a database that stores detailed descriptions of interactions,
molecular complexes, and pathways between various biomolecules, including proteins, nucleic acids, and small
molecules.
The database is designed to be used for data mining and can be used to study networks of interactions and map
pathways across different species. The database can also provide information for kinetic simulations.
DIP (https://dip.doe-mbi.ucla.edu/dip/Main.cgi)
DIP (Database of Interacting Proteins) is a database that contains protein-protein interaction information that has been
compiled through both manual curation and computational methods.
It is useful for understanding protein functions, and their relationships with other proteins. It can also be used to study
the properties of networks of interacting proteins, evaluate predictions of protein-protein interactions, and explore the
evolution of these interactions.
MINT (https://mint.bio.uniroma2.it/)
MINT (Molecular INTeraction database) is a database that stores information on functional interactions between
biological molecules such as proteins, RNA, and DNA.
It also stores information on enzymatic modifications of partner molecules.
The database primarily focuses on experimentally verified protein-protein interactions and considers both direct and
indirect relationships.
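Interaction records of the kind these databases store are often analyzed as a network. The sketch below is a minimal illustration and does not use the export format of BIND, DIP, or MINT: it treats each interaction as an unordered pair and builds an undirected graph. The protein names `P1` to `P4` and the function names are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected interaction network as a dict of partner sets."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def degree(network, protein):
    """Number of distinct interaction partners of a protein."""
    return len(network.get(protein, ()))


def shared_partners(network, a, b):
    """Proteins that interact with both a and b."""
    return network.get(a, set()) & network.get(b, set())


pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(pairs)
hub = max(network, key=lambda p: degree(network, p))   # most connected protein
```

Degree and shared partners are two of the simplest network properties. A highly connected protein such as `hub` is a natural starting point when studying the properties of a network of interacting proteins.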
Protein Pattern and Profile Databases
Protein pattern and profile databases contain information on motifs found in sequences.
Sequence motifs correspond to structural or functional features in proteins.
So, the use of protein sequence patterns or profiles is a valuable tool in determining the function of proteins.
InterPro (https://www.ebi.ac.uk/interpro/)
InterPro is a database that contains information on protein families, domains, and functional sites.
It was created by combining several major protein signature databases, including PROSITE, Pfam, PRINTS, ProDom, and
SMART into a single comprehensive resource.
PROSITE (https://prosite.expasy.org/)
PROSITE is a collection of signatures that identify patterns or profiles in proteins, which can provide information on
their biological functions.
The signatures in the database are linked to annotation documents that provide information on the protein family or
domain detected, including its name, function, 3D structure, and references.
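A PROSITE pattern is written in a compact syntax that maps closely onto regular expressions. The converter below is a simplified sketch rather than the official PROSITE scanner. It assumes the commonly documented syntax: elements separated by `-`, `x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repetition, and `<` or `>` for the sequence ends. The example is the widely cited N-glycosylation site pattern, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split the element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"empty element in pattern: {pattern!r}")
        core, low, high = match.groups()
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            token = core                       # single residue or [allowed set]
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        pieces.append(prefix + token + suffix)
    return "".join(pieces)


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")   # "N[^P][ST][^P]"
hit = re.search(glyco, "MKNGSAQ")             # hypothetical sequence
```

A match only shows that the motif occurs in the sequence. Short patterns like this one match many sequences by chance, which is one reason profiles are used alongside patterns.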
Metabolic Pathway Databases
Metabolic pathway databases contain information about enzymes, biochemical reactions, and metabolic pathways.
ENZYME (https://enzyme.expasy.org/)
ENZYME is a database that stores information on enzyme nomenclature.
It is used as the nomenclature source for enzyme names and reactions by most metabolic databases as well as by other
biomolecular databases.
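Enzyme nomenclature is organized around EC (Enzyme Commission) numbers, four dot-separated fields whose first field gives the top-level enzyme class. The slides do not describe EC numbers, so the sketch below is an illustration based on the standard scheme. The function name is hypothetical, and a `-` field is treated as "not specified".

```python
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec_number(ec: str) -> dict:
    """Split an EC number such as 'EC 1.1.1.1' into its class name and fields."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields in EC number: {ec!r}")
    # A '-' marks a level that is not specified (partial EC number).
    numbers = tuple(None if p == "-" else int(p) for p in parts)
    if numbers[0] not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class in {ec!r}")
    return {"class": EC_CLASSES[numbers[0]], "numbers": numbers}


adh = parse_ec_number("EC 1.1.1.1")   # alcohol dehydrogenase
```

Because the scheme is hierarchical, a partial number such as `3.4.-.-` still identifies a whole group of enzymes, which is convenient when cross-referencing pathway databases.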
KEGG (https://www.genome.jp/kegg/pathway.html)
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive database that maps out molecular and cellular
pathways involving interactions between genes and molecules.
It is composed of pathway maps, molecule tables, gene tables, and genome maps, and is used to build functional maps
of metabolic and regulatory pathways.