Protein Structure Project - Presentation Transcript
PROJECT
Evaluation of Information about Protein Structures from Genome Sequencing Projects
PPS 2007/08
PG Cert. Principles of Protein Structure
Birkbeck, University Of London
Sami El-Sabbahy
ID Number:
12408964
Abstract
The determination, sequencing and visualisation of protein structures from genome sequencing projects is commonly referred to as Structural
Genomics. Structural Genomics is a specific area of bioinformatics, which uses a mixture of experimentally determined nucleic
acid/polypeptide sequences and 3-Dimensional protein structures, along with high levels of computational prediction, to build
3-Dimensional structures of unknown/undetermined proteins. This document reviews how and where the structures are stored and how protein
structures are predicted, and broadly examines and discusses what types of information about protein structures the databases contain
and how that information is expressed.
1. Introduction
The genome project is the mapping and sequencing of the genome, not only of humans but also of other organisms. Part of this sequencing effort
is “Bioinformatics”, the study, design and use of computational and mathematical tools to process biologically derived data. In the 1980s
the US Department of Energy commissioned the first genome project, which mapped the physical and genetic aspects of the human
genome. This was followed by the formation of three sequence data centres: the DNA Data Bank of Japan (DDBJ), the European Molecular
Biology Laboratory (EMBL) and the American GenBank at NCBI. The primary function of these three organisations was to create and
maintain nucleotide sequence databases, protein sequence databases and structural classification databases, together with resources for
sequence alignment and database searching, protein structure, RNA structure, protein structure prediction/modelling, and phylogeny and
biodiversity. The original genome project was geared towards the human genome and was later expanded to include organisms such as
Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans. Within the realms of bioinformatics and
the genome project there is extensive information about protein structures and about polypeptide, RNA and DNA sequences. Over the last
couple of decades the amount of information about these sequences and structures has increased exponentially.
Protein structures are made from folded polypeptide chains, which are built from sequences of 20 different units known as
“amino acids”. The amino acid sequence forms a polymer chain, known as the “Primary Structure”. Segments of the chain fold into
specific local conformations through hydrogen bonding and other molecular interactions; these loops, turns, α helices and β sheets
are known as the “Secondary Structure”. The way in which the α helices and β sheets pack together to build a more extensive structure is
known as the “Tertiary Structure”. When there is more than one polypeptide in a protein, the arrangement of the chains, which are held
together mainly by non-covalent interactions and in some cases by disulphide bonds, is known as the “Quaternary Structure”. Amino acids
are the basic units of a protein. Each contains two functional groups, the amino group (NH2) and the carboxyl group (COOH), along with
an organic side chain, also known as the R group, which varies in size as well as in chemical and molecular properties. These groups are
connected by a central carbon atom known as the α carbon. Amino acids form chains of any length and composition, in which the residues
are connected, as said above, through covalent bonds known as peptide bonds. The first three-dimensional structure that was determined
was that of Deoxyribonucleic Acid (DNA), by James Watson and Francis Crick in 1953, using X-ray diffraction data from Rosalind Franklin
and Maurice Wilkins. Protein structural analysis is done in one of two ways: through experimental techniques (structural determination)
or through modelling techniques (structural prediction). These techniques are used not only to determine a protein’s structure but also
to visualise it. The experimental techniques involved in determining protein structure are X-Ray Diffraction (crystallography),
Nuclear Magnetic Resonance (NMR) and Cryo-Electron Microscopy (cryo-EM). The structural prediction techniques used to fast track the
determination and modelling of protein structures are Ab Initio modelling methods and Homology modelling methods. The first protein to
be sequenced was insulin (a hormone) in 1955 and the first enzyme was ribonuclease in 1960. The first method of sequence determination
was Edman Degradation (Dansylation), which was later transferred to Mass Spectrometry in 1979.
Figure: Francis Crick and James D. Watson. ADAPTED FROM: Chemistry & Biochemistry Department, University of California [1].
These techniques are used to determine polypeptide sequences. Once a sequence has been determined, the way in which it folds and the
side chain interactions can be worked out, and molecular dynamics, energy minimization and molecular mechanics need to be taken into
account. A variety of software is used to visualise protein structures, and these structures are maintained and stored within the
databases stated above. The use of genomes for the purpose of determining the structure of molecules such as proteins is known as
“Structural Genomics” (SG). Structural Genomics, through structural biology, has enabled the discovery of relationships between amino
acid sequences and protein structures, and has allowed information and concepts about protein family, fold and super-family to be
developed. This has further enabled detailed taxonomies and a better understanding of the three-dimensional shapes of proteins.
2. Protein Structures
A protein is a polymer chain built from monomer units known as amino acids. A protein’s structure, as well as its function, is
determined by the sequence and properties of its amino acid monomers. There are four successive levels of organisation in a protein’s
structure: primary (1°), secondary (2°), tertiary (3°) and quaternary (4°). As stated above, the primary structure of a protein is the
linear sequence of amino acids (monomers) in a polypeptide chain (polymer). The secondary structure is the local geometric conformation
adopted by segments of the polymer chain. The tertiary structure is the folding of the secondary structure elements into a
three-dimensional shape, whilst the quaternary structure is the organization of protein subunits (Branden, C. et al. 1991; Zubay, G.L.
1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
The monomeric unit of a protein’s polymer chain is the amino acid, which has five separate constituents: the central alpha carbon
atom and the four substituents connected to it, namely the alpha proton (-H), the side chain (-R), the carboxylic acid functional group
(-COOH) and the amino functional group (-NH2). With the exception of glycine, all alpha carbons are asymmetric, and each amino acid
containing an asymmetric α carbon atom occurs in proteins as the L-isomer, see figure 2.1 (Branden, C. et al. 1991; Zubay, G.L. 1998;
Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
Figure 2.1 - Amino Acid Molecular Structure: Shows a simplified diagram of the structural components and functional groups of an
amino acid. ADAPTED FROM: the Protein ChemCards, Chemistry Department at Hibbing Community College [2].
Polypeptide chains are formed through condensation reactions, in which water is produced when the amino group of one amino acid
reacts with the carboxyl group of another, forming a covalent C-N (peptide) bond, see figure 2.2. Primary databases are used to match
polypeptide sequences against the sequence information they hold, which is derived primarily from DNA/RNA sequence analysis
(Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
Figure 2.2 - Amino Acid Molecular Structure: Shows the covalent bonding that occurs between the amino group of one amino acid and the
carboxyl group of another amino acid. ADAPTED FROM: Department of Biology, University of Winnipeg [3].
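The loss of one water molecule per peptide bond can be illustrated with a small calculation. The following is a minimal sketch, assuming rounded average molecular masses for a handful of free amino acids; the function name and the table are illustrative only and are not taken from the project.

```python
# Rounded average molecular masses (g/mol) of a few free amino acids (assumed values).
FREE_AMINO_ACID_MASS = {
    "G": 75.07,   # glycine
    "A": 89.09,   # alanine
    "S": 105.09,  # serine
    "V": 117.15,  # valine
    "L": 131.17,  # leucine
}
WATER_MASS = 18.015  # one water molecule is released per peptide bond formed


def peptide_mass(sequence: str) -> float:
    """Mass of a linear peptide: sum of the free amino acids minus one water per bond."""
    if not sequence:
        return 0.0
    total = sum(FREE_AMINO_ACID_MASS[residue] for residue in sequence)
    peptide_bonds = len(sequence) - 1
    return total - peptide_bonds * WATER_MASS


if __name__ == "__main__":
    # Glycylglycine: two glycines joined by one peptide bond.
    print(f"Gly-Gly mass: {peptide_mass('GG'):.2f} g/mol")
```

A chain of n residues contains n - 1 peptide bonds, so the correction grows with chain length.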
The secondary structure is the spatial arrangement of a segment of a polypeptide sequence. There are three major structural
conformations that commonly occur in the secondary structure: alpha helices, beta sheets and turns. A regular secondary structure
arises when all the Φ bond angles in a polypeptide segment are equal to each other and all the ψ bond angles are equal. The alpha helix
and beta sheet structures are thermodynamically stable, whereas turns are favoured by certain amino acids. The conformation of a
protein’s secondary structure depends on the properties of the sequence and of the amino acids within it, see figure 2.3 (Branden, C.
et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
Figure 2.3 - Secondary α and β structures: Shows the two most common secondary structures in a protein, the α helix and the β sheet.
ADAPTED FROM: Department of Biology, University of Winnipeg [3].
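The rule that a regular secondary structure has repeating Φ and ψ angles can be written as a simple check. This is only a sketch: the reference angles are approximate textbook values, the tolerances are arbitrary assumptions, and angle wrap-around at ±180° is ignored.

```python
# Approximate reference backbone angles in degrees (assumed textbook values).
REFERENCE_ANGLES = {
    "alpha helix": (-57.0, -47.0),
    "beta strand": (-120.0, 130.0),
}


def is_regular(angles, tolerance=15.0):
    """True if every (phi, psi) pair in the segment is close to the first pair."""
    if not angles:
        return False
    phi0, psi0 = angles[0]
    return all(
        abs(phi - phi0) <= tolerance and abs(psi - psi0) <= tolerance
        for phi, psi in angles
    )


def classify_segment(angles, tolerance=30.0):
    """Label a regular segment by the nearest reference conformation."""
    if not is_regular(angles):
        return "irregular"
    mean_phi = sum(phi for phi, _ in angles) / len(angles)
    mean_psi = sum(psi for _, psi in angles) / len(angles)
    for name, (ref_phi, ref_psi) in REFERENCE_ANGLES.items():
        if abs(mean_phi - ref_phi) <= tolerance and abs(mean_psi - ref_psi) <= tolerance:
            return name
    return "unclassified"


if __name__ == "__main__":
    helix_like = [(-60.0, -45.0), (-58.0, -47.0), (-55.0, -50.0)]
    print(classify_segment(helix_like))
```

Real assignment programs rely on hydrogen-bond patterns as well as angles, so this check is illustrative rather than a substitute for them.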
Within the secondary structure there are distortions, such as alpha helical curvature and the 3₁₀ helix, in which the CO group of one
amino acid forms a hydrogen bond with the NH group of the amino acid 3 residues along the chain (in the name, 3 is the number of
residues per turn, whereas 10 is the number of atoms contained within the ring closed by the hydrogen bond). Certain proteins contain
an additional level of secondary structural organisation: the ordered arrangement of secondary structure elements known as the
super-secondary structure. This ordered organisation forms structurally functional sections of the protein known as motifs, such as
the Helix-Turn-Helix, Leucine Zipper, Helix-Loop-Helix and Zinc Finger domains (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M.
et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
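The hydrogen-bonding pattern of the 3₁₀ helix can be contrasted with that of the standard α helix, where the CO of residue i bonds to the NH of residue i + 4. The sketch below simply enumerates the bonded residue pairs for a given offset; the function name is hypothetical.

```python
def backbone_hbond_pairs(n_residues: int, offset: int):
    """Pairs (i, i + offset): CO of residue i hydrogen-bonded to NH of residue i + offset.

    Residues are numbered from 1. offset=3 gives the 3-10 helix pattern,
    offset=4 gives the alpha helix pattern.
    """
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


if __name__ == "__main__":
    print("3-10 helix :", backbone_hbond_pairs(8, 3))
    print("alpha helix:", backbone_hbond_pairs(8, 4))
```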
The tertiary protein structure refers to the three-dimensional arrangement of a protein’s polypeptide chain. It relates the spatial
parameters of one secondary structure element to those of the other secondary structure elements in the chain and describes how they
fold together. The tertiary structure primarily relates to the interactions between numerous domains/motifs through hydrogen bonding,
hydrophobic interactions, electrostatic interactions and van der Waals forces (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M.
et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
Domains are classified as all α domains (containing only α-helices), all β domains (containing only β-sheets, in parallel or
anti-parallel orientation, such as the Greek Key motif), α+β domains (containing both α and β regions) and α/β domains (containing
β-α-β motifs). The all α, all β and α/β classifications are incorporated into the CATH domain database, whilst α+β is not kept as a
separate class because there is extensive overlapping of structures in this class. Structural alignment is also used as a structural
database tool to determine and “classify” the domains of an unknown protein structure, see figure 2.4 (Branden, C. et al. 1991; Zubay,
G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
Figure 2.4 - Tertiary Structure: shows the interactions that take place in the secondary structure in forming the tertiary structure.
ADAPTED FROM: Department of Biology, University of Winnipeg [3].
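The four domain classes described above can be expressed as a very rough decision rule. The sketch below assumes a domain is summarised as an ordered string of elements, with 'H' for each helix and 'E' for each strand; this encoding and the "EHE" test for a β-α-β unit are simplifying assumptions, not the procedure used by CATH or SCOP.

```python
def classify_domain(elements: str) -> str:
    """Classify a domain from its ordered secondary structure elements.

    `elements` uses 'H' for each alpha helix and 'E' for each beta strand,
    for example "EHEHE" for strand-helix-strand-helix-strand.
    """
    if not elements or set(elements) - {"H", "E"}:
        raise ValueError("elements must be a non-empty string of 'H' and 'E'")
    if "E" not in elements:
        return "all alpha"
    if "H" not in elements:
        return "all beta"
    # A beta-alpha-beta unit appears as the substring "EHE".
    if "EHE" in elements:
        return "alpha/beta"
    # Helices and strands are both present but segregated along the chain.
    return "alpha+beta"


if __name__ == "__main__":
    for domain in ("HHHH", "EEEE", "EHEHE", "HHHEEEE"):
        print(domain, "->", classify_domain(domain))
```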
The quaternary structure describes the structure of proteins that contain numerous subunits (multiple polypeptide monomers). It is
essentially the arrangement of all the monomeric units within the three-dimensional structure of the protein; the best example is
Hemoglobin, which is composed of four monomeric units. Monomeric units are either identical (homo) or different (hetero); a
multimeric/oligomeric protein with identical monomeric units is therefore called a “homomer”, whilst a protein with different
monomeric units is called a “heteromer” (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005;
Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
3. Protein Structural Resources & Databases
The main aim of the genome project was to map the sequence and physical parameters of genomes; bioinformatics made it possible to
computerize, archive and retrieve sequence and structural information. There now exist several databases which contain sequence
information for nucleic acids and peptides as well as a detailed catalogue of protein structures. There are two main types of
database, which are primary and secondary databases (Attwood, T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003;
Lesk, A. et al. 2005).

Table 01: Genomic Databases
Primary Databases               Secondary Databases
Nucleic Acid    Protein         Protein
EMBL            PIR             PROSITE
GenBank         MIPS            Pfam
DDBJ            SWISS-PROT      SCOP
                TrEMBL          CATH
                NRL-3D          Protein Databank
Primary databases are archival: they contain unprocessed sequence data derived from experimental analysis. There are three main
nucleotide databases, which are the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and the American
GenBank at NCBI. A primary database mainly contains sequences of nucleotides/peptides, which are written using letter codes, in
which the most common format is CDS. Protein sequence databases include Swiss-Prot and PIR, and genome/nucleotide sequence databases
include GenBank and DDBJ. The primary protein sequence databases include UniProt (Universal Protein Resource), which is a
wide-ranging catalogue of protein information. The information created, maintained and contained within the catalogue comes from the
databases Swiss-Prot, TrEMBL and PIR. UniProt/TrEMBL can be accessed and used through the ExPASy (Expert Protein Analysis System)
Proteomics Server. UniProtKB/Swiss-Prot is the curated protein database, whereas UniProtKB/TrEMBL is its computer-annotated
supplement, which contains the translations of all EMBL nucleotide sequence entries that are not yet integrated into Swiss-Prot;
TrEMBL contains 232,345 entries (Attwood, T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005).
Secondary databases are curatorial: they contain information derived from the primary databases, together with a review of the
relevant information. They hold structural and sequence information about sequences that have been extensively analysed. A secondary
sequence database contains information such as the conserved sequences, signature sequences and active site residues of protein
families, arrived at by multiple sequence alignment of a set of related proteins. The entries include data about the secondary
structures of proteins, which are classified by motif and domain structure, such as all alpha proteins, all beta proteins, etc. There
are several structural databases formed and maintained by individual laboratories; these include SCOP (Cambridge University), CATH
(University College London), PROSITE (Swiss Institute of Bioinformatics) and eMOTIF (Stanford). Secondary protein databases are
pattern databases, which use multiple alignments of homologous sequences, and there are several of them, each storing different
information about protein structures. The first is PROSITE, which uses the primary database SWISS-PROT as its major source of
information; the patterns in PROSITE entries are short. Pfam is a database that holds a large number of multiple sequence alignments
and uses hidden Markov models to create protein family or domain signatures. Two forms of alignment are created and stored: Pfam-A
alignments are edited by hand and are accurate, whereas Pfam-B is derived from automatic clustering of SWISS-PROT and is less
reliable. A third secondary database is SMART, the Simple Modular Architecture Research Tool, which expresses in visual terms the
domains that are present within protein sequences (Attwood, T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003;
Lesk, A. et al. 2005).
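Short PROSITE-style patterns lend themselves to a compact illustration, because their syntax maps closely onto regular expressions. The sketch below translates the common pattern elements into a Python regular expression; the example pattern is written in the style of the PROSITE C2H2 zinc finger signature and the test sequence is invented, so both should be treated as illustrative assumptions.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate PROSITE pattern syntax into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        # Anchors: '<' pins the N-terminus, '>' pins the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as x(2,4) or C(3).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # '[ABC]' is already valid regex; a single letter is a literal residue.
        count = "{" + repeat + "}" if repeat else ""
        regex_parts.append(prefix + core + count + suffix)
    return "".join(regex_parts)


if __name__ == "__main__":
    zinc_finger_like = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    regex = prosite_to_regex(zinc_finger_like)
    made_up_sequence = "MSCAACAAAFAAAAAAAAHAAAHK"
    hit = re.search(regex, made_up_sequence)
    print(regex, "->", hit.span() if hit else "no match")
```

Because such patterns are short, they can match by chance; this is one reason profile methods such as the hidden Markov models used by Pfam are also maintained.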
Structural classification databases enable the classification of the known three-dimensional structures of proteins and allow the
comparison of structural similarities and differences. In general, proteins with similar sequences and functions will adopt a similar
overall three-dimensional structure, predominantly around the crucial active site residues. There are two structural classification
database schemes, which are CATH and SCOP. In addition, there are composite databases, which draw on several primary databases and
so search multiple resources. There is a multitude of diverse composite databases, which use not only different primary databases
but also different search criteria. One such composite resource is the National Center for Biotechnology Information (NCBI), which
not only contains nucleotide and protein databases, hosted by massive and openly available computer servers, but is also linked to
OMIM (Online Mendelian Inheritance in Man) (Attwood, T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A.
et al. 2005).
4. Experimental Protein Determination
As mentioned in the introduction, the first experimental technique used to determine structure is X-Ray Crystallography, which
requires the purification of the protein and the growth of protein crystals. The protein crystals are aligned with a high level of
accuracy, mounted as rigid structures on a goniometer, and X-ray beams with a wavelength of 0.1 to 0.2 nm are then passed through
the crystal. On passing through the crystalline protein structure the X-rays scatter and reflect (the reflection can be determined
by the use of Bragg’s Law), which produces a diffraction pattern on a photographic film. The 3-Dimensional structure is determined
from the intensities of the dark spots on the film, known as “diffraction maxima”, which are recorded at different rotations of the
crystal and used to mathematically calculate and construct the 3-Dimensional structure using Fourier Transforms. X-ray crystal
structures are useful for accounting for the electronic and elastic properties of proteins, and they help to determine the chemical
interactions of compounds, see figure 4.1 (Branden, C. et al. 1991; Whitford, D. 2005; Hames, D. et al. 2005).
Figure 4.1 - X-Ray Crystallography: shows the “diffraction maxima” contained within the diffraction patterns. ADAPTED FROM: The Yale
Scientific Magazine [4].
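Bragg’s Law relates the X-ray wavelength, the reflection angle and the spacing between lattice planes as n·λ = 2·d·sin(θ). The following minimal sketch rearranges it to give the spacing d; the function name and the example values (a wavelength within the 0.1 to 0.2 nm range quoted above and an arbitrary angle) are illustrative assumptions.

```python
import math


def bragg_spacing(wavelength_nm: float, theta_deg: float, order: int = 1) -> float:
    """Lattice plane spacing d (nm) from Bragg's law: n * wavelength = 2 * d * sin(theta)."""
    sin_theta = math.sin(math.radians(theta_deg))
    if sin_theta <= 0:
        raise ValueError("theta must lie strictly between 0 and 180 degrees")
    return order * wavelength_nm / (2.0 * sin_theta)


if __name__ == "__main__":
    # Example values chosen for illustration only.
    d = bragg_spacing(wavelength_nm=0.154, theta_deg=30.0)
    print(f"d = {d:.3f} nm")
```

Smaller spacings diffract to larger angles, which is why the spots furthest from the centre of the pattern carry the finest structural detail.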
The second experimental technique is Nuclear Magnetic Resonance (NMR) Spectroscopy, which is used to determine relatively small
protein structures of about 30 kDa. With this technique the proteins are kept in aqueous solution and do not need to be crystallized;
instead of X-ray diffraction, it uses magnetic fields and radio frequencies to determine a protein’s structure. The structure is
determined from differences in the spin behaviour of the atoms in a peptide chain: each atom behaves differently depending on which
atom(s) it is bonded to and on the closest residues. A protein’s 3-Dimensional structure is determined from the magnitude of this
effect, from which the distances between atoms can be calculated and used to generate a structural representation. The result of NMR
Spectroscopy is the NMR spectrum (see figure 4.2), which shows signal peaks representing each compound/amino acid. The concentration
of each amino acid within a protein varies, and therefore the characteristics of each peak in the spectrum change (Branden, C. et al.
1991; Whitford, D. 2005; Hames, D. et al. 2005).
Figure 4.2 - NMR Spectroscopy: shows a graphical representation of Ubiquitin obtained from NMR. ADAPTED FROM: The Department of
Chemistry, Georgetown University [5].
The third experimental technique used in determining the structure of proteins is Cryo-Electron Microscopy. This technique is
primarily used to determine the 3-Dimensional structure of multi-subunit proteins as well as proteins that are not easily
crystallized. In this technique a protein sample is rapidly frozen in liquid helium and then examined under the cryo-electron
microscope using low-dose electrons. All images are then analysed using complex computer programs, which reconstruct the protein
structure as a 3-Dimensional image. The colourful wheel images are computer-generated models of the molecular structure of the
protein, superimposed over the electron micrograph where the proteins are located in the array, see figure 4.3 (Branden, C. et al.
1991; Whitford, D. 2005; Hames, D. et al. 2005).
Figure 4.3 - Cryo-Electron Microscopy: shows a micrograph of an enzyme obtained from Cryo-Electron Microscopy. ADAPTED FROM: The
Brookhaven National Laboratory [6].
5. Structural Modeling
Molecular modelling is used to predict the 3-Dimensional structure of a protein to a high degree of confidence when its sequence is
compared with those of proteins whose structures have been experimentally determined. There are many types of molecular modelling
technique which enable the prediction and determination of the three-dimensional structure of proteins. The 3-Dimensional structure
of a protein is very important since it contains the information about biochemical function through binding sites, catalytic activity
and the interactions between the protein and other molecules, such as interactions between proteins, between proteins and nucleic
acids (DNA/RNA), and between proteins and ligands. The identification and subsequent visualisation of ligand binding sites allow the
design and synthesis of new drugs, and several such projects are currently running. The size, shape and three-dimensional geometry of
each ligand can be visualised using software tools, which can also show the interaction between a protein and a ligand, known as the
“lock and key”. Within protein prediction, the structures of globular proteins are far more easily determined through computational
prediction, whilst experimental determination remains difficult for all proteins regardless of structural conformation (Baker, D.
et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
There are three main computational prediction techniques used to determine the structure of proteins. The first is comparative
(homology) modelling, which predicts an unknown protein’s structure from its sequence by aligning that sequence with the sequence of
a known, experimentally determined protein. A structure is then modelled based on the strength of the similarity between the unknown
sequence and the known sequence. The second is fold recognition (threading), which is used if there are no known homologues of the
unknown protein structure/sequence. This method compares the sequence of the unknown protein with known protein folds using a scoring
system, in which sections of the unknown sequence are scored against known folds. The third is the ab initio method, which determines
the structure from first principles without reference to known protein structures; the structure is modelled from
empirical/semi-empirical and experimental results using models of atoms and related molecules. The characterization of the functional
purpose of a protein is difficult, which is why accurate three-dimensional structures of proteins are produced using the modelling
methods mentioned above, which are described in detail below. The structure of a protein is determined in nature both by the laws of
physics and by evolution (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
Comparative or homology modelling uses known parent protein structures to build protein structures through sequence and structural
comparison techniques, in four main stages. A polypeptide sequence is first aligned with its parent polypeptide sequence as well as
with other homologous sequences of the same origin. A primary “framework” structure for the new polypeptide sequence is then formed
from the parent structure, additional loops, folds and extended structures are modelled, and the model is refined using side chain
geometry and packing. This technique is highly dependent not only on the accuracy of the sequence alignment but also on how closely
the sequences are related. Homology modelling identifies protein structures that are similar to the target protein through sequence
comparison. The quality of a homology model depends on the similarity of the protein structures, i.e. motif and domain (including
helical and strand) sequence similarity. The technique revolves around the idea that proteins can be related, can evolve to form
families and have a distinct origin, hence orthologous proteins. In terms of prediction method the four stages are: first, to
identify structural templates from a protein structure database; second, to use alignment tools to align the target with the
structural templates; third, to build a backbone; and lastly, to incorporate the side chains. The homology method determines sequence
similarity by aligning the sequences optimally. The aligned residues are then used to create a model. The better the comparison and
the higher the quality of the alignment, the better the accuracy of the model created. Another factor is the ability to detect
whether homologues exist (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
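The idea of aligning two sequences optimally can be sketched with the classic Needleman-Wunsch global alignment algorithm. This is an illustration only: the project does not name a specific algorithm, and the simple match/mismatch/gap scores used here are assumptions standing in for a real substitution matrix and gap model.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("MKTA", "MTA")
    print(total)
    print(top)
    print(bottom)
```

In practice the residues paired in such an alignment decide which template coordinates are copied into the model, which is why alignment errors propagate directly into the structure.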
This modelling technique is best suited to sequences with 50 percent or more similarity, but inaccuracies can appear during the
extrapolation and detailing of side chain positions, as well as in the insertion of loops and extended structures where a segment of
the sequence does not match the parent sequences and structures. Homology modelling requires an unsolved sequence to be input
together with a template “parent” sequence that has a high level of identity and whose structure has been determined through X-ray
crystallography or NMR; the two sequences are then aligned. Where the unknown sequence and the template sequence are identical when
aligned, both backbones are taken to have identical α-carbon positions as well as identical phi and psi angles and secondary
structure. Template sequences with high identity can be searched for using a tool such as BLAST, and the hits can then be compared
with structural/sequence information from the Protein Data Bank. BLAST reports the level of similarity of sequences, including the
“E-Value” and “P Value (probability)”, in which low values of P suggest that the matches are biologically significant. The Protein
Data Bank contains detailed structural information on proteins that have been determined through either X-ray crystallography or
Nuclear Magnetic Resonance. The unknown sequences can be compared with template sequences using SWISS-MODEL (Baker, D. et al. 2001;
Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
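The 50 percent guideline can be checked once two sequences have been aligned. The sketch below computes percent identity over the aligned, non-gap columns; several definitions of identity are in use, so this particular choice, the function names and the example sequences are assumptions for illustration.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percentage of aligned (non-gap) columns in which both residues are identical."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def suitable_for_homology_modelling(identity: float, threshold: float = 50.0) -> bool:
    """Apply the 50 percent guideline quoted in the text (threshold is adjustable)."""
    return identity >= threshold


if __name__ == "__main__":
    identity = percent_identity("MKT-AYI", "MKTWAFI")
    print(f"{identity:.1f}% identical, suitable: {suitable_for_homology_modelling(identity)}")
```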
The second method, fold recognition (threading), is used instead of comparative (homology) modelling when there are no known
homologous structures that match the sequence. This type of modelling works on the basis that 3-D structures are conserved and that
protein sequences often adopt similar folds even without detectable similarity in sequence and/or function. Fold recognition aligns
and scores the unknown sequence against a complete library of structural templates. It compares not only the way the templates fold
but also how well each structure would fit the sequence, detecting similarities between the modelled sequence and known structure(s).
This can only be done if at least one of the proteins in a protein family has been experimentally determined through X-ray
diffraction, Nuclear Magnetic Resonance (NMR) or Cryo-Electron Microscopy (cryo-EM), so that the undetermined or unknown
protein/sequence can be aligned with the known structure; as stated above, this is also required for homology modelling (Baker, D.
et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
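Scoring a sequence against a library of folds can be illustrated with a deliberately simple toy model. In the sketch below each template is reduced to a string of environments ('B' for buried, 'E' for exposed), hydrophobic residues are rewarded in buried positions and polar residues in exposed ones, and no gaps are allowed. The scoring scheme, the residue grouping and the template names are all hypothetical; real threading programs use far richer potentials.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed grouping, for illustration only


def position_score(residue: str, environment: str) -> int:
    """+1 when residue type suits the environment ('B' buried, 'E' exposed), else -1."""
    buried = environment == "B"
    hydrophobic = residue in HYDROPHOBIC
    return 1 if buried == hydrophobic else -1


def thread_score(sequence: str, template_env: str) -> float:
    """Best ungapped score of the sequence placed anywhere along one template."""
    if len(sequence) > len(template_env):
        return float("-inf")
    best = float("-inf")
    for offset in range(len(template_env) - len(sequence) + 1):
        total = sum(
            position_score(residue, template_env[offset + k])
            for k, residue in enumerate(sequence)
        )
        best = max(best, total)
    return best


def rank_folds(sequence: str, library: dict):
    """Return (score, template name) pairs sorted from best to worst fit."""
    return sorted(
        ((thread_score(sequence, env), name) for name, env in library.items()),
        reverse=True,
    )


if __name__ == "__main__":
    fold_library = {"fold_alternating": "BEBEBE", "fold_buried_core": "BBBEEE"}
    print(rank_folds("LKVEIK", fold_library))
```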
Ab initio modelling predicts the 3-D structure of a protein using the thermodynamic hypothesis of protein folding, under which the
native structure of a protein sequence corresponds to its global free energy minimum state. There are several different approaches to
this kind of prediction, one being Rosetta, and such methods are evaluated in CASP (Critical Assessment of Techniques for Protein
Structure Prediction). Ab initio methods are used to predict the protein structure when no templates are found for homology
modelling. The method proceeds in several steps: the first is to define a representation of the protein structure and its
conformational space, the second is to represent the energy functions for the protein structure, and the third is to minimize those
energy functions. The final minimal energy conformation is considered to be the actual structure of the native protein in its normal
surroundings. The folding of the protein is dictated by the physical forces that act between atoms in every molecular structure. De
novo modelling methods assume that the native structure corresponds to the global free energy minimum accessible during the lifespan
of the protein, and attempt to find this minimum by exploring many conceivable protein conformations. The two key components of de
novo methods are the procedure for efficiently carrying out the conformational search and the free energy function used for
evaluating possible conformations (Baker, D. et al. 2001).
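The two components named above, a conformational search and an energy function, can be shown with a toy Metropolis Monte Carlo search. In this sketch a conformation is just a list of torsion angles and the energy function is entirely made up (it is lowest when every angle is near -60 degrees); neither reflects Rosetta or any real force field.

```python
import math
import random


def toy_energy(angles):
    """Hypothetical energy: zero when every torsion angle equals -60 degrees."""
    return sum(1.0 - math.cos(math.radians(angle + 60.0)) for angle in angles)


def monte_carlo_minimise(n_angles=10, steps=5000, temperature=0.1, seed=0):
    """Return (start_energy, best_energy, best_angles) from a Metropolis search."""
    rng = random.Random(seed)
    current = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    current_energy = toy_energy(current)
    start_energy = current_energy
    best, best_energy = list(current), current_energy
    for _ in range(steps):
        # Propose a small random change to one torsion angle.
        trial = list(current)
        index = rng.randrange(n_angles)
        trial[index] += rng.gauss(0.0, 20.0)
        trial_energy = toy_energy(trial)
        delta = trial_energy - current_energy
        # Metropolis criterion: always accept downhill moves, sometimes uphill ones.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = trial, trial_energy
            if current_energy < best_energy:
                best, best_energy = list(current), current_energy
    return start_energy, best_energy, best


if __name__ == "__main__":
    start, lowest, _ = monte_carlo_minimise()
    print(f"energy before search: {start:.3f}, lowest energy found: {lowest:.3f}")
```

Accepting occasional uphill moves lets the search escape local minima, which matters because the hypothesis concerns the global minimum rather than the nearest one.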
6. Structural Genomic Databases
The main purpose of genomic sequence and structural databases is to be an archive which is compact, durable and standardised. These
databases have two main functions: the first is the retrieval of sequences that have been submitted directly into the database, and
the second is the interpretation of each sequence by assigning functions to each section of the sequence and eliminating artefacts.
All the databases currently available can be accessed via the web. As stated throughout this project, there are several databases
which are used to determine not only the structure of proteins but also their functions, such as the hierarchy of conservation of
each protein, which includes Hydrophobic Packing, Active Sites and Surface Residues; Amino Acid Propensity; Globular & Functional
Domains; Peptide Backbone Conformation and Amino Acid Packing. The types of database reviewed above store different forms of
information, depending on the manner of the Database Storage or the nature of the Information Stored (see Table 02: Genomic Databases
Information) (Attwood, T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005).

Table 02: Genomic Databases Information
                      Database Storage                         Information Stored
Type of Data Stored   Flat Files                               Sequence
                      Relational Oriented Database (Tables)    2 Dimensional Structure Images
                      Object Oriented Database (Images)        3 Dimensional Images
As discussed above, genomic databases hold information in several formats, and there are different types of databases. These are used
not only to store information but also to determine and visualise a variety of aspects relating to structural and sequence
information. The general idea of having secondary and tertiary/composite genomic databases is to enable the high-throughput
determination of three-dimensional structures and the analysis of other “biological” macromolecules. This includes the classification
and structural make-up of single protein domains, as well as the determination of the relationship between polypeptide sequences and
protein structures across a range of different proteins. It also includes the classification of protein families, folds and
super-families, as well as the detailing of taxonomies (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003;
Lesk, A. et. al. 2005).
Protein structures and relationships are visually classified using the SCOP database, which is web accessible and maintained by the MRC.
It classifies globular proteins using several hierarchical levels: class, fold, super-family and family. The class refers to the
general structural architecture of each domain. The fold refers to the similarities and common aspects between secondary structures which
exhibit the same topology regardless of evolutionary origin. The super-family refers to proteins which have low sequence identity but
exhibit related structural and functional similarities. Proteins are placed into a defined family if there is more than 30 to 50 percent
sequence similarity or identity. The structural classes which are encompassed within the SCOP database include (Attwood T.K. et. al.
1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005):
o Mainly α.
o Mainly β.
o Alternating α-β (either α+β or α/β).
o Multi-Domain proteins.
o Membrane & Cell surface proteins or Peptides.
A second visual protein classification database is CATH (Class, Architecture, Topology, Homology), which is accessible via the web
and is maintained by UCL. This classification system relies to a greater extent on automated methods and only uses manual
inspection techniques when automated methods do not obtain results. Within this classification database there are five separate levels of
classification, which are: Class, Architecture, Topology, Homology and Sequence. The class of a protein is determined from its
secondary structure, through the packing of α-helices and β-sheets in the formation of domains. There are four types of
packing, which are (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005):
o Mainly α
o Mainly β
o Alternating α-β (either α+β or α/β)
o Limited Secondary Structure
Figure 6.1 - Domain Classes: Shows domains containing different secondary structures, being either all α, all β or a mixture of α and
β. ADAPTED FROM: The Principles of Protein Structure '97, Birkbeck College - University of London [9].
The second level, Architecture, refers to the specific arrangement of the secondary structures, independent of what connects them, and
also describes the motifs that are created. The Topology level describes the overall shape and how the secondary structures connect. This
is done through structural comparison and clustering of domains, where 60 percent or more of the protein structure has to be identical.
The Homology level groups domains that have greater than 35 percent sequence identity. The fifth level of the hierarchy, Sequence,
matches structures with greater than 35 percent sequence identity. Both classification systems, SCOP and CATH, use manual inspection
techniques as well as automated methods. These are used to differentiate between structurally similar analogues and homologues. Manual
techniques are used to separate protein groups which exhibit similar structures and functions. Both SCOP and CATH are effective ways of
deriving the specific structure and function of proteins and the relationships between them. The two databases enable structural
alignments to be carried out and functional inferences to be made. They also allow common sequence features of certain topologies to be
visualised (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
The Protein Data Bank (PDB) is a specific protein structure database, which was established at Brookhaven National Laboratory. Its
primary function is the submission and retrieval of protein structures, and it offers 3D structural data for nucleic acids,
carbohydrates and polypeptides. The structures that are submitted to and obtained from this vast database are mostly experimentally
determined through X-Ray Crystallography and NMR, and can be accessed by the public through the World Wide Web. The PDB files give a
full description of the 3-Dimensional structure of each protein contained within the database, along with other molecules such as water,
drug compounds and ions. The files come in a column oriented text format. Two newer file formats have been created, mmCIF and MMDB, and
both contain data description languages. Each file contains the atomic coordinates of each atom along with annotations, comments and
experimental details (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 6.2 - Protein Databank Main Page: ADAPTED FROM - Department of Bioinformatic & Life Sciences, Soonsll University [10].
Figure 6.3 - Protein Databank Search: ADAPTED FROM - Cyberinfrastructure Technology Watch (CTWatch) [11].
Figure 6.4 - Protein Databank Search Results: ADAPTED FROM - Chemistry Department, California Polytechnic State University [14].
Figure 6.5 - Protein Databank Text Format: ADAPTED FROM - School of Molecular and Microbial Sciences, University of Queensland [15].
The PDB database, as depicted in the figures above, contains not only information-rich “text” data on each sequence but also 3D
structural images, which can be viewed through visualisation software that is attached to each protein record in the database. There are
two types of information contained within this database on each sequence, “implicit” and “explicit”, which enable the construction of a
three dimensional protein structure. Each “protein” record that is contained within the database is identified by a four character code.
The structural data of bio-molecules such as proteins can be visually represented through software such as: VMD, RasMol,
PyMOL, Jmol, MDL Chime and MBT Protein Workshop/Simple Viewer (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane,
D.E. et. al. 2003; Lesk, A. et. al. 2005).
NCBI hosts one of the largest and most accessible collections of databases, which can be accessed by the public through the World Wide
Web. One of the database systems it hosts is Entrez, which provides structural information on proteins by searching and compiling data
from sources such as SwissProt, PIR, PRF, PDB, and translations from GenBank and RefSeq. Entrez is a Global Cross-Database Query system,
a powerful search engine that allows a user to search and retrieve structural, sequence and reference information from each database
contained in or linked to the NCBI website. It allows the viewing of both gene and protein sequences along with chromosome maps, and it
integrates information from scientific literature, DNA & Peptide sequence databases, 3D Protein Structure & Domain Data and taxonomic
information to create a highly adaptive and connected system of information. This report will first focus on Entrez's Structure Index,
which is an NCBI page that specifically relates to the visualisation and retrieval of the 3-Dimensional structure(s) of each protein
(Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 6.6 - Entrez Database: Showing the database browsers. ADAPTED FROM: Biological Research Computer Hierarchy (BIRCH),
University of Manitoba [13].
Each structure query lists all the results with a PDB number and name, as well as a generalized description of the protein in each
result. A list of links accompanies each query result: to the MMDB Structure Summary page and to the Entrez 3D Domains Index,
Protein/Nucleotide Index, PubMed Citations Index & Taxonomy Index. The Entrez 3D-Domain Index page of the NCBI website is used to
retrieve 3D domain information from domain queries. Each query returns a list of domain names, including general descriptions of the
structure of each domain. Each 3-Dimensional Domain result is linked to an MMDB Structure Summary page as well as to a VAST 3D-Domain
Neighbours Summary, along with links to the Entrez Structure Index, Entrez Protein/Nucleotide Index, Entrez PubMed Citations Index
and/or Entrez Taxonomy Index. The Entrez database, like the Protein Data Bank, contains experimentally determined three-dimensional
structures, which were determined through either X-ray crystallography or NMR spectroscopy (Attwood T.K. et. al. 1999; Baxevanis,
A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
VAST (Vector Alignment Search Tool) is a tool that enables the user to search for and locate structural similarities between protein
domains, by searching through the 3-Dimensional Domain Database and locating similar arrangements of secondary structure. Each secondary
structure unit of a domain is represented in the VAST database as a vector; these vectors are then aligned, and the alignment can be
performed on inputted Protein Data Bank files. The current Entrez database connects each of the 3-Dimensional domains to a list of
polypeptides, which are also linked to related “homologous” 3-Dimensional domains. Each 3-Dimensional vector element is derived solely
from a protein’s secondary structure; no sequence information is used during a search, so VAST is able to detect similarities between
structures even without sequence similarity. VAST as an alignment and database search tool is useful in the investigation of the
relationship between the 3-Dimensional structures of proteins, in particular with the use of SCOP (Attwood T.K. et. al. 1999;
Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
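The idea of representing a secondary structure element as a vector and comparing elements geometrically can be sketched as follows. This
is only an illustration of the vector representation described above, not the VAST algorithm itself; the function names and the use of
start and end coordinates to define a vector are assumptions.

```python
import math

def sse_vector(start, end):
    """Vector from the first to the last coordinate of a secondary structure element."""
    return tuple(e - s for s, e in zip(start, end))

def angle_between(v1, v2):
    """Angle in degrees between two 3-Dimensional vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    # Clamp to [-1, 1] to guard against floating point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))
```

Two domains whose helices and strands have similar relative angles and positions would be candidates for structural similarity, even
when their sequences differ.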
7. 3-Dimensional Structural Data
All 3-Dimensional data is recorded as a ball and stick model, which includes the details and dimensions of each atom. This can firstly
be obtained from the sequence of either nucleic acids or peptides, by determining and drawing a 3-dimensional model of the backbone of
any given sequence. Each polypeptide sequence is always read from the N-terminus, also known as the amino-terminus, NH2-terminus or
amine-terminus, and each residue is compared against the structural composition, conformation & orientation of the twenty most common
amino acids using a “residue library”. The chemical structure of the polypeptide sequence is recorded, and the 3-Dimensional structural
data is then established through the measurement (in angstroms) of the position of each atom, starting from the N-terminus. Through
this, the coordinates of each atom of the polypeptide chain on the x, y and z axes are calculated and recorded. Each structural database
stores (archives) and maintains such records of each protein molecule, which can then be retrieved through a publicly accessible web
based server, see Figure 6.5 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Regardless of the file format the information is stored in, each format contains structural coordinates, which give the spatial
location of each individual atom within a protein along the x, y and z axes. In addition, each atomic coordinate is labelled with
the element, residue and molecule to which the coordinate belongs; this labelling is known as a “chemical graph”. Chemical graphs, like
the creation of ball and stick models, use a residue library of the twenty most common amino acids, which also contains tables of atom
types and bond information (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
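A chemical graph of the kind described above can be sketched as a small data structure in which each atom carries its coordinates and
labels, and bonds are stored as pairs of atoms. The class and field names below are hypothetical and chosen only for illustration; they
do not correspond to the schema of any particular database.

```python
from dataclasses import dataclass, field

@dataclass
class Atom:
    serial: int      # unique atom number within the structure
    element: str     # chemical element, e.g. "C"
    residue: str     # residue the atom belongs to, e.g. "ALA"
    chain: str       # molecule or chain label, e.g. "A"
    xyz: tuple       # (x, y, z) coordinates in angstroms

@dataclass
class ChemicalGraph:
    atoms: dict = field(default_factory=dict)   # serial -> Atom
    bonds: set = field(default_factory=set)     # frozenset({serial_a, serial_b})

    def add_atom(self, atom):
        self.atoms[atom.serial] = atom

    def add_bond(self, a, b):
        if a not in self.atoms or b not in self.atoms:
            raise KeyError("both atoms must exist before they can be bonded")
        self.bonds.add(frozenset((a, b)))

    def neighbours(self, serial):
        """Serial numbers of all atoms bonded to the given atom."""
        return sorted(s for bond in self.bonds if serial in bond for s in bond if s != serial)
```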
Molecular Visualization relies heavily on computer graphics, but also on computational prediction techniques and modelling, as outlined in
Section 4 of this document. All of the specialised software packages that are used perform the same function of creating 3-Dimensional
pictures that can be rotated and altered to show specific peptides, sequences and structures within a protein. Each visual
representation of a protein is produced by “connecting the dots”, using one of two “minimalist” approaches to the storage of information
about the bonding between atoms/molecules; in both, the physical rules of chemistry are always observed. The first approach is the
“legacy approach”, also known as the “chemistry rules approach”. This approach does not use residue dictionaries, only bond length and
bond type dictionaries; all visualisation software used in structural databases to graphically express the data from PDB data files uses
this approach. The second approach is that of the “Molecular Modelling Database (MMDB)”. This approach derives the graphical
representation of any 3-Dimensional protein structure by using not only the data contained within PDB files but also standard residue
dictionaries (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
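The “chemistry rules approach” of connecting the dots from bond lengths alone can be sketched as below. Two atoms are treated as bonded
when their distance is no greater than the sum of their covalent radii plus a tolerance. The radii are approximate published values, and
the 0.4 angstrom tolerance and the function name infer_bonds are assumptions made for illustration; individual viewers use their own
tables and cutoffs.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values only).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

def infer_bonds(atoms, tolerance=0.4):
    """atoms: list of (element, (x, y, z)) tuples; returns a list of bonded index pairs."""
    bonds = []
    for (i, (el_a, pos_a)), (j, (el_b, pos_b)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADIUS[el_a] + COVALENT_RADIUS[el_b] + tolerance
        if math.dist(pos_a, pos_b) <= cutoff:
            bonds.append((i, j))
    return bonds
```

The sketch compares every pair of atoms, which is adequate for a small example but too slow for whole proteins, where a spatial grid
would normally be used to limit the comparisons.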
8. Structural Visualisation
There are several styles and software packages that are used to graphically depict protein structures. This is primarily due to the need
to visualize particular aspects of a protein's structure. The main source of information used to create such graphical representations
is PDB data files, from which surplus regions are frequently edited out to enable a user of molecular visualization software to
visualize what they want. There are a number of graphical outputs, which are (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005;
Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005):
o Wireframe Model format: Details the chemistry of a molecular structure,
see figure 8.1.
o Space-filled Model format: Details the size and surface of molecular structure,
see figure 8.2.
o Ribbon Model format: Details the organisation and path of secondary structure
elements and enables the identification of secondary structures in complex topologies,
see figure 8.3.
Figure 8.1 - Wireframe Model of a GFP molecule: ADAPTED FROM - Center for BioMolecular Modeling (CBM), Milwaukee School of Engineering [7].
Figure 8.2 - Space-Filled Model of an HLA-A2.1 molecule: ADAPTED FROM - Department of Crystallography, Birkbeck College [8].
Figure 8.3 - Ribbon Model of an HLA-A2 molecule: ADAPTED FROM - Department of Crystallography, Birkbeck College [8].
Structural file formats come in three different forms: pdb, mmCIF and MMDB. The pdb file format is a column oriented text file format
that describes the three dimensional structures of molecules. Each pdb file contains an extensive, high level description of a protein’s
properties, with hundreds of lines of information about atoms and their coordinates as well as the sequences of amino acids contained
within a protein (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
The first file format is the pdb file format. There are several sections in a normal pdb file, beginning with the HEADER, which specifies
the pdb id code, the TITLE, which contains the name of the protein, and AUTHOR, which lists the contributors and researchers; these
records appear first in a pdb file. The ATOM section is a record of each atom that is part of the protein and lists each atom’s atomic
coordinates (x, y and z). The HETATM section is a record of the hetero-atoms and, like ATOM, lists the atomic coordinates (x, y and z) of
each hetero-atom. Hetero-atoms are not part of the protein chain itself. The SEQRES section is a record that lists the details of the
primary protein sequence and the peptide chains, which are denoted A, B, C and so on within a single protein. The REMARK section contains
standardized information, annotations and remarks about the protein structure (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005;
Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
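Because the pdb format is column oriented, the ATOM and HETATM records described above can be read by slicing each line at fixed
character positions. The sketch below follows the standard column layout of those records (serial number in columns 7-11, atom name in
13-16, residue name in 18-20, chain in 22, residue number in 23-26 and the x, y, z coordinates in 31-54); the function names and the
dictionary keys are assumptions for illustration, and the sketch ignores fields such as occupancy and temperature factor.

```python
def parse_atom_line(line):
    """Parse one ATOM or HETATM record using fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),     # "ATOM" or "HETATM"
        "serial": int(line[6:11]),       # atom serial number
        "name": line[12:16].strip(),     # atom name, e.g. "CA"
        "res_name": line[17:20].strip(), # residue name, e.g. "MET"
        "chain": line[21],               # chain identifier, e.g. "A"
        "res_seq": int(line[22:26]),     # residue sequence number
        "x": float(line[30:38]),         # coordinates in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

def read_atoms(lines):
    """Return parsed records for every ATOM and HETATM line, skipping all other records."""
    return [parse_atom_line(l) for l in lines if l.startswith(("ATOM  ", "HETATM"))]
```

Records such as HEADER, TITLE, SEQRES and REMARK are skipped by read_atoms, so the same function can be applied to the lines of a
complete pdb file.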
The second file format is the Macromolecular Crystallographic Information File (mmCIF), a file format made up of several tokens that
include data blocks. This file format is derived from the Crystallographic Information File (CIF). The mmCIF file format is backed by a
macromolecular CIF dictionary, in which each item of data is matched to an entry in the dictionary, allowing sequence validation to
occur. It contains space group and unit cell parameters as well as atomic coordinates, like the pdb file. Within mmCIF each id is an
integer, and the files hold the same information as pdb files; all data names are case sensitive, and information derived from the
primary coordinate data is included, allowing less ambiguity to occur (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005;
Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
The third file format is the MMDB file format, which uses the ASN.1 standardized data description language, which borrows
characteristics from other data descriptions for describing such things as references and citations. This file format is stored as either
text or binary files, which enables the representation of complex data types. All notations used for describing data are transferred
using telecommunication protocols, allowing descriptive, atomic coordinate and sequence data to be physically transmitted through phone
lines for access through the web. This format is used by NCBI to store GenBank, PubMed and MMDB (Attwood T.K. et. al. 1999;
Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 8.4 - RasMol: Shows the structure of Thymidylate Synthase (PDB ID: 2TSC), seen through the RasMol Viewer Software. The RasMol
generated image highlights helices (orange) and sheets (green). ADAPTED FROM: the Protein ChemCards, Bioinformatics Courses and
Lectures [21].
There are a number of different software packages which are used not only to examine the molecular structures of proteins but also to
display structural information and produce high resolution 3-Dimensional pictures. The programs reviewed here are able to interpret
protein databank data and are mostly Java based, excluding RasMol (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane,
D.E. et. al. 2003; Lesk, A. et. al. 2005).
First this document will look at the RasMol viewer, a molecular visualization program which is one of the most widely used and most
popular packages, and is seen as the most accurate. RasMol uses chemical graphs and pdb files; it does not validate either of these
against the residue library or perform alignments of inherent sequences. RasMol as a molecular visualisation tool recalculates
information and edits out inconsistencies, and is able to use mmCIF formatted files. RasMol is a free open source program, which requires
a toolkit library to enable it to create interactive visual representations of proteins. RasMol images contain information such as the
types of components, atom serial number, atom name and coordinates of each atom, which are standardized and are expressed identically
each time they are used, see figure 8.4 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et.
al. 2005).
Figure 8.5 - Cn3D: Shows the structure of the SRY protein, through the Cn3D Viewer Software. The Cn3D generated image highlights the
DNA strands' backbone in blue and brown, whilst the protein alpha helices are in green & loops are in light blue. ADAPTED FROM:
Institute of Biology and Department of Medical Genetics, Charles University [22].
Cn3D, like RasMol, is a 3-Dimensional protein structure viewer, which is specifically used to read, translate and view the
3-Dimensional structure encoded within MMDB data records. Explicit bonding information can be used, since the records are without errors
or unknown chemical graph expressions, enabling a fuller and more reliable 3-Dimensional expression of protein structures. Cn3D is far
more dependent on the more complete chemical graph expressed through the ASN.1 language used in MMDB files. It is able to animate each
structure and can run structural alignments through the use of VAST, see figure 8.5 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al.
2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 8.6 - Jmol: Shows the structure of Hemoglobin. The Jmol generated image highlights the backbone using trace colors and the
heme groups through spacefill. ADAPTED FROM: Screen Shots, taken from the Jmol online Webpage [23].
Jmol, like RasMol, is a molecular viewer used in bioinformatics, biochemistry and chemistry, and like RasMol it is a free open source
program. Jmol is Java based and is a multi platform program that is able to run on Windows, Mac, Linux and Unix systems, which makes it
versatile to use and easy to incorporate into other Java applications; it can also be used and accessed via the web. The Jmol program is
able to read a variety of molecular file formats, including pdb, cif, mol and cml, see figure 8.6 (Attwood T.K. et. al. 1999;
Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Another molecular visualization tool is the Molecular Biology Toolkit (MBT). It is not a free standing program and requires additional
applications for it to be run, but like Jmol and RasMol its libraries are arranged in a hierarchical form, which restricts the effects of
classes on molecular components within a graphical protein structure. 2-Dimensional and 3-Dimensional graphical images and
representations can be created and displayed using MBT along with Java3D (Attwood T.K. et. al. 1999;
Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
RasMol and Chime, along with some other molecular viewing tools, use scripting, although MBT does not. Scripting is a method of
executing commands on variables and acts as a portal to the running of methods and compilation. Scripting allows the menus within
graphical viewing programs to be used to make changes to the 2-Dimensional and 3-Dimensional graphical images of proteins. Scripting also
allows complex command sequences to be easily remembered and made accessible to the user (Attwood T.K. et. al. 1999; Baxevanis, A.D. et.
al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
9. Evaluation
As summarised in the abstract and the introduction, the scope of information that is available about protein structures obtained from
Genome Sequencing Projects is vast. Bioinformatics is a relatively new area of science, which consists of a wide range of both scientific
and computational aspects, involving the experimental determination of protein structures and the analysis of biological sequence
information from DNA, RNA and peptides, as well as the recovery of evolutionary patterns within proteins, the prediction of gene function
and the mining of biological data using high powered computational methods. As mentioned within this document, the specific section of
bioinformatics that deals with the determination, analysis, retrieval and representation of protein structural information is known as
"Structural Genomics". The “Primary” (Traditional) meaning of Structural Genomics is the characterization of the physical structure of a
complete genome through the use of gene mapping and sequencing, such as through the Human Genome Project and the subsequent genome
projects for organisms such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans. The modern
meaning of structural genomics is the determination of three-dimensional protein structures through the use of genome
sequencing projects. Structural Genomics has two approaches available for obtaining protein structures, which can then be
submitted to the structural databases.
The first approach focuses on the prediction of a protein’s structure from a set of related, already determined proteins, enabling the
complete visual representation of a range of protein folds and domain structures. This approach relies heavily on the idea that the
number of protein domains, folds and extended structures at the Secondary and Tertiary levels of protein organisation is limited. This
approach uses computer-based methods; for these methods to create a functional, accurate and representative computer designed
illustration of a protein, there has to be a high degree of confidence and similarity between the undetermined sequence and the
determined protein sequence. Advances in programming and computer technology should improve not only the success of comparative
computational modelling but also the accuracy of the illustration & visualisation of computer modelled proteins. Computational
prediction and modelling methods are also dependent on the experimental determination of proteins and on the purification of proteins.
To enable greater ease and success in purification, proteins from hyper-thermophilic bacteria or archaea can be used, and the genetic
sequences that code for them can be easily replicated and cloned through recombination and transformation techniques using Escherichia
coli. The 3-Dimensional structures of proteins purified using this type of recombination are determined using either X-Ray
Crystallography or NMR spectroscopy. The greater the number and accuracy of determined proteins, the greater the representation of
domain structures and folds within protein structures, which extends the abilities of homology modelling (Baker, D. et. al. 2001;
Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
This approach relies on three broad structural prediction methods. The first is comparative modelling, which relies particularly on
protein families and on locating protein homologues using PDB files and sequences; the identified PDB homologue is then used as a
template. Comparative modelling, which is a similarity based template modelling technique, contains increasing errors as the similarity
between the sample protein sequence and the template sequence decreases. The errors which can occur include the divergence of peptide
side chains in core protein sequences, which is critical in regions of protein function such as ligand binding sites. Other errors which
occur in the alignment and sequence comparison involve the distortion or shift of aligned sequence regions, causing alternative protein
conformations in small localised regions outside of the aligned segments. These localised distortions can be up to 3 Å; this also
includes the effects of subunit packing. These effects can be minimized through the use of multiple alignments of the sample and
template sequences. Errors are more frequent in segments that are not aligned or do not have any template, which creates inaccuracies
within a model. The largest source of errors in Homology Modelling, however, is misalignment, particularly when the identity between the
sample sequence and the template sequence falls below 30 percent. To create a viable and accurate protein model, two conditions have to
be met: the first is correct alignment and the second is accurate modelling. There are two ways of aligning the sample sequence and the
template to reduce the level of errors in a model. The first is to use multiple alignments and the second is to “iteratively modify” the
aligned regions to enable the prediction of errors in a model (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R.
2002). The second structural prediction method is Fold Recognition (Threading) modelling. This is used where there is little or no
sequence similarity, and uses what is known about structural conformation, based on what is known about each amino acid and the
probability and preference each residue has for any one secondary structure. Fold Recognition (Threading) modelling determines an unknown
structure by how well its sequence fits certain known structural models (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999;
Samudrala, R. 2002). The third structural prediction method is “Ab initio prediction” modelling. In comparison with the other methods
mentioned in this document, it is used to build molecular models for any given sequence without using a template, by using minimal
energy functions and lattice models. The advantage of Ab initio modelling is that the mathematical calculations used in creating a
protein model are very accurate, through the use of the properties that best match the experimental data. However, it can only be used
for smaller molecules, and is usually used for molecules that contain 50 atoms (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999;
Samudrala, R. 2002).
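The 30 percent identity threshold mentioned above for comparative modelling can be made concrete with a short sketch. It computes the
percent identity of two already aligned sequences and flags a template as risky below the threshold. Counting identity only over columns
where neither sequence has a gap is one of several possible conventions and is an assumption here, as are the function names.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def template_is_risky(aligned_a, aligned_b, threshold=30.0):
    """True when identity falls below the threshold at which misalignment becomes likely."""
    return percent_identity(aligned_a, aligned_b) < threshold
```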
The second approach of Structural Genomics is experimental determination itself. There are three main methods for the experimental
determination of protein structures: X-Ray Crystallography, NMR spectroscopy and Cryo-Electron Microscopy. Each of these techniques is
time consuming, but each has its own advantages depending on what type of protein structure is being analysed. X-Ray crystallography
produces high resolution molecular representations at 2 Å, but X-Ray diffraction only produces visualisations of molecular structures
that are static. Structural representations which are produced using X-Ray diffraction do not indicate or help to explain function, and
structures in crystallised proteins such as surface loops are seldom detected, so that several protein structures are incomplete. This is
mainly due to the fact that X-Ray diffraction is highly dependent on electron density to produce the diffraction patterns needed to
determine the protein's structure (Branden, C, et. al. 1991; Whitford, D. 2005; Hames, D. et. al. 2005).
X-Ray Crystallography is also quite time consuming and crystals are often difficult to grow. NMR spectroscopy, by contrast, allows the
detection of structures such as surface loops in solution, removes the problem of static conformations and takes only a fraction of the
time. NMR allows not only the characterization of macromolecular structures but also of their intermolecular interactions, and combines
high spatial resolution with high temporal resolution. NMR also requires knowledge of the peptide sequence, but the protein does not have
to be in an ordered crystal; high concentrations of solubilised protein must however be available (NMR structures are therefore also
called solution structures). In biopolymers, the primary structure (sequence) logically breaks up the molecule into groups of coupled
spins, normally one or two groups per residue. This is true not only for proteins, but also for nucleic acids and polysaccharides. A
third technique which is used in the structural determination of proteins is Cryo-Electron Microscopy (Cryo-EM). This technique freezes
protein samples very rapidly to extremely low temperatures; the low temperatures and rapid freezing of a sample allow the formation of
highly ordered sheets that can produce resolutions of between 5 and 10 Å. The technique also enables the depiction of the quaternary
structure of a protein and the creation of extensive structural information. Cryo-EM samples, like NMR samples, are solution based, and
as with NMR the proteins appear in their natural conformation. Samples can be damaged when being blotted, but sample proteins are not
distorted by staining. Cryo-EM allows the sample protein to adhere to a grid in a preferential way. Cryo-EM resolutions can be fuzzy due
to the lack of absorption of the electron beam within the molecular structure, and, as with X-Ray Crystallography, sample preparation is
quite time consuming (Branden, C, et. al. 1991; Whitford, D. 2005; Hames, D. et. al. 2005; Heymann, J. B. et. al 2001; Heymann, J. B.
et. al 2007).
Table 03: NMR comparison with X-Ray Crystallography
NMR                                          X-ray crystallography
-------------------------------------------  -------------------------------------------
short time scale, protein folding            long time scale, static structure
solution, purity                             single crystal, purity
< 20kD, domain                               any size, domain, complex
functional active site                       active or inactive
domains                                      domains
determining atomic nuclei, chemical bonds    electron density
resolution limit 2-3.5Å                      resolution limit 2-3.5Å
primary structure must be known              primary structure must be known (except if resolution is
                                             2Å or better for every single residue)
Bioinformatics databases are split into several categories, which have been reviewed broadly in section 3. There are three types of database:
Primary, Secondary and Tertiary “Structural Classification” databases, each of which has its importance in bioinformatics and in the
determination and prediction of protein structures. The primary databases are used to locate and match similarities between a sample of
unknown sequence and the sequences contained within the database, allowing the rapid identification and classification of protein
sequences. The secondary databases enable more extensive information about protein structures to be stored and retrieved. They store and
maintain structural data for each sequence as well as other derived information that allows structural illustrations to be produced; these are
normally held in files which come in a number of different formats depending on the database and on the visualisation software used to
illustrate the structure. The Secondary and Structural Classification databases express in detail the higher level organisation of protein
structures, including the alpha helix, beta sheet and domain/motif structures present in a protein's structure, whereas Primary databases do
not contain such information. The Structural Classification databases go further, allowing comparison between protein structures to search
for similarities and enabling the structural classification of folds, secondary structures and extended structures. For Structural Genomics
these are the most useful databases, and the ones primarily used for comparing as well as visualising structural information about proteins.
Retrieved information can be viewed in a variety of formats. As depicted in Section 6 and in Table 02, there are flat files that contain
sequence, atomic and other protein data, but there is also more extensive information depicting the visual aspect of a protein; this comes in
table and image formats that can be altered according to the user's needs to view either functional or structural information about each protein
(Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
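The matching of an unknown sequence against a primary database can be illustrated with a minimal Python sketch. It assumes the sequences have already been aligned to the query (gaps written as "-"), and the function names percent_identity and rank_hits, like the entry names in the usage note, are hypothetical and not part of any database's own tools. Real database searches use dedicated alignment programs; this only shows the idea of scoring and ranking by identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are not counted (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def rank_hits(query: str, database: dict) -> list:
    """Score every pre-aligned database entry against the query, best first."""
    scored = [(name, percent_identity(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)
```

For example, rank_hits("ACDEFG", {"entry_a": "ACDEFG", "entry_b": "ACDQQG"}) places entry_a first, since all six positions match, whereas entry_b matches at four of six.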
SCOP and CATH are both structural classification databases. SCOP relies on 30 to 50 percent sequence similarity, whilst CATH relies on a
higher level of sequence identity (60 percent). Both SCOP and CATH depend heavily on a combination of automated and manual methods,
but the SCOP database’s automatic method tends to be unreliable in the comparison of structural relationships. The CATH database uses the
Enzyme Classification (E.C) system, allowing more efficient computational manipulation of data. CATH and SCOP are both hierarchical
domain classification systems for proteins which use a keyword interrogation system to search the database. The Protein Data Bank, in
comparison to both SCOP and CATH, not only expresses information in “Relational Oriented” and “Object Oriented” formats but also
provides extensive “Flat File” format outputs. Protein Data Bank files contain residue dictionaries, atom coordinates and sequence
information which are maintained in chemical graphs, and also hold the details of the authors and a description of the protein. The Protein
Data Bank has made use of two different types of file format (pdb and mmCIF), which allow protein structures to be expressed visually
and contain both “implicit” and “explicit” data. NCBI, on the other hand, contains an even larger database of sequences as well as
structures. Like the other three databases, NCBI is an online database which is easily accessible through the internet. It is integrated with
several other databases and is able to compile and retrieve information from each incorporated source. In addition to the sequence and
structural information found in the PDB database, NCBI provides chromosome maps and integrates scientific literature, DNA & Peptide
sequence databases, 3D Protein Structure & Domain Data and taxonomic information; all structure results are maintained and expressed
(linked) in MMDB Structure Summary format. As in the Protein Data Bank, the majority of protein structures have been determined using
either X-ray crystallography or NMR spectroscopy. Unlike the Protein Data Bank, NCBI is linked to the Vector Alignment Search Tool,
also known as VAST, which enables the detection and determination of structural similarities between 3-Dimensional structures
(Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
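The hierarchical nature of a domain classification can be sketched in Python using CATH as the example. The four levels used here (Class, Architecture, Topology, Homologous superfamily) are those the CATH acronym stands for, written as a dotted code such as 3.40.50.300. The names CathCode, parse_cath_code and shared_levels are hypothetical helpers for illustration only, not part of any CATH software.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """One domain classification: Class, Architecture, Topology, Homologous superfamily."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted four-level code such as '3.40.50.300' into its levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_levels(a: CathCode, b: CathCode) -> int:
    """Count how many levels two domains share, reading from the top of the hierarchy."""
    count = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        count += 1
    return count
```

Two domains coded 3.40.50.300 and 3.40.50.720 share three levels (the same class, architecture and topology) but belong to different homologous superfamilies.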
The Protein Data Bank uses the column-orientated “flat” pdb file format to express structural data for each protein. The file contains a highly
extensive level of information about atomic coordinates and other structural properties and is split up into several sections; as stated above,
pdb files include atom coordinates and sequence information which are maintained in chemical graphs, as well as the details of the authors
and a description of the protein. The mmCIF file is similar in structure to the pdb file in that it is sectioned into several blocks of extensive
information, but in addition the mmCIF file contains a residue dictionary, which enables structural validation to take place and therefore
allows a more accurate image to be produced in molecular viewers. As mentioned above, mmCIF reduces the ambiguity within structural
and sequence conformations. NCBI uses a different file format known as MMDB. Like the pdb and mmCIF files it is highly textual and
highly detailed in its atomic and molecular descriptions of each protein but, like the pdb file, it neither contains nor is linked to residue
dictionaries and is therefore not able to validate atomic and molecular conformations. Each of these file formats is used by different
visualisation software. The first visualisation software this document looked at was RasMol. As described above, this program uses the
mmCIF file format to retrieve structural and atomic data and to express it visually with high levels of conformational accuracy, since the
residue dictionaries enable RasMol to validate and calibrate the visualised images more effectively. Like the other programs for visualising
proteins, it is available online and can be readily used to view protein structures through the web. The visualisation software Cn3D
translates data files as RasMol does, but uses the MMDB file instead of the mmCIF file. This means that there is no residue dictionary to
validate visualised structures, but errors in chemical bonding are removed through the use of the ASN.1 language, which also allows
streamlined and easy access to structural information through web servers and phone lines. This does not, however, compensate for the
accuracy and level of validation offered by RasMol. Jmol is a broad-based visualisation program which is able to run the same scripts as
RasMol. Jmol uses not only the pdb file format but also an extended range of other file formats, including the mmCIF files which current
RasMol versions are also able to use. Jmol is broader since it is able to view structural information from file formats other than pdb and
mmCIF. Jmol, like RasMol, can validate and accurately maintain visualised structural conformations, since it is able to access residue
libraries using the CIF data files for protein structures (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
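Because the pdb format is column-orientated, each field of a coordinate record sits at a fixed character position, which a short Python sketch can make concrete. The column positions below follow the published ATOM/HETATM record layout of the pdb format; the Atom class and the functions parse_atom_record and read_atoms are hypothetical names chosen for this illustration, and only a subset of the fields is read.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_record(line: str) -> Atom:
    """Read one ATOM/HETATM line by fixed column positions (0-based slices)."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11: atom serial number
        name=line[12:16].strip(),      # columns 13-16: atom name
        res_name=line[17:20].strip(),  # columns 18-20: residue name
        chain_id=line[21],             # column 22: chain identifier
        res_seq=int(line[22:26]),      # columns 23-26: residue sequence number
        x=float(line[30:38]),          # columns 31-38: x coordinate in Å
        y=float(line[38:46]),          # columns 39-46: y coordinate in Å
        z=float(line[46:54]),          # columns 47-54: z coordinate in Å
    )


def read_atoms(lines) -> list:
    """Keep only the coordinate records from an iterable of pdb file lines."""
    return [parse_atom_record(line) for line in lines
            if line.startswith(("ATOM  ", "HETATM"))]
```

Header, author and description records are simply skipped by read_atoms; a molecular viewer would read those sections as well.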
10. Conclusion
This document has touched on the resources available to a researcher or user of Genomic Databases, depending on what they are looking
for. There are extensive online resources, freely accessible through the web, that offer not only a wide range of facilities but also extensive
information. In Structural Genomics, many of the structures that are now available would not exist without the experimental methods used
to determine them, since the computational techniques used to model them depend on experimentally determined structural conformations
of proteins. This allows structural conformations to be compared with peptide/nucleotide sequences, and undetermined sequences to be
compared with the pre-determined structural conformations of known sequences.
Even though each experimental and computational method has its own merits, the most accurate and reliable experimental techniques
that can be used in conjunction with computational methods are X-Ray Crystallography and NMR Spectroscopy (see Table 03). X-Ray
Crystallography is better suited to larger proteins of more than 20kD and offers a resolution of between 2 and 3.5Å, whilst NMR
Spectroscopy is better suited to smaller protein structures and smaller/isolated domains of less than 20kD. To gain extra definition of
structures within a protein that cannot be determined by X-Ray Crystallography, Cryo-Electron Microscopy can be used to show areas of
the protein that are not easily shown by X-Ray Crystallography, but at a lower resolution of between 5 and 10Å.
Each of the predictive, computational modelling methods is used for a set purpose. Homology modelling is used for an undetermined
protein sequence with a known homologue, that is, with above 30% similarity. If the identity is less than 30%, the undetermined protein is
put through Fold Recognition modelling, where the structure is fitted to a set protein model. If no accurate model is found, the protein
sequence is modelled using Ab initio prediction, which uses physical laws to dictate a protein's conformation. Homology modelling is by
far the most accurate of the three methods, with an accuracy of about 3Å, and is the quickest and easiest to perform, along with Fold
Recognition, which is just as quick to model and is only marginally less accurate [24].
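The order in which the three modelling methods are tried can be written as a small Python sketch. The 30% figure comes from the text above; the function name choose_modelling_method, its parameters, and the decision to treat exactly 30% as sufficient for homology modelling are assumptions made for illustration only.

```python
from typing import Optional

# Threshold taken from the text; treating exactly 30% as "enough" is an assumption.
HOMOLOGY_IDENTITY_THRESHOLD = 30.0


def choose_modelling_method(best_identity: Optional[float],
                            fold_match_found: bool) -> str:
    """Pick a prediction method following the order described in the text.

    best_identity: percent identity to the closest homologue of known structure,
                   or None if no homologue was found.
    fold_match_found: whether fold recognition produced an acceptable model.
    """
    if best_identity is not None and best_identity >= HOMOLOGY_IDENTITY_THRESHOLD:
        return "homology modelling"
    if fold_match_found:
        return "fold recognition"
    return "ab initio prediction"
```

For instance, a sequence with a 45% identical homologue of known structure would go to homology modelling, one with only 22% identity but a fitting fold would go to fold recognition, and one with neither would fall through to ab initio prediction.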
Both the primary and secondary databases can be used in structural modelling, whilst the structural databases contain the protein structures
themselves. Of the structural classification systems, CATH would be the better choice in comparison with SCOP, since CATH relies on 60
percent identity and uses the E.C system, which allows greater computer manipulation of data classification. Concerning databases that
contain actual structural representations rather than their classification, the Protein Data Bank is the better Structural Database, because each
structure file held within the database can be validated against the residue dictionaries in the mmCIF file format. This format has been
updated from the older pdb file format and contains identical information on sequences, atomic coordinates, protein ID’s “identifier code”,
the name of the protein and the list of authors, along with descriptions of the structure of the protein. Furthermore, the use of both RasMol
and Jmol with this database gives greater versatility in the visualisation of proteins, along with Jmol's ability to visualise protein structures
from file formats other than pdb and mmCIF (excluding MMDB). NCBI is a far greater tool for similarity comparisons between protein
structures, because it allows a greater range of data searching and compiling, obtaining information from a larger variety of databases
(“Global Query Cross-Database”), and also enables viewing of both primary and secondary databases along with extensive chromosomal
maps.
Jmol and RasMol are the most accurate visualisation software for depicting protein structures, due to their access to residue libraries.
Jmol, however, is better able to read multiple file formats and is in that respect more versatile than RasMol.
References
Attwood T.K. and Parry-Smith D.J.; Introduction to Bioinformatics. Longman (1999).
Bae, E. and Phillips, G.N. Jr.; Structures and Analysis of Highly Homologous Psychrophilic, Mesophilic, and Thermophilic
Adenylate Kinases; The Journal of Biological Chemistry Volume: 279; Number: 27; Page Numbers: 28202–28208 (2004).
Baker, D. and Bonneau, R.; Ab Initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct. 30, 173
(2001).
Baker, D., Bonneau, R., Chivian, D., Ruczinski, I., Rohl, C., Tsai, J., Strauss, C. E. M.; ROSETTA in CASP4: Progress in Ab Initio protein
structure prediction. Proteins: Structure, Function, and Genetics Suppl 5, 119 (2001).
Baker, D., Sali, A.; Protein structure prediction and structural genomics. Science.294, 93. (2001).
Bates, A.D., Turner, P.C.; McLennan, A.G.; White, M.R.H.; Instant Notes: Molecular Biology (2nd Edition); BIOS Scientific Publishers
(2000).
Baxevanis, A.D. and Ouellette, B.F.F. (eds.); Bioinformatics. A Practical Guide to the Analysis of Genes and Proteins (3rd edition). John
Wiley (2005).
Berg, J.M, Tymoczko, J.L., Stryer, L.; Biochemistry (6th Edition); W.H. Freeman (2007).
Branden, C. and Tooze, J.; Introduction to Protein Structure, Garland Publishing; (1991).
Bourne, P.E. (Editor) and Weissig, H. (Editor); Structural Bioinformatics, WileyEurope (2003).
Bowie, J.U.; Solving the membrane protein folding problem. Nature 438: 581-589 (2005).
Campbell, A.M. and Heyer, L.J.; Discovering Genomics, Proteomics and Bioinformatics. Benjamin Cummings (2007).
Fersht, A.; Structure and Mechanism in Protein Science. W.H.Freeman and Co. (1999).
Gibas, C. and Jambeck, P.; Developing Bioinformatics Computer Skills. O’Reilly and Associates Inc. (2001).
Hames, D.; Hooper, N.; (Third Edition), Instant Notes: Biochemistry, Taylor & Francis (2005).
Heymann, J. B.; Bsoft: image and molecular processing in electron microscopy. Journal of Structural Biology 133 (2-3): 156 – 69 (2001).
Heymann, J. B., and Belnap, D. M.; Bsoft: Image processing and molecular modeling for electron microscopy. Journal of Structural Biology
157: 3 – 18 (2007).
Heymann, J. B., Cardone, G., Winkler, D. C. and Steven, A. C.; Computational resources for cryo-electron tomography in Bsoft. Journal of
Structural Biology in press (2007).
Hickey, G.I., Fletcher, H.L., Winter, P.; Instant notes in Genetics (3rd Edition) Taylor & Francis Group, (2007).
Kane, D.E. and Rayner, M.L.; Fundamental Concepts of Bioinformatics. Benjamin Cummings (2003).
Kleanthous, C. (ed.); Protein-protein Recognition. Frontiers in Molecular Biology. Oxford University Press (2000).
Leach, A.R.; Molecular Modelling. Principles and Applications (2nd edition). Longman (2001).
Lesk, A.; Introduction to Bioinformatics (2nd Edition), Oxford University Press (2005).
Moult, J.; Predicting protein three-dimensional structure. Current Opinion in Biotechnology 10 (6) 583-588 (1999).
Patrick, G.L.; Organic chemistry (2nd Edition), Taylor & Francis Group, (2004).
Petsko, G.A. and Ringe, D.; Protein Structure and Function. New Science Press Ltd (2004).
Samudrala, R.; Modeling genome structure and function; Pure Appl. Chem., Vol. 74, No. 6, pp. 907–914 (2002).
Turner, P.; Molecular biology (3rd Edition), Taylor & Francis, (2005).
Westhead, D.R., Parish, J.H. and Twyman, R.M.; Instant Notes: Bioinformatics. BIOS Scientific Publishers (2002).
Zubay, G.L.; Biochemistry (4th Edition), Wm. C. Brown Publishers (1998).
1. http://www.chem.ucsb.edu/~kalju/chem110L/public/tutorial/images/
2. http://www.langara.bc.ca/biology/mario/Biol2315notes/biol2315chap3.html
3. http://kentsimmons.uwinnipeg.ca/cm1504/proteins.htm
4. http://research.yale.edu/ysm/article.jsp?articleID=51
5. http://bouman.chem.georgetown.edu/nmr/protein.htm
6. http://www.bnl.gov/bnlweb/pubaf/pr/PR_display.asp?prID=07-73
7. http://www.rpc.msoe.edu/cbm2/gfp1.htm
8. http://www.cryst.bbk.ac.uk/PPS2/projects/vun/MHC_master.htm
9. http://www.med.unibs.it/~marchesi/pps97/course/section9/9_term.html
10. http://bioinfo.ssu.ac.kr/bbs/zboard.php?id=link_new&page=1&category=&sn=off&ss=on&sc=on&keyword=&prev_no=&sn1=&divpage=1
11. http://www.ctwatch.org/quarterly/print.php?p=83
12. http://genome.gsc.riken.go.jp/hgmis/posters/chromosome/pdb.html
13. http://home.cc.umanitoba.ca/~psgendb/GDE/dataset/dataset.html
14. http://chemweb.calpoly.edu/llindert/313-structure-tutorial.html
15. http://florey.biosci.uq.edu.au/Subjects/BC327/Material/
16. http://jmol.sourceforge.net/
17. http://www.umass.edu/microbio/rasmol/
18. http://mbt.sdsc.edu/
19. http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
20. http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dtut.shtml
21. http://www.bioinformaticscourses.com/ISB/sp2003/2TSC/
22. http://biol.lf1.cuni.cz/ucebnice/pohlavi.htm
23. http://jmol.sourceforge.net/screenshots/
24. http://lectures.molgen.mpg.de/Algorithmische_Bioinformatik_WS0405/material/Steinke_lecture_19_1.pdf