Protein Structure Project

    Protein Structure Project - Presentation Transcript

    1. PROJECT: Evaluation of Information about Protein Structures from Genome Sequencing Projects. PPS 2007/08, PG Cert. Principles of Protein Structure, Birkbeck, University of London. Sami El-Sabbahy, ID Number: 12408964
    2. Contents: Abstract; 1. Introduction; 2. Protein Structure; 3. Protein Structural Resources & Databases; 4. Experimental Protein Determination; 5. Structural Modeling; 6. Structural Genomic Databases; 7. 3-Dimensional Structural Data; 8. Structural Visualisation; 9. Evaluation; 10. Conclusion; References.
    3. Abstract The determination, sequencing and visualisation of protein structures from genome sequencing projects is commonly referred to as Structural Genomics. Structural Genomics is a specific area of bioinformatics that combines experimentally determined nucleic acid and polypeptide sequences and 3-Dimensional protein structures with high levels of computational prediction to build 3-Dimensional structures of unknown or undetermined proteins. This document reviews how and where the structures are stored and how protein structures are predicted, and it broadly examines what types of information about protein structures the databases contain and how that information is expressed.
    4. 1. Introduction The genome project is the mapping and sequencing of the genomes of humans and of other organisms. Part of this work is “Bioinformatics”: the study, design and use of computational and mathematical tools to process biologically derived data. In the 1980s the US Department of Energy commissioned the first genome project, which mapped the physical and genetic aspects of the human genome. This was followed by the formation of three sequence data centres: the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and the American GenBank at NCBI. The primary function of these three organisations was to create and maintain resources covering nucleotide sequence databases, protein sequence databases, structural classification databases, sequence alignment and database searching, protein structure, RNA structure, protein structure prediction/modelling, and phylogeny and biodiversity. The original genome project was geared towards the human genome and was later expanded to include organisms such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans. Within bioinformatics and the genome project there is extensive information about protein structures and about polypeptide, RNA and DNA sequences. Over the last couple of decades the amount of information about these sequences and structures has increased exponentially. Protein structures are made from folded polypeptide chains, which are built from different sequences of 20 different units, the “amino acids”. The amino acid sequence forms a polymer chain known as the “Primary Structure”. The sequence of amino acid residues folds into a specific conformation through atomic bonding and molecular interactions; the resulting loops, turns, α helices and β sheets are known as the “Secondary Structure”.
The way the α helices and β sheets fit together to build a more extensive structure is known as the “Tertiary Structure”. When a protein contains more than one polypeptide, the arrangement of the associated chains, which are held together mainly by non-covalent interactions and sometimes by covalent disulfide bonds, is known as the “Quaternary Structure”. Amino acids are the basic units of a protein. Each contains two functional groups, an amino group (NH2) and a carboxyl group (COOH), along with an organic side chain known as the R group, which varies in size and in chemical and molecular properties. These groups are connected by a central carbon atom known as the α carbon. Amino acids form chains of any length and of varying residues, which are connected, as said above, through covalent bonds known as peptide bonds. An early landmark three-dimensional structure was that of deoxyribonucleic acid (DNA), determined in 1953 by James Watson and Francis Crick using X-ray diffraction data from Rosalind Franklin and Maurice Wilkins. [Figure: Francis Crick and James D. Watson. ADAPTED FROM: Chemistry & Biochemistry Department, University of California [1].] Protein structural analysis is done in one of two ways: through experimental techniques (structural determination) or through modelling techniques (structural prediction). These techniques are used not only to determine a protein’s structure but also to visualise it. The experimental techniques involved in determining protein structure are X-ray diffraction (crystallography), Nuclear Magnetic Resonance (NMR) and cryo-electron microscopy (cryo-EM). The predictive techniques used to fast-track the determination and modelling of protein structures are ab initio modelling and homology modelling. The first protein to be sequenced was insulin (a hormone) in 1955, and the first enzyme was ribonuclease in 1960.
The first method of sequence determination was Edman degradation (dansylation), which was later followed by mass spectrometry in 1979. These techniques are used to determine the sequence of polypeptides. Once a sequence has been determined, the way it folds and the side-chain interactions can be worked out, taking molecular dynamics, energy minimisation and molecular mechanics into account. A variety of software is used to visualise protein structures, which are maintained and stored within the databases named above. The use of genomes to determine the structure of molecules such as proteins is known as “Structural Genomics” (SG). Structural Genomics, through structural biology, has enabled the discovery of relationships between amino acid sequences and protein structures, and has allowed concepts such as protein family, fold and superfamily to be developed. This has in turn enabled detailed taxonomies and a better understanding of the three-dimensional shapes of proteins.
    5. 2. Protein Structures A protein is a polymer chain built from monomer units known as amino acids. A protein’s structure, as well as its function, is determined by the sequence and properties of its amino acid residues. A protein’s structure has four successive levels of organisation: primary (1°), secondary (2°), tertiary (3°) and quaternary (4°). As stated above, the primary structure of a protein is the linear sequence of amino acids (monomers) in a polypeptide chain (polymer). The secondary structure is the local geometric conformation adopted by segments of the chain. The tertiary structure is the folding of the secondary-structure elements into a compact whole, whilst the quaternary structure is the organisation of protein subunits (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004). The monomeric unit of a protein chain is the amino acid, which has five constituents: the central alpha carbon atom and the four substituents connected to it, namely the alpha proton (-H), the side chain (-R), the carboxylic acid functional group (-COOH) and the amino functional group (-NH2). With the exception of glycine, all alpha carbons are asymmetric, and each amino acid containing an asymmetric α carbon occurs in proteins as the L-isomer; see figure 2.1 (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004). Figure 2.1 - Amino Acid Molecular Structure: shows a simplified diagram of the structural components and functional groups of an amino acid. ADAPTED FROM: the Protein ChemCards, Chemistry Department at Hibbing Community College [2]. Polypeptide chains are formed through condensation reactions: the amino group of one amino acid reacts with the carboxyl group of another, forming a covalent C-N (peptide) bond and releasing a molecule of water.
Primary databases are used to match polypeptide sequences against the sequence information they contain, which is derived primarily from DNA/RNA sequence analysis; see figure 2.2 (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004). Figure 2.2 - Amino Acid Molecular Structure: shows the covalent bonding that occurs between the amino group of one amino acid and the carboxyl group of another amino acid. ADAPTED FROM: Department of Biology, University of Winnipeg [3].
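The condensation reaction described on this slide can be illustrated with a little arithmetic: every peptide bond formed releases one water molecule, so a chain of n residues weighs the sum of its free amino acids minus (n - 1) waters. The sketch below is only an illustration; the function name `peptide_mass` and the five-entry mass table are assumptions introduced here, and the masses are rounded average molar masses.

```python
# Rounded average molar masses (g/mol) of a few free amino acids.
# Illustrative subset only; a real table would cover all 20 residues.
FREE_AMINO_ACID_MASS = {
    "G": 75.07,   # glycine
    "A": 89.09,   # alanine
    "S": 105.09,  # serine
    "V": 117.15,  # valine
    "L": 131.17,  # leucine
}
WATER_MASS = 18.015  # g/mol, released once per peptide bond formed


def peptide_mass(sequence: str) -> float:
    """Approximate molar mass of a linear peptide given in one-letter code."""
    if not sequence:
        return 0.0
    total = sum(FREE_AMINO_ACID_MASS[residue] for residue in sequence)
    peptide_bonds = len(sequence) - 1  # one condensation per bond
    return total - peptide_bonds * WATER_MASS
```

For the dipeptide Gly-Ala, `peptide_mass("GA")` subtracts a single water from the two free amino acid masses.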
    6. The secondary structure is the spatial arrangement of a segment of a polypeptide sequence. Three major structural conformations commonly occur in the secondary structure: alpha helices, beta sheets and turns. A regular secondary-structure conformation arises when all the Φ bond angles in that polypeptide segment are equal to each other and all the ψ bond angles are equal. The alpha helix and beta sheet structures are thermodynamically stable, whereas turns are favoured by certain amino acids. The conformation of a protein’s secondary structure depends on the properties of the sequence and of the amino acids within it; see figure 2.3 (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004). Figure 2.3 - Secondary α and β structures: shows the two most common secondary structures in a protein, the α helix and the β sheet. ADAPTED FROM: Department of Biology, University of Winnipeg [3]. Distortions also occur within the secondary structure, such as alpha-helical curvature, in which the CO and NH groups form hydrogen bonds with residues 3 positions along the chain, producing the 3₁₀ helical structure (3 is the number of residues per turn, and 10 is the number of atoms in the ring closed by the hydrogen bond). Certain proteins contain an additional level of secondary structural organisation: an ordered arrangement of secondary-structure elements known as the super-secondary structure. This ordered organisation forms structurally functional sections of the protein known as motifs, such as the Helix-Turn-Helix, Leucine Zipper, Helix-Loop-Helix and Zinc Finger (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
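The idea that a run of residues sharing similar Φ and ψ angles forms one secondary-structure element can be sketched as a small classifier. The angle windows below are coarse, illustrative assumptions (a box around the right-handed α region and a box around the β region of a Ramachandran plot), not published boundaries, and the function names are hypothetical.

```python
from itertools import groupby


def classify(phi: float, psi: float) -> str:
    """Assign a residue to a coarse backbone region from its dihedrals (degrees)."""
    # Assumed window around the right-handed alpha-helix region.
    if -160.0 <= phi <= -20.0 and -120.0 <= psi <= 50.0:
        return "helix"
    # Assumed window around the beta-sheet region (psi wraps around +/-180).
    if -180.0 <= phi <= -45.0 and (psi >= 90.0 or psi <= -150.0):
        return "sheet"
    return "other"


def segments(angles):
    """Collapse per-residue labels into (label, run_length) segments."""
    labels = [classify(phi, psi) for phi, psi in angles]
    return [(label, len(list(run))) for label, run in groupby(labels)]
```

Four residues near (-57, -47) followed by three near (-120, 120) would be reported as one helical segment and one sheet segment. Real assignment programs also use hydrogen-bond patterns, which this sketch ignores.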
The tertiary structure refers to the three-dimensional arrangement of a protein: how the secondary-structure elements of a polypeptide chain are positioned in space relative to one another and how they fold together. It primarily concerns the interactions between domains and motifs through hydrogen bonding, hydrophobic interactions, electrostatic interactions and Van der Waals forces (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004). Domains are classified as all-α domains (containing only α-helices), all-β domains (containing only β-sheets in parallel or anti-parallel arrangements, such as the Greek Key motif), α+β domains (containing both α and β regions) and α/β domains (containing β-α-β motifs). The all-α, all-β and α/β classes are incorporated into the CATH domain database, whereas α+β domains are not given a class of their own because structures in this group overlap extensively. Structural alignment is also used as a structural database tool to determine and “classify” the domains of an unknown protein structure; see figure 2.4 (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004). Figure 2.4 - Tertiary Structure: shows the interactions that take place between secondary-structure elements in forming the tertiary structure. ADAPTED FROM: Department of Biology, University of Winnipeg [3]. The quaternary structure describes the structure of proteins that contain several subunits (multiple polypeptide monomers). It is essentially the arrangement of all the monomeric units within the three-dimensional structure of the protein. The best example is hemoglobin, which is composed of four monomeric units.
Monomeric units are either identical (homo) or different (hetero); a multimeric/oligomeric protein with identical monomeric units is therefore called a “homomer”, whilst a protein with different monomeric units is called a “heteromer” (Branden, C. et al. 1991; Zubay, G.L. 1998; Berg, J.M. et al. 2007; Whitford, D. 2005; Hames, D. et al. 2005; Petsko, G.A. et al. 2004).
    7. 3. Protein Structural Resources & Databases The main aim of the genome project was to map the sequence and physical parameters of genomes; bioinformatics made it possible to computerize, archive and retrieve sequence and structural information. There now exist several databases that contain sequence information for nucleic acids and peptides, as well as a detailed catalogue of protein structures. There are two main types, primary and secondary databases; see Table 01 (Attwood T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005).

    Table 01: Genomic Databases
    Primary Databases   Primary Databases   Secondary Databases
    (Nucleic Acid)      (Protein)           (Protein)
    EMBL                PIR                 PROSITE
    GenBank             MIPS                Pfam
    DDBJ                SWISS-PROT          SCOP
                        TrEMBL              CATH
                        NRL-3D              Protein Databank

Primary databases are archival: they contain unprocessed sequence data derived from experimental analysis. The three main nucleotide databases are the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and the American GenBank at NCBI. A primary database mainly contains nucleotide or peptide sequences written as strings of letters, with CDS (coding sequence) entries among the most common annotations. Protein sequence databases include Swiss-Prot and PIR, and genome/nucleotide sequence databases include GenBank and DDBJ. The primary protein sequence databases include UniProt (Universal Protein Resource), a wide-ranging catalogue of protein information. The information created, maintained and contained within the catalogue comes from the Swiss-Prot, TrEMBL and PIR databases. UniProt/TrEMBL can be accessed and used through the ExPASy (Expert Protein Analysis System) Proteomics Server.
UniProtKB/Swiss-Prot is the manually annotated protein database, whereas UniProtKB/TrEMBL is its computer-annotated supplement: it contains the translations of all EMBL nucleotide sequence entries that are not yet integrated into Swiss-Prot, and is quoted here as holding 232,345 entries (Attwood T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005). Secondary databases are curatorial: they contain information derived from the primary databases, together with a review of the relevant information. They hold structural and sequence information about sequences that have been extensively annotated. A secondary sequence database contains information such as the conserved sequences, signature sequences and active site residues of protein families, arrived at by multiple sequence alignment of a set of related proteins. The entries include data about the secondary structures of proteins, which are classified by motif and domain structure, such as all-alpha proteins, all-beta proteins and so on. Several structural databases were formed and are maintained by individual laboratories; these include SCOP (Cambridge University), CATH (University College London), PROSITE (Swiss Institute of Bioinformatics) and eMOTIF (Stanford). Secondary protein databases are pattern databases that use multiple alignments of homologous sequences, and there are several of them, each storing different information about protein structures. The first is PROSITE, which uses the primary database SWISS-PROT as its major source of information. The patterns and entries generated using PROSITE are short patterns. Pfam, by contrast, is a database containing a large number of multiple sequence alignments.
Pfam uses hidden Markov models to create protein family or domain signatures. Two forms of alignment are created and stored: Pfam-A alignments are edited by hand and are accurate, whereas Pfam-B is derived from automatic clustering of SWISS-PROT and is less reliable. A third secondary database is SMART (Simple Modular Architecture Research Tool), which expresses in visual terms the domains that are present within protein sequences (Attwood T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005). Structural classification databases enable the three-dimensional structures of known proteins to be classified and allow structural similarities and differences to be compared. In general, proteins with similar sequences and functions will adopt a similar overall three-dimensional structure, particularly at the crucial active site residues. There are two structural classification schemes: CATH and SCOP. In addition there are composite databases, which draw on several primary databases and so search multiple resources at once. There is a multitude of diverse composite databases, which differ both in the primary databases they use and in their search criteria. One such composite resource is the National Center for Biotechnology Information (NCBI), which not only contains nucleotide and protein databases hosted by massive and openly available computer servers, but is also linked to OMIM (Online Mendelian Inheritance in Man) (Attwood T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005).
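To make the notion of a short PROSITE-style pattern concrete, the sketch below translates the usual pattern notation (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) into a Python regular expression. The function name `prosite_to_regex` is an assumption made for this illustration, and less common notation, such as `>` inside square brackets, is deliberately not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern string into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        piece = core + ("{" + count + "}" if count else "")
        if anchored_start:
            piece = "^" + piece
        if anchored_end:
            piece += "$"
        parts.append(piece)
    return "".join(parts)
```

A zinc-finger style pattern such as `C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H` becomes a regex that can be passed to `re.search` to scan a protein sequence held as a string.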
    8. 4. Experimental Protein Determination As mentioned in the introduction, the first experimental technique for determining structure is X-ray crystallography, which requires the purification of the protein and the growth of protein crystals. The crystals are aligned with a high level of accuracy, mounted as rigid structures on a goniometer, and X-ray beams of wavelength 0.1 to 0.2 nm are passed through them. On passing through the crystal the X-rays scatter and reflect (the reflection can be determined using Bragg’s Law), producing a diffraction pattern on photographic film. The intensities of the dark spots on the film, known as “diffraction maxima”, are recorded at different rotations of the crystal and are used to calculate and construct the 3-Dimensional structure mathematically using Fourier transforms. X-ray crystal structures are useful for accounting for electronic and elastic properties, and they help determine the chemical interactions of compounds; see figure 4.1 (Branden, C. et al. 1991; Whitford, D. 2005; Hames, D. et al. 2005). Figure 4.1 - X-Ray Crystallography: shows the “diffraction maxima” contained within the diffraction patterns. ADAPTED FROM: The Yale Scientific Magazine [4]. The second experimental technique is Nuclear Magnetic Resonance (NMR) spectroscopy, which is used to determine relatively small protein structures of about 30 kDa. With this technique the proteins are kept in aqueous solution and do not need to be crystallized; instead of X-ray diffraction, it uses magnetic fields and radio frequencies to determine a protein’s structure.
The structure is determined from differences in the behaviour of the nuclear spin of each atom in the peptide chain: each atom responds differently depending on which atom(s) it is bonded to and on the nearest residues. From the magnitude of this effect, distances can be calculated and used to generate a representation of the protein’s 3-Dimensional structure. The result of NMR spectroscopy is the NMR spectrum (see figure 4.2), which shows line signals representing each compound/amino acid. The concentration of each amino acid within a protein varies, and the characteristics of each peak in the spectrum change accordingly (Branden, C. et al. 1991; Whitford, D. 2005; Hames, D. et al. 2005). Figure 4.2 - NMR Spectroscopy: shows a graphical representation of Ubiquitin obtained from NMR. ADAPTED FROM: The Department of Chemistry, Georgetown University [5]. The third experimental technique used in determining the structure of proteins is cryo-electron microscopy, which is primarily used to determine the 3-Dimensional structure of multi-subunit proteins and of proteins that are not easily crystallized. In this technique a protein sample is rapidly frozen in liquid helium and then examined under the cryo-electron microscope using low-dose electrons. The images are then analysed using complex computer programs, which reconstruct the protein structure as a 3-Dimensional image. The colourful wheel images are computer-generated models of the molecular structure of the protein, superimposed over the electron micrograph at the positions where the proteins are located in the array; see figure 4.5 (Branden, C. et al. 1991; Whitford, D. 2005; Hames, D. et al. 2005). Figure 4.2 - Cryo-Electron Microscopy: shows a micrograph of an enzyme obtained from Cryo-Electron Microscopy. ADAPTED FROM: The Brookhaven National Laboratory [6].
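Bragg’s Law, mentioned in the X-ray crystallography paragraph of this slide, relates the X-ray wavelength λ, the spacing d between crystal planes and the reflection angle θ as nλ = 2d sin θ. The sketch below solves it for θ; the function name and the example values are assumptions chosen for illustration, with the wavelength taken from inside the 0.1 to 0.2 nm range quoted above.

```python
import math


def bragg_angle_degrees(wavelength_nm: float, spacing_nm: float, order: int = 1) -> float:
    """Solve n * lambda = 2 * d * sin(theta) for theta, returned in degrees."""
    ratio = order * wavelength_nm / (2.0 * spacing_nm)
    if not 0.0 < ratio <= 1.0:
        # sin(theta) cannot exceed 1, so no reflection exists for these inputs.
        raise ValueError("no diffraction: n * wavelength must not exceed 2 * spacing")
    return math.degrees(math.asin(ratio))


# Example: a 0.154 nm beam on planes 0.154 nm apart gives sin(theta) = 0.5.
angle = bragg_angle_degrees(0.154, 0.154)
```

Larger plane spacings give smaller angles, which is why fine structural detail appears at the outer edge of a diffraction pattern.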
    9. 5. Structural Modeling Molecular modelling is used to predict the 3-Dimensional structure of a protein to a high degree of confidence when its sequence is compared with proteins whose structures have been determined experimentally. There are many types of molecular modelling technique that enable the three-dimensional structure of proteins to be predicted. The 3-Dimensional structure of a protein is very important, since it carries the information about biochemical function through binding sites, catalytic activity and the interactions between the protein and other molecules: between proteins, between proteins and nucleic acids (DNA/RNA), and between proteins and ligands. The identification and subsequent visualisation of ligand binding sites allows new drugs to be designed and synthesised, and several such projects are currently running. The size, shape and three-dimensional geometry of each ligand can be visualised using software tools and used to examine the interaction between a protein and its ligand, known as the “lock and key”. The structures of globular proteins are far more easily determined through computational prediction, whilst experimental determination remains difficult for all proteins regardless of structural conformation (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). There are three main computational prediction techniques used to determine the structure of proteins. The first is comparative (homology) modelling, which predicts an unknown protein’s structure from its sequence by aligning that sequence with the sequence of a known, experimentally determined protein. A structure is then modelled on the basis of the strength of the similarity between the unknown sequence and the known sequence.
The second computational prediction technique is fold recognition (threading), which is used when there are no known homologues of the unknown protein structure or sequence. This method compares the sequence of the unknown protein with known protein folds using a scoring system: sections of the unknown sequence are scored against known folds. The third computational prediction technique is the ab initio method, which determines the structure from first principles without reference to existing protein structures; the structure is modelled from empirical or semi-empirical experimental results using models of atoms and related molecules. Characterising the functional purpose of a protein is difficult, which is why accurate three-dimensional structures of proteins are produced using the modelling methods mentioned above and described in detail below. In nature the structure of a protein is determined both by the laws of physics and by evolution (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). Comparative or homology modelling uses known parent protein structures to build protein structures through sequence and structural comparison, in four main stages. A polypeptide sequence is first aligned with its parent polypeptide sequence and with other homologous sequences of the same origin. A primary “framework” structure for the new polypeptide sequence is then formed from the parent structure; additional loops, folds and extended structures are modelled; and finally the model is refined using side-chain geometry and packing. This technique depends heavily not only on the accuracy of the sequence alignment but also on how closely the sequences are related. Homology modelling identifies protein structures that are similar to the target protein through sequence comparison.
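The alignment step that homology modelling starts from can be sketched with a textbook global alignment (Needleman-Wunsch) followed by a percent-identity calculation. This is a minimal illustration rather than the procedure any particular modelling package uses: the scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return one optimal global alignment of a and b as two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring the diagonal move.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

The identity figure from such an alignment is the kind of number that decides whether a template is close enough to the target to be worth modelling from.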
The quality of homology modelling depends on protein structure, i.e. on motif and domain (including helix and strand) sequence similarity. The technique rests on the idea that proteins can be related, can evolve to form families and can share a distinct origin, hence orthologous proteins. The prediction has four stages: the first is to identify structural templates in a protein structure database, the second is to use alignment tools to align the target sequence to those templates, the third is to build a backbone, and the last is to incorporate the side chains. The homology method determines sequence similarity by aligning the sequences optimally, and the aligned amino acid residues are then used to create a model. The better the comparison and the higher the quality of the alignment, the more accurate the resulting model. Another factor is the ability to detect whether homologues exist (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). This technique is best suited to sequences with 50 percent or more similarity. Inaccuracies can still appear during the extrapolation and detailing of side-chain positions and during the insertion of loops and extended structures where a segment of the sequence does not match the parent sequences and structures. Homology modelling requires an unsolved sequence to be supplied together with a template “parent” sequence of high identity whose structure has been determined through X-ray crystallography or NMR; the two sequences are then aligned. When aligned, the unknown sequence and the template sequence must share, along both backbones, identical positions of the α-carbons as well as identical phi and psi angles and secondary structure.
    10. Template sequences with high identity can be found using a search tool such as BLAST, and the hits can then be compared with structural and sequence information from the Protein Data Bank. BLAST reports the level of similarity between sequences through measures such as the “E-value” and the “P-value (probability)”, where low P-values suggest biologically significant matches. The Protein Data Bank contains detailed structural information on proteins whose structures have been determined through either X-ray crystallography or Nuclear Magnetic Resonance. Unknown sequences can be compared with template sequences using SWISS-MODEL (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). The second method, threading (fold recognition), is used instead of comparative (homology) modelling when no known homologous structures match the sequence. It rests on the premise that 3-D structures are conserved and that protein sequences normally adopt similar folds even without obvious similarity in sequence and/or function. Fold recognition aligns and scores unknown sequences against the complete library of structural templates. It compares not only the way the templates fold but also how well each structure would fit the sequence, detecting similarities between a modelled sequence and one or more known structures. This can only be done if at least one of the proteins in a protein family has been experimentally determined through X-ray diffraction, Nuclear Magnetic Resonance (NMR) or cryo-electron microscopy (cryo-EM), so that the undetermined or unknown sequence can be aligned with the known structure; as stated above, this is also required for homology modelling (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
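Scoring a sequence against a library of fold templates, as described for threading, can be illustrated with a deliberately simplified toy. Each template here is assumed to be a string marking positions as buried (`B`) or exposed (`E`); a residue scores +1 when a hydrophobic residue sits in a buried position or a polar residue sits in an exposed one, and -1 otherwise. The template names, the hydrophobic set and the scoring rule are all assumptions made for illustration; real threading programs allow gaps and use statistically derived potentials.

```python
# Residues treated as hydrophobic in this toy model (an assumed, simplified set).
HYDROPHOBIC = set("AVLIMFWC")


def threading_score(sequence: str, environment: str) -> int:
    """Score how well a sequence fits a template's buried/exposed pattern."""
    if len(sequence) != len(environment):
        raise ValueError("this toy model only scores equal-length, ungapped pairs")
    score = 0
    for residue, env in zip(sequence, environment):
        buried = env == "B"
        hydrophobic = residue in HYDROPHOBIC
        # Reward hydrophobic-in-core and polar-on-surface, penalise the rest.
        score += 1 if buried == hydrophobic else -1
    return score


def best_fold(sequence: str, library: dict) -> str:
    """Return the name of the template with the highest score."""
    return max(library, key=lambda name: threading_score(sequence, library[name]))


# Hypothetical two-template library for a four-residue sequence.
toy_library = {"fold_a": "BBEE", "fold_b": "EEBB"}
```

With this library, a sequence whose first two residues are hydrophobic and last two are charged fits `fold_a` better than `fold_b`.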
Ab initio modelling predicts the 3-D structure of a protein using the thermodynamic hypothesis of protein folding, under which the native structure of a protein sequence corresponds to its global free energy minimum. Several approaches exist within this method of prediction, Rosetta being one example, and such methods are assessed in CASP (Critical Assessment of Techniques for Protein Structure Prediction). Ab initio methods are used to predict the protein structure when no templates can be found for homology modelling. The method proceeds in several steps: first, the protein structure and its conformational space are defined in a chosen representation; second, energy functions are defined for the protein structure in that representation; third, the energy functions are minimised. The final minimum-energy conformation is taken to be the actual structure of the native protein in its normal surroundings. The folding of the protein is dictated by the physical forces that act between atoms in every molecular structure. De novo modelling methods assume that the native structure corresponds to the global free energy minimum accessible during the lifespan of the protein, and they attempt to find this minimum by exploring many conceivable protein conformations. The two key components of de novo methods are the procedure for efficiently carrying out the conformational search and the free energy function used for evaluating possible conformations (Baker, D. et al. 2001).
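The three steps described above (choose a representation, define an energy function, search for its minimum) can be sketched with a toy Metropolis Monte Carlo search. Everything here is an assumption for illustration: the conformation is just a list of four dihedral angles, the energy function is a made-up cosine well around hypothetical preferred angles, and the search parameters are arbitrary. It is not the energy function or search strategy of Rosetta or of any real force field.

```python
import math
import random

# Step 1: representation. A conformation is a list of dihedral angles in degrees.
PREFERRED = [-57.0, -47.0, -57.0, -47.0]  # hypothetical low-energy angles


def energy(angles) -> float:
    """Step 2: a toy energy, zero when every angle equals its preferred value."""
    return sum(1.0 - math.cos(math.radians(a - p)) for a, p in zip(angles, PREFERRED))


def minimise(start, steps=5000, temperature=0.1, step_size=10.0, seed=0):
    """Step 3: Metropolis search; returns the lowest-energy conformation seen."""
    rng = random.Random(seed)
    current, current_e = list(start), energy(start)
    best, best_e = list(current), current_e
    for _ in range(steps):
        trial = list(current)
        index = rng.randrange(len(trial))
        trial[index] += rng.uniform(-step_size, step_size)  # perturb one angle
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e
```

Calling `minimise([180.0, 180.0, 180.0, 180.0])` starts from a fully extended set of angles; because the best conformation seen is tracked, the returned energy can never exceed the starting energy.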
    11. 6. Structural Genomic Databases The main purpose of genomic sequence and structural databases is to serve as an archive that is compact, durable and standardised. These databases have two main functions: first, the retrieval of sequences that have been submitted directly into the database, and second, the interpretation of each sequence by assigning functions to each section of the sequence and eliminating artefacts. All the databases currently available can be accessed via the web. As stated throughout this project, several databases can be accessed and used to determine not only the structure of proteins but also their functions, for example through the hierarchy of conservation of each protein, which includes hydrophobic packing, active sites and surface residues; amino acid propensity; globular and functional domains; peptide backbone conformation; and amino acid packing. The types of database reviewed above store different forms of information, depending on the manner of database storage and the nature of the information stored; see Table 02 (Attwood T.K. et al. 1999; Baxevanis, A.D. et al. 2005; Kane, D.E. et al. 2003; Lesk, A. et al. 2005).

    Table 02: Genomic Databases Information (Type of Data Stored)
    Database Storage                        Information Stored
    Flat Files                              Sequence
    Relational Oriented Database (Tables)   2 Dimensional Structure Images
    Object Oriented Database (Images)       3 Dimensional Images

As discussed above, genomic databases hold information in several formats, and there are different types of database. They are used not only to store information but also to determine and visualise a variety of aspects of structure and sequence information.
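As an illustration of sequence data held in a flat file, the sketch below reads records in FASTA format, a widely used plain-text layout in which a header line beginning with `>` is followed by one or more lines of sequence. FASTA is not named in the text above, so treating it as the flat-file format is an assumption, as are the function name and the example record names.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {record_id: sequence} dictionary."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record id is the first word after '>'; the rest is description.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(chunks) for name, chunks in records.items()}


example = ">seq1 demo protein\nMKV\nLAA\n>seq2\nGG\n"
sequences = parse_fasta(example)
```

A relational or object-oriented database would store the same sequences in typed tables or objects, whereas the flat file keeps them as plain text that any tool can read.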
The general idea of having secondary and tertiary/composite genomic databases is to enable the high-throughput determination of three-dimensional structures and the analysis of other “biological” macromolecules. This includes the classification and structural make-up of single protein domains as well as the determination of the relationship between polypeptide sequences and protein structures across a range of different proteins. It also includes the classification of protein families, folds and super-families, as well as the detailing of taxonomies (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). Protein structures and relationships are classified in the SCOP database, which is accessible via the web and maintained by the MRC. SCOP classifies globular proteins using several hierarchical levels: class, fold, super-family and family. The class refers to the general structural architecture of each domain. The fold groups proteins whose secondary structures share the same arrangement and topology, regardless of evolutionary origin. The super-family groups proteins that have low sequence identity but exhibit related structural and functional similarities. Proteins are placed into a defined family if they share more than about 30 to 50 percent sequence similarity or identity. The structural classes encompassed within the SCOP database include (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005):
o Mainly α.
o Mainly β.
o Mixed α-β (either α+β or α/β).
o Multi-Domain proteins.
o Membrane & Cell surface proteins or Peptides.
A second protein classification database is CATH (Class, Architecture, Topology, Homology), which is accessible via the web and is maintained by UCL.
This classification system relies to a greater extent on automated methods and uses manual inspection only when automated methods do not give results. Within this classification database there are five separate levels of classification: Class, Architecture, Topology, Homology and Sequence. The class of a protein is determined from its secondary structure content, through the packing of α-helices and β-sheets in the formation of domains; there are four types of packing, which are (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005):
    12. o Mainly α
o Mainly β
o Mixed α-β (either α+β or α/β)
o Limited Secondary Structure
Figure 6.1 - Domain Classes: Shows domains containing different secondary structures, being either all α, all β or a mixture of α and β. ADAPTED FROM: The Principles of Protein Structure '97, Birkbeck College - University of London [9].
The second level, Architecture, refers to the specific arrangement of the secondary structures, regardless of what connects them, and also describes the motifs that are created. The Topology level describes the overall shape and how the secondary structures are connected. This is done through structural comparison and clustering of domains, where 60 percent or more of the protein structure has to match. The Homology level groups domains that have greater than 35 percent identical sequences. The fifth level of the hierarchy, Sequence, clusters structures with greater than 35 percent sequence identity. Both SCOP and CATH combine manual inspection techniques with automated methods. These are used to differentiate between analogues and homologues that look similar. Manual techniques are used to separate protein groups which exhibit similar structures and functions. Both SCOP and CATH are effective ways of deriving the specific structure and function of proteins and the relationships between them. The two databases enable structural alignments to be carried out and functional inferences to be drawn, and they allow the sequence features common to certain topologies to be visualised (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
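As a rough illustration of how such a hierarchy can be handled in software, the sketch below parses a dot-separated CATH-style code into its first four levels and reports the deepest level two domains share. The class labels follow the four classes listed in this section; the function names and the example codes used in the usage note are illustrative assumptions, not part of any official CATH tool.

```python
from typing import NamedTuple, Optional

# Labels follow the four classes listed in this section; the numbers 1-4
# are the class numbers used in CATH-style codes.
CATH_CLASSES = {
    1: "Mainly alpha",
    2: "Mainly beta",
    3: "Mixed alpha-beta",
    4: "Limited secondary structure",
}

LEVELS = ("class", "architecture", "topology", "homology")


class CathCode(NamedTuple):
    """The first four levels of a code such as '3.40.50.300'."""
    class_id: int
    architecture: int
    topology: int
    homology: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dot-separated code into its four numeric levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Return the name of the deepest level at which two codes still agree.

    Returns None if the two domains are not even in the same class.
    """
    shared = None
    for level, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = level
    return shared
```

For example, `deepest_shared_level(parse_cath_code("3.40.50.300"), parse_cath_code("3.40.50.720"))` reports that the two example codes agree down to the topology level but belong to different homology groups.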
Protein Data Bank (PDB) is a specific protein structure database, which was established at Brookhaven National Laboratory and is now managed by the Research Collaboratory for Structural Bioinformatics (RCSB). Its primary function is the submission and retrieval of protein structures, and it offers 3D structural data for nucleic acids, carbohydrates and polypeptides. The structures that are submitted to and obtained from this vast database are mostly experimentally determined through X-Ray Crystallography and NMR, and can be accessed by the public through the World Wide Web. The PDB files give a full description of the 3-Dimensional structure of each protein contained within the database, in a column-oriented text format (see Figure 6.5), along with other molecules such as water, drug compounds and ions. Two newer file formats have been created, mmCIF and MMDB, both of which are built on data description languages. Each file contains the coordinates of each atom along with annotations, comments and experimental details (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 6.2 - Protein Databank Main Page: ADAPTED FROM - Department of Bioinformatic & Life Sciences, Soonsll University [10].
Figure 6.3 - Protein Databank Search: ADAPTED FROM - Cyberinfrastructure Technology Watch (CTWatch) [11].
Figure 6.4 - Protein Databank Search Results: ADAPTED FROM - Chemistry Department, California Polytechnic State University [14].
Figure 6.5 - Protein Databank Text Format: ADAPTED FROM - School of Molecular and Microbial Sciences, University of Queensland [15].
As depicted in the figures above, the PDB database contains not only information-rich text on each entry but also 3D structural data, which can be viewed through visualisation software that is linked to each protein record in the database.
There are two types of information contained within this database on each sequence “implicit” and “explicit” which enables the
    13. construction of a three dimensional protein structure. Each record (“protein”) contained within the database is identified by a four-character code (the PDB ID). The structural data of bio-molecules (“proteins”) can be visually represented through software such as VMD, RasMol, PyMOL, Jmol, MDL Chime and MBT Protein Workshop/Simple Viewer (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). NCBI hosts one of the largest and most accessible collections of databases, which can be accessed by the public through the World Wide Web. One of the systems it hosts is Entrez, which provides structural information on proteins by searching and compiling data from sources such as SwissProt, PIR, PRF, PDB, and translations from GenBank and RefSeq. Entrez is a Global Query Cross-Database search engine, which allows a user to search and retrieve structural, sequence and reference information from each database contained in or linked to the NCBI website. It allows the viewing of both gene and protein sequences along with chromosome maps, and it integrates information from scientific literature, DNA & Peptide sequence databases, 3D Protein Structure & Domain Data and taxonomic information to create a highly adaptive and connected system of information.
Figure 6.6 - Entrez Database: Showing the database browsers. ADAPTED FROM: Biological Research Computer Hierarchy (BIRCH), University of Manitoba [13].
This online report will first focus on Entrez's Structure Index, which is an NCBI page that specifically relates to the visualisation and retrieval of the 3-Dimensional structure(s) of each protein (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). A structure query lists all the results with a PDB identifier and name, as well as a generalised description of the protein in each result.
Each query result is accompanied by a list of links to the MMDB Structure Summary page and to the Entrez 3D Domains Index, Protein/Nucleotide Index, PubMed Citations Index & Entrez Taxonomy Index. The Entrez 3D-Domain Index page on the NCBI website is used to retrieve 3D domain information from domain queries. Each query returns a list of domain names, including general descriptions of the structure of each domain. Each 3-Dimensional Domain result is linked to an MMDB Structure Summary page as well as to a VAST 3D-Domain Neighbours Summary, along with links to the Entrez Structure Index, Entrez Protein/Nucleotide Index, Entrez PubMed Citations Index and/or Entrez Taxonomy Index. The Entrez structure database, like the Protein Data Bank, contains experimentally determined three-dimensional structures, which were determined through either X-ray crystallography or NMR spectroscopy (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). VAST (Vector Alignment Search Tool) is a tool that enables the user to search for and locate structural similarities between protein domains, by searching through the 3-Dimensional Domain database and locating similar secondary structural arrangements. Each secondary structure element is represented in VAST as a vector; these vectors are then aligned, and the search can be performed on submitted Protein Data Bank files. The current Entrez database connects each of the 3-Dimensional domains to a list of polypeptides, which are also linked to related (“homologous”) 3-Dimensional domains. Each 3-Dimensional vector element is derived solely from a protein's secondary structure; no sequence information is used during a search, so VAST is able to detect similarities between structures even without sequence similarity.
VAST, as an alignment and database search tool, is useful in investigating the relationships between the 3-Dimensional structures of proteins, in particular when used together with SCOP (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
    14. 7. 3-Dimensional Structural Data All 3-Dimensional data can be recorded as a ball and stick model, which includes the details and dimensions of each atom. A model can first be obtained from the sequence of either nucleic acids or peptides, by drawing and determining a 3-dimensional model of the backbone of the given sequence. For each polypeptide, the sequence is always read from the N-terminus (also known as the amino-terminus, NH2-terminus or amine-terminus), and each residue is compared against the structural composition, conformation & orientation of the twenty most common amino acids using a “residue library”. The chemical structure of the polypeptide sequence is recorded, and the 3-Dimensional structural data is then established through the measurement (in angstroms) of the position of each atom, starting from the N-terminus. Through this, the coordinates of each atom of the polypeptide chain on the x, y and z axes are calculated and recorded. Each structural database stores (archives) and maintains such records of each protein molecule, which can then be retrieved through a publicly accessible web-based server, see Figure 6.5 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). Regardless of the file format the information is stored in, each format contains structural coordinates, which give the spatial location of each individual atom within a protein along the x, y and z axes. In addition, each atomic coordinate is labelled with the element, residue and molecule to which it belongs; this labelling is known as a “chemical graph”. Chemical graphs, like ball and stick models, use a residue library of the twenty most common amino acids, which also contains tables of atom types and bond information (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Molecular visualisation relies heavily on computer graphics, but also on computational prediction techniques and modelling, as outlined in Section 5 of this document. The specialised software packages in use all perform the same function of creating 3-Dimensional pictures that can be rotated and altered to show specific peptides, sequences and structures within a protein. Each visual representation of a protein is produced by “connecting the dots”, using one of two “minimalist” approaches to the storage of information about the bonding between atoms/molecules; in both, the physical rules of chemistry are always observed. The first is the “legacy approach”, also known as the “chemistry rules approach”. This approach does not use residue dictionaries, only bond length and bond type dictionaries; visualisation software that works directly from PDB data files uses this approach. The second is the “Molecular Modelling Database (MMDB)” approach, which derives the graphical representation of any 3-Dimensional protein structure using not only the data contained within PDB but also standard residue dictionaries (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
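The “chemistry rules approach” of connecting the dots can be sketched in a few lines: two atoms are treated as bonded when the distance between them is no more than the sum of their covalent radii plus a small tolerance. The radii below are approximate illustrative values and the tolerance is an assumption; this is a minimal sketch, not the rule set of any particular viewer.

```python
import math

# Approximate covalent radii in angstroms (illustrative values only).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}

# Assumed slack, in angstroms, added to the sum of the two radii.
TOLERANCE = 0.4


def infer_bonds(atoms):
    """Infer bonds from coordinates alone.

    `atoms` is a list of (element, (x, y, z)) tuples. Returns a list of
    index pairs (i, j) with i < j for every pair judged to be bonded.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            element_i, pos_i = atoms[i]
            element_j, pos_j = atoms[j]
            limit = COVALENT_RADIUS[element_i] + COVALENT_RADIUS[element_j] + TOLERANCE
            if math.dist(pos_i, pos_j) <= limit:
                bonds.append((i, j))
    return bonds


if __name__ == "__main__":
    # A made-up fragment: N bonded to C, which is bonded to a second C,
    # plus a distant O that should remain unconnected.
    fragment = [
        ("N", (0.0, 0.0, 0.0)),
        ("C", (1.46, 0.0, 0.0)),
        ("C", (2.0, 1.42, 0.0)),
        ("O", (6.0, 0.0, 0.0)),
    ]
    print(infer_bonds(fragment))
```

A residue-dictionary approach would instead look up which atoms are bonded within each named residue, which avoids errors when coordinates are distorted.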
    15. 8. Structural Visualisation Several styles and software packages are used to graphically depict protein structures, primarily because of the need to visualise particular aspects of a protein's structure. The main source of information used to create such graphical representations is PDB data files, from which surplus regions are frequently edited out so that a user of molecular visualisation software can see what they want. There are a number of graphical outputs, which are (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005):
o Wireframe Model format: Details the chemistry of a molecular structure, see figure 8.1.
o Space-filled Model format: Details the size and surface of a molecular structure, see figure 8.2.
o Ribbon Model format: Details the organisation and path of secondary structure elements and enables the identification of secondary structures in complex topologies, see figure 8.3.
Figure 8.1 - Wireframe Model of a GFP molecule: ADAPTED FROM - Center for BioMolecular Modeling (CBM), Milwaukee School of Engineering [7].
Figure 8.2 - Space-Filled Model of an HLA-A2.1 molecule: ADAPTED FROM - Department of Crystallography, Birkbeck College [8].
Figure 8.3 - Ribbon Model of an HLA-A2 molecule: ADAPTED FROM - Department of Crystallography, Birkbeck College [8].
Structural file formats come in three different forms. The first is the pdb file format, a column-oriented text file format that describes the three-dimensional structures of molecules. Each pdb file contains an extensive, high-level description of a protein's properties, with hundreds of lines of information about atoms and their coordinates as well as the sequence of amino acids contained within the protein (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
A normal pdb file has several sections. It begins with the HEADER, which specifies the pdb id code; the TITLE, which contains the name of the protein; and AUTHOR, which lists the contributors and researchers. These records appear first in a pdb file. The REMARK records contain standardised information, annotations and remarks about the protein structure. The SEQRES records hold the primary protein sequence of each peptide chain; the chains within a single protein are denoted A, B, C and so on. The ATOM records list each atom that is part of the protein together with its atomic coordinates (x, y and z). The HETATM records list the hetero-atoms and, like the ATOM records, give the atomic coordinates (x, y and z) of each one. Hetero-atoms are not part of the protein chain itself; they belong to other molecules such as water, ions and bound ligands (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
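Because the pdb format is column oriented, an ATOM or HETATM record can be read simply by slicing fixed character positions. The sketch below is a minimal illustration using the standard column positions for the serial number, atom name, residue name, chain, residue number, coordinates and element; the helper that builds a demonstration line and the values passed to it are made up for the example.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Build a demonstration ATOM line laid out in fixed columns.

    Occupancy and temperature factor are filled with placeholder values.
    """
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{0.00:6.2f}{'':10}{element:>2}"
    )


def parse_atom_line(line):
    """Parse one ATOM/HETATM record; return None for any other record type.

    Slices are zero-based, so columns 31-38 (x) become line[30:38].
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


if __name__ == "__main__":
    demo = format_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614, "N")
    print(demo)
    print(parse_atom_line(demo))
```

Reading a whole file then amounts to applying `parse_atom_line` to every line and keeping the results that are not None.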
    16. The second file format is the Macromolecular Crystallographic Information File (mmCIF), a file format made up of several kinds of token, including data blocks; it is derived from the Crystallographic Information File (CIF) format. The mmCIF format is defined by a macromolecular CIF dictionary, and each item of data is matched to an entry in that dictionary, which allows validation to occur. Like the pdb file, it contains space group and unit cell parameters as well as atomic coordinates. Within mmCIF, ids are numbered with integers and the files hold the same information as pdb files; information derived from the primary coordinate data is stored explicitly, which leaves less room for ambiguity (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). The third file format is the MMDB file format, which uses the ASN.1 standardised data description language. It reuses existing data specifications for describing such things as references and citations, and files are stored either as text or as binary. This enables the representation of complex data types. ASN.1 originates in telecommunication protocols, so data described in this notation, whether descriptive, atomic coordinate or sequence data, can be encoded and transmitted over networks for access through the web. This format is used by NCBI to store GenBank, PubMed and MMDB (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 8.4 - RasMol: Shows the structure of Thymidylate Synthase (PDB ID: 2TSC), seen through the RasMol Viewer Software. The RasMol generated image highlights helices (orange) and sheets (green). ADAPTED FROM: the Protein ChemCards, Bioinformatics Courses and Lectures [21].
A number of different software packages are used not only to examine the molecular structures of proteins but also to display structural information and produce high resolution 3-Dimensional pictures. Several of the programs that are able to interpret Protein Data Bank data are java based, RasMol being an exception (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). First this document will look at the RasMol viewer, a molecular visualisation program which is one of the most widely used and most popular and is regarded as one of the most accurate. RasMol builds chemical graphs from pdb files; it does not validate them against a residue library or perform alignments of the sequences they contain. As a molecular visualisation tool RasMol recalculates information, edits out inconsistencies and is able to use mmCIF formatted files. RasMol is a free open source program, which requires a toolkit library to enable it to create interactive visual representations of proteins. RasMol images contain information such as the types of components, atom serial numbers, atom names and the coordinates of each atom, which are standardised and are expressed identically each time they are used, see figure 8.4 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 8.5 - Cn3D: Shows the structure of the SRY protein, through the Cn3D Viewer Software. The Cn3D generated image highlights the DNA strands' backbone in blue and brown, whilst the protein alpha helices are in green & loops are in light blue. ADAPTED FROM: Institute of Biology and Department of Medical Genetics, Charles University [22].
Cn3D, like RasMol, is a 3-Dimensional protein structure viewer, which is specifically used to read, translate and view the 3-Dimensional structure encoded within MMDB data records. Because these records carry explicit bonding information, free of errors or unknown chemical graph expressions, they enable a fuller and more reliable 3-Dimensional expression of protein structures. Cn3D is therefore far more dependent on the more complete chemical graph expressed through the ASN.1 language used in MMDB files. It is able to animate each structure and can display structural alignments produced by VAST, see figure 8.5 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
Figure 8.6 - Jmol: Shows the structure of Hemoglobin. The Jmol generated image highlights the backbone using trace colors and the heme groups through spacefill. ADAPTED FROM: Screen Shots, taken from the Jmol online Webpage [23].
    17. Jmol, like RasMol, is a molecular viewer used in bioinformatics, biochemistry and chemistry; it is also a free open source program, and it is java based. Jmol is a multi-platform program that can be run on Windows, Mac, Linux and Unix systems, which makes it versatile and easily incorporated into other java applications; it can also be used and accessed via the web. The Jmol program is able to read a variety of molecular file formats, including pdb, cif, mol and cml, see figure 8.6 (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). Another molecular visualisation tool is the Molecular Biology Toolkit (MBT). It is not a free-standing program and requires additional applications to be built around it, but like Jmol and RasMol its libraries are arranged in a hierarchical form, which restricts the effects of classes on molecular components within a graphical protein structure. 2-Dimensional and 3-Dimensional graphical images and representations can be created and displayed using MBT along with Java3D (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005). RasMol and Chime, along with some other molecular viewing tools, use scripting, although MBT does not. Scripting is a method of passing commands and variables to the viewer and acts as a portal for running its methods. Scripting allows menus within graphical viewing programs to let a user make changes to the 2-Dimensional and 3-Dimensional graphical images of proteins, and makes complex commands easy to remember and accessible to the user (Attwood T.K. et. al. 1999; Baxevanis, A.D. et. al. 2005; Kane, D.E. et. al. 2003; Lesk, A. et. al. 2005).
    18. 9. Evaluation As summarised in the abstract and introduction, the scope of information that is available about protein structures obtained from Genome Sequencing Projects is vast. Bioinformatics is a relatively new area of science, which combines a wide range of scientific and computational aspects: the experimental determination of protein structures; the analysis of biological sequence information from DNA, RNA and peptides; the recovery of evolutionary patterns within proteins; the prediction of gene function; and the mining of biological data using high-powered computational methods. As mentioned within this document, the specific section of bioinformatics that deals with the determination, analysis, retrieval and representation of protein structural information is known as "Structural Genomics". The “Primary” (Traditional) meaning of Structural Genomics is the characterisation of the physical structure of a complete genome through the use of gene mapping and sequencing, as in the Human Genome Project and the subsequent genome projects for organisms such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans. The modern meaning of structural genomics is the determination of three-dimensional protein structures using the output of genome sequencing projects. Structural Genomics has two approaches available for obtaining protein structures, which can then be submitted to the structural databases. The first approach focuses on predicting a protein's structure from a set of proteins whose structures are already known, enabling the complete visual representation of a range of protein folds and domain structures. This approach relies heavily on the idea that the number of protein domains, folds and extended structures at the Secondary and Tertiary levels of protein organisation is limited.
This approach uses computer-based methods. For these methods to create a functional, accurate and representative computer-generated illustration of a protein, there has to be a high degree of confidence and similarity between the undetermined sequence and the determined protein sequence. Advances in programming and computer technology should improve not only the success of comparative computational modelling but also the accuracy of the illustration and visualisation of computer-modelled proteins. Computational prediction and modelling methods also depend on the experimental determination and purification of proteins. To make purification easier and more successful, proteins from hyperthermophilic bacteria or archaea can be used, and the genetic sequences that code for them can be readily replicated and cloned through recombination and transformation techniques using Escherichia coli. The 3-Dimensional structures of proteins purified in this way are determined using either X-Ray Crystallography or NMR spectroscopy. The greater the number and accuracy of determined proteins, the greater the representation of domain structures and folds within protein structures, which extends the abilities of homology modelling (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). This approach relies on three broad structural prediction methods. The first is comparative modelling, which relies particularly on protein families: protein homologues are located using PDB files and sequences, and the identified PDB homologue is used as a template. Comparative modelling is a similarity-based template modelling technique, so its errors increase as the similarity between the sample protein sequence and the template sequence decreases.
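The similarity between a sample sequence and a template is commonly summarised as the percent identity over an alignment. The sketch below shows one simple way of computing it from two already-aligned sequences; it assumes that gapped columns are ignored, which is only one of several conventions in use, and the function name and example sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Columns in which either sequence has a gap are skipped (an assumed
    convention; other definitions divide by the full alignment length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Made-up aligned fragments, purely for illustration.
    print(percent_identity("AC-DE", "ACGDF"))
```

A value computed this way can then be compared against whatever identity cut-off is being used to judge whether a template is close enough for comparative modelling.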
The errors which can occur include the divergence of peptide side chains in core protein sequences; these are critical in regions of protein function such as ligand-binding sites. Other errors, which occur in the alignment and sequence comparisons, involve the distortion or shift of aligned sequence regions, causing alternative protein conformations in small localised regions outside the alignment segments. These localised distortions can be up to 3 Å; this also includes the effects of subunit packing. These effects can be minimised through the use of multiple alignments of the sample and template sequences. Errors are more frequent in segments that are not aligned or do not have any templates, which creates inaccuracies within a model. The largest source of errors in Homology Modelling, however, is misalignment, particularly when the identity between the sample sequence and the template sequence falls below 30 percent. To create a viable and accurate protein model, two conditions have to be met: the first is correct alignment and the second is the accuracy of the modelling. There are two ways of aligning the sample sequence and the template to reduce the level of errors in a model. The first is to use multiple alignments and the second is to “iteratively modify” the aligned regions to enable the prediction of errors in a model (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). The second structural prediction method is Fold Recognition (Threading) modelling. This is used where there is little or no sequence similarity, and relies on what is known about structural conformation, based on the properties of each amino acid and the probability and preference each residue has for any one secondary structure.
Fold Recognition (Threading) modelling determines an unknown structure by how well its sequence fits certain known structural models (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). The third type of structural prediction is Ab initio modelling, also referred to as “Ab initio prediction”, which in contrast to the other methods mentioned in this document is
    19. used to build molecular models for any given sequence without using a template, by using minimal energy functions and lattice models. The advantage of Ab initio modelling is that the mathematical calculations used in creating a protein model can be very accurate, because they use the properties that best match the experimental data; however, it can only be used for smaller molecules and is usually applied to molecules that contain around 50 atoms (Baker, D. et. al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). The second approach of Structural Genomics is experimental determination itself. There are three main methods of experimentally determining protein structures: X-Ray Crystallography, NMR spectroscopy and Cryo-Electron Microscopy. Each of these techniques is time consuming, but each has its own advantages depending on what type of protein structure is being analysed. X-Ray crystallography produces high resolution molecular representations, at around 2 Å, but X-Ray diffraction only produces views of molecular structures that are static. Structural representations produced using X-Ray diffraction do not in themselves indicate or help to explain function. Flexible structures in crystallised proteins, such as surface loops, are seldom detected, and as a result several protein structures are incomplete; this is mainly because X-Ray diffraction is highly dependent on well-ordered electron density to produce the diffraction patterns needed to determine the protein's structure (Branden, C, et. al. 1991; Whitford, D. 2005; Hames, D. et. al. 2005). X-Ray Crystallography is also quite time consuming and crystals are often difficult to grow. NMR spectroscopy, by contrast, allows the detection of structures like surface loops in solution, removes the problem of static conformations and takes only a fraction of the time.
NMR allows not only the characterisation of macromolecular structures but also of their intermolecular interactions, and it combines high spatial resolution with high temporal resolution. NMR also requires knowledge of the peptide sequence. The protein does not have to be in an ordered crystal, but high concentrations of solubilised protein must be available (NMR structures are therefore also called solution structures). In biopolymers, the primary structure (sequence) logically breaks up the molecule into groups of coupled spins, normally one or two groups per residue. This is true not only for proteins, but also for nucleic acids and polysaccharides. A third technique which is used in the structural determination of proteins is Cryo-Electron Microscopy (Cryo-EM). This technique freezes protein samples very rapidly to extremely low temperatures; the low temperatures and rapid freezing allow the formation of highly ordered sheets that can give resolutions of between 5 and 10 Å. The technique also enables the depiction of the quaternary structure of a protein and the creation of extensive structural information. Cryo-EM samples, like NMR samples, are solution based, so as with NMR the proteins appear in their natural conformation. Samples can be damaged when being blotted, but the proteins are not distorted by staining. Cryo-EM allows the sample protein to adhere to a grid in an orientation preferred by the protein. Cryo-EM images can be fuzzy owing to the lack of absorption of the electron beam within the molecular structure, and, as with X-Ray Crystallography, sample preparation is quite time consuming (Branden, C, et. al. 1991; Whitford, D. 2005; Hames, D. et. al. 2005; Heymann, J. B. et. al 2001; Heymann, J. B. et. al 2007).
The second approach of Structural Genomics is experimental determination itself. There are three main methods for the experimental determination of protein structures: X-ray crystallography, NMR spectroscopy and cryo-electron microscopy. Each of these techniques is time consuming, but each has its own advantages depending on the type of protein structure being analysed. X-ray crystallography produces high-resolution molecular representations at around 2 Å, but X-ray diffraction only produces static views of molecular structures. Structural representations produced by X-ray diffraction do not indicate or help to explain function. Structures in crystallised proteins such as surface loops are seldom detected, and as a result several protein structures are incomplete. This is mainly because X-ray diffraction is highly dependent on electron density to produce the diffraction patterns needed to determine the protein's structure; see Table 03 (Branden, C. et al. 1991; Whitford, D. 2005; Hames, D. et al. 2005; Heymann, J. B. et al. 2001; Heymann, J. B. et al. 2007).

Table 03: NMR comparison with X-Ray Crystallography
NMR | X-ray crystallography
short time scale, protein folding | long time scale, static structure
solution, purity | single crystal, purity
< 20 kD, domain | any size, domain, complex
functional active site, domains | active or inactive, domains
atomic nuclei, chemical bonds | electron density
resolution limit 2-3.5 Å | resolution limit 2-3.5 Å
primary structure must be known | primary structure must be known (except if resolution is 2 Å or better for every single residue)
    20. X-ray crystallography is also quite time consuming and crystals are often difficult to grow. NMR spectroscopy, by contrast, allows the detection of structures such as surface loops in solution, removes the problem of static conformations and takes only a fraction of the time. NMR allows the characterisation not only of macromolecular structures but also of their intermolecular interactions, and combines high spatial resolution with high temporal resolution. NMR also requires knowledge of the amino acid sequence. The protein does not have to be in an ordered crystal, but high concentrations of solubilised protein must be available (NMR structures are therefore also called solution structures). In biopolymers, the primary structure (sequence) logically breaks up the molecule into groups of coupled spins, normally one or two groups per residue. This is true not only for proteins but also for nucleic acids and polysaccharides.
A third technique used in the structural determination of proteins is cryo-electron microscopy (cryo-EM). This technique freezes protein samples very rapidly to extremely low temperatures, which allows the formation of highly ordered sheets that can give resolutions of between 5 and 10 Å. The technique also enables the depiction of the quaternary structure of a protein and the creation of extensive structural information. Like NMR samples, cryo-EM samples are solution based, so the proteins appear in their natural conformation. Samples can be damaged when being blotted, but the sample proteins are not distorted when stained. The sample protein may also adhere to the grid in a preferred orientation. Cryo-EM images can be fuzzy because of the lack of absorption of the electron beam within the molecular structure and, as with X-ray crystallography, sample preparation is quite time consuming (Branden, C. et al. 1991; Whitford, D. 2005; Hames, D. et al. 2005; Heymann, J. B. et al. 2001; Heymann, J. B. et al. 2007).
Bioinformatics databases are split into several categories, which were reviewed broadly in Section 3. There are three types of database: Primary, Secondary and Tertiary ("Structural Classification") databases, each of which has its importance in bioinformatics and in the determination and prediction of protein structures. The primary databases are used to locate and match similarities between a sample of unknown sequence and the sequences contained within the database, which allows the rapid identification and classification of protein sequences. The secondary databases enable more extensive information about protein structures to be stored and retrieved. They store and maintain structural data for each sequence as well as other derived information that allows structural illustrations to be produced. These data are normally held in files that come in a number of different formats, depending on the database and on the visualisation software used to illustrate the structure. The Secondary and Structural Classification databases express in detail the higher-level organisation of protein structures, including the alpha helix, beta sheet and domain/motif structures present in a protein's structure, whereas Primary databases do not contain such information. The Structural Classification databases go further, allowing protein structures to be compared in a search for similarities and enabling the structural classification of folds, secondary structures and extended structures. For Structural Genomics these are the most useful databases and the ones primarily used for comparing and visualising structural information about proteins. Information retrieved from a database can be viewed in a variety of formats.
As depicted in Section 6 and in Table 02, there are flat files that contain sequential, atomic and other protein data, but there is also more extensive information that depicts the visual aspect of a protein. This comes in table and image formats that can be altered according to the user's needs, to view either functional or structural information about each protein (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). SCOP and CATH are both structural classification databases. SCOP relies on 30 to 50 percent sequence similarity, whilst CATH relies on a higher level of sequence identity (60 percent). Both depend heavily on automated and manual methods, but the SCOP database's automatic method tends to be unreliable in the comparison of structural relationships. The CATH database uses the Enzyme Classification (E.C.) system, allowing more efficient computational manipulation of data. Both CATH and SCOP are hierarchical domain classification systems for proteins that use a keyword interrogation system to search the database. In comparison with SCOP and CATH, the Protein Data Bank not only expresses information in "Relational Oriented" and "Object Oriented" formats but also provides extensive "Flat File" format outputs. Protein Data Bank files contain residue dictionaries, atom coordinates and sequence information, which are maintained in chemical graphs, and also hold the details of the authors and a description of the protein. The Protein Data Bank has made use of two different file formats (mmCIF and MMDB), which allow protein structures to be expressed visually and contain both "implicit" and "explicit" data. NCBI, on the other hand, contains an even larger database of sequences as well as structures. Like the other three databases, NCBI is an online database that is easily accessible through the internet.
This database is integrated with several other databases and is able to compile and retrieve information from each incorporated source. In addition to the sequence and structural information found in the PDB, NCBI provides chromosome maps and integrates scientific literature, DNA and peptide sequence databases, 3D protein structure and domain data, and taxonomic information, but all results are maintained and expressed (linked) in MMDB Structure Summary format. As in the Protein Data Bank, the majority of protein structures have been
    21. determined using either X-ray crystallography or NMR spectroscopy. Unlike the Protein Data Bank, NCBI is linked to the Vector Alignment Search Tool, also known as VAST, which enables the detection and determination of structural similarities between 3-dimensional structures (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002). The Protein Data Bank uses the column-orientated "flat" format pdb file to express structural data for each protein. The file contains a highly extensive level of information about atomic coordinates and other structural properties and is split into several sections. As stated above, pdb files include atom coordinates and sequence information, which are maintained in chemical graphs, as well as the details of the authors and a description of the protein. mmCIF is similar in structure to the pdb file in that it is divided into several blocks of extensive information, but in addition the mmCIF file contains a residue dictionary, which enables structural validation and therefore allows a more accurate image to be produced by molecular viewers. As mentioned above, mmCIF reduces the ambiguity within structural and sequence conformations. NCBI uses a different file format known as MMDB. Like the pdb and mmCIF files it is highly textual and highly detailed in its atomic and molecular descriptions of each protein, but like the pdb file it neither contains nor is linked to residue dictionaries and therefore cannot validate atomic and molecular conformations.
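To illustrate what "column-orientated" means in practice, the following is a minimal Python sketch, not taken from the original project, that reads the coordinates out of a single ATOM record by fixed character positions. The column ranges follow the published PDB format description; the function names `format_atom_line` and `parse_atom_line` and the sample values are hypothetical and chosen only for illustration.

```python
def format_atom_line(serial, atom_name, res_name, chain_id, res_seq, x, y, z):
    """Build a fixed-width ATOM record up to the z coordinate (illustrative helper)."""
    # Field widths mirror the PDB columns: record name (1-6), serial (7-11),
    # atom name (13-16), altLoc (17), residue name (18-20), chain (22),
    # residue number (23-26), insertion code (27), x/y/z (31-54).
    template = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}"
    return template.format("ATOM", serial, atom_name, "", res_name,
                           chain_id, res_seq, "", x, y, z)


def parse_atom_line(line):
    """Extract selected fields from an ATOM record using fixed column slices."""
    return {
        "record": line[0:6].strip(),       # columns 1-6
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }


# Made-up sample atom: backbone nitrogen of residue MET 1 in chain A.
sample = format_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom_line(sample)
print(sample)
print(atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```

Because every field sits at a fixed position, the record can be read by slicing alone, without splitting on whitespace. That is also why such files are sensitive to misaligned columns.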
Each of these file formats is used by different visualisation software. The first visualisation program this document looked at was RasMol. As described above, this program uses the mmCIF file format to retrieve structural and atomic data and to display it with high levels of conformational accuracy, because the residue dictionaries enable RasMol to validate and calibrate the visualised images more effectively. Like the other programs for visualising proteins, it is available online and can readily be used to view protein structures through the web. Cn3D translates data files as RasMol does, but uses the MMDB file instead of the mmCIF file. This means that there is no residue dictionary to validate visualised structures, but Cn3D removes errors in chemical bonding through the use of the ASN.1 language and allows streamlined, easy access to structural information through web servers and phone lines, although this does not compensate for the accuracy and level of validation offered by RasMol. Jmol is a broad-based visualisation program that is able to run the same scripts as RasMol. Jmol uses not only the pdb file format but also an extended range of other file formats, including mmCIF files, which current RasMol versions are also able to use. Jmol is broader in that it can view structural information from file formats other than pdb and mmCIF. Like RasMol, Jmol can validate and accurately maintain visualised structural conformations, since it is able to access residue libraries using the CIF data files for protein structures (Baker, D. et al. 2001; Leach, A.R. 2001; Moult, J. 1999; Samudrala, R. 2002).
    22. 10. Conclusion This document has touched on what is available depending on what a researcher or user of genomic databases is looking for. There are extensive online resources, freely accessible through the web, that offer not only a wide range of facilities but also extensive information. In Structural Genomics, many of the structures now available would not exist without the experimental methods used to determine them, since the computational techniques used to model them depend on experimentally determined structural conformations of proteins. This allows structural conformations to be compared with peptide/nucleotide sequences, and undetermined sequences to be compared with the pre-determined structural conformations of known sequences. Even though each experimental and computational method has its own merits, the most accurate and reliable experimental techniques that can be used in conjunction with computational methods are X-ray crystallography and NMR spectroscopy (see Table 03). X-ray crystallography is better suited to proteins larger than 20 kD and offers a resolution of between 2 and 3.5 Å, whilst NMR spectroscopy is better suited to smaller protein structures and smaller or isolated domains of less than 20 kD. To gain extra definition of structures within a protein that cannot be determined by X-ray crystallography, cryo-electron microscopy can be used to show those areas, but at a lower resolution of between 5 and 10 Å. Each of the predictive, computational modelling methods is used for a set purpose. Homology modelling is used for an undetermined protein sequence with a known homologue, that is, above 30% similarity.
If there is less than 30% identity, the undetermined protein is put through fold recognition modelling, where the sequence is fitted to a set protein model. If no accurate model is found, the protein sequence is modelled using ab initio prediction, which uses physical laws to dictate a protein's conformation. Homology modelling is by far the most accurate of the three methods, with an accuracy of about 3 Å, and is the quickest and easiest to perform, along with fold recognition, which is just as quick and only marginally less accurate [24]. Both the primary and secondary databases can be used in structural modelling, whilst the structural databases contain protein structures. The best structural classification system would be CATH rather than SCOP, since CATH relies on 60 percent identity and uses the E.C. system, which allows greater computational manipulation of data classification. Among databases that contain actual structural representations rather than their classification, the Protein Data Bank is the better structural database, because each structure file held within it can be validated against the residue dictionaries in the mmCIF file format. This format has been updated from the older pdb file format and contains identical information on sequences, atomic coordinates, protein IDs ("identifier codes"), the name of the protein and the list of authors, along with descriptions of the structure of the protein. Furthermore, the use of both RasMol and Jmol with this database gives greater versatility in the visualisation of proteins, along with Jmol's capability to visualise protein structures from file formats other than pdb and mmCIF (excluding MMDB).
NCBI is a far greater tool for similarity comparisons between protein structures, because it allows a greater range of data searching and compiles information from a larger variety of databases ("Global Query Cross-Database"), and it also enables viewing of both primary and secondary databases along with extensive chromosomal maps. Jmol and RasMol are the most accurate visualisation programs for depicting protein structures, owing to their access to residue libraries. Jmol, however, is able to read more file formats and is in that respect more versatile than RasMol.
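The choice between the three predictive modelling methods described in this conclusion can be summarised as a simple decision rule. The Python sketch below is only an illustration of that rule under stated assumptions: the function name `choose_modelling_method` and the flag `fold_match_found` are hypothetical, and treating exactly 30% identity as sufficient for homology modelling is an assumption, since the text only distinguishes "above" and "less than" 30%.

```python
def choose_modelling_method(identity_percent, fold_match_found=False):
    """Pick a modelling approach from sequence identity to the best known homologue.

    identity_percent: percent identity to the closest protein of known structure.
    fold_match_found: whether fold recognition fitted the sequence to an
                      accurate model (hypothetical flag for this sketch).
    """
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity_percent must be between 0 and 100")

    # A known homologue at or above 30% identity: homology modelling.
    # (The boundary case of exactly 30% is an assumption of this sketch.)
    if identity_percent >= 30:
        return "homology modelling"

    # Below 30% identity: try fitting the sequence to a known fold.
    if fold_match_found:
        return "fold recognition"

    # No accurate model found: fall back to prediction from physical laws.
    return "ab initio prediction"


for identity, fold in [(45, False), (20, True), (20, False)]:
    print(identity, fold, choose_modelling_method(identity, fold))
```

The ordering of the checks mirrors the order given in the text: homology modelling is preferred where it applies, and ab initio prediction is used only when the other two methods cannot be.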
    23. References
        Attwood, T.K. and Parry-Smith, D.J.; Introduction to Bioinformatics. Longman (1999).
        Bae, E. and Phillips, G.N. Jr.; Structures and Analysis of Highly Homologous Psychrophilic, Mesophilic, and Thermophilic Adenylate Kinases. The Journal of Biological Chemistry 279 (27): 28202–28208 (2004).
        Baker, D. and Bonneau, R.; Ab initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct. 30, 173 (2001).
        Baker, D., Bonneau, R., Chivian, D., Ruczinski, I., Rohl, C., Tsai, J. and Strauss, C.E.M.; ROSETTA in CASP4: Progress in ab initio protein structure prediction. Proteins: Structure, Function, and Genetics Suppl. 5, 119 (2001).
        Baker, D. and Sali, A.; Protein structure prediction and structural genomics. Science 294, 93 (2001).
        Bates, A.D., Turner, P.C., McLennan, A.G. and White, M.R.H.; Instant Notes: Molecular Biology (2nd Edition). BIOS Scientific Publishers (2000).
        Baxevanis, A.D. and Ouellette, B.F.F. (eds.); Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (3rd Edition). John Wiley (2005).
        Berg, J.M., Tymoczko, J.L. and Stryer, L.; Biochemistry (6th Edition). W.H. Freeman (2007).
        Branden, C. and Tooze, J.; Introduction to Protein Structure. Garland Publishing (1991).
        Bourne, P.E. and Weissig, H. (eds.); Structural Bioinformatics. WileyEurope (2003).
        Bowie, J.U.; Solving the membrane protein folding problem. Nature 438: 581–589 (2005).
        Campbell, A.M. and Heyer, L.J.; Discovering Genomics, Proteomics and Bioinformatics. Benjamin Cummings (2007).
        Fersht, A.; Structure and Mechanism in Protein Science. W.H. Freeman and Co. (1999).
        Gibas, C. and Jambeck, P.; Developing Bioinformatics Computer Skills. O'Reilly and Associates Inc. (2001).
        Hames, D. and Hooper, N.; Instant Notes: Biochemistry (3rd Edition). Taylor & Francis (2005).
        Heymann, J.B.; Bsoft: image and molecular processing in electron microscopy. Journal of Structural Biology 133 (2-3): 156–69 (2001).
        Heymann, J.B. and Belnap, D.M.; Bsoft: Image processing and molecular modeling for electron microscopy. Journal of Structural Biology 157: 3–18 (2007).
        Heymann, J.B., Cardone, G., Winkler, D.C. and Steven, A.C.; Computational resources for cryo-electron tomography in Bsoft. Journal of Structural Biology, in press (2007).
        Hickey, G.I., Fletcher, H.L. and Winter, P.; Instant Notes in Genetics (3rd Edition). Taylor & Francis Group (2007).
        Krane, D.E. and Raymer, M.L.; Fundamental Concepts of Bioinformatics. Benjamin Cummings (2003).
        Kleanthous, C. (ed.); Protein-protein Recognition. Frontiers in Molecular Biology. Oxford University Press (2000).
        Leach, A.R.; Molecular Modelling: Principles and Applications (2nd Edition). Longman (2001).
        Lesk, A.; Introduction to Bioinformatics (2nd Edition). Oxford University Press (2005).
        Moult, J.; Predicting protein three-dimensional structure. Current Opinion in Biotechnology 10 (6): 583–588 (1999).
        Patrick, G.L.; Organic Chemistry (2nd Edition). Taylor & Francis Group (2004).
    24. Petsko, G.A. and Ringe, D.; Protein Structure and Function. New Science Press Ltd (2004).
        Samudrala, R.; Modeling genome structure and function. Pure Appl. Chem. 74 (6): 907–914 (2002).
        Turner, P.; Molecular Biology (3rd Edition). Taylor & Francis (2005).
        Westhead, D.R., Parish, J.H. and Twyman, R.M.; Instant Notes: Bioinformatics. BIOS Scientific Publishers (2002).
        Zubay, G.L.; Biochemistry (4th Edition). Wm. C. Brown Publishers (1998).
        1. http://www.chem.ucsb.edu/~kalju/chem110L/public/tutorial/images/
        2. http://www.langara.bc.ca/biology/mario/Biol2315notes/biol2315chap3.html
        3. http://kentsimmons.uwinnipeg.ca/cm1504/proteins.htm
        4. http://research.yale.edu/ysm/article.jsp?articleID=51
        5. http://bouman.chem.georgetown.edu/nmr/protein.htm
        6. http://www.bnl.gov/bnlweb/pubaf/pr/PR_display.asp?prID=07-73
        7. http://www.rpc.msoe.edu/cbm2/gfp1.htm
        8. http://www.cryst.bbk.ac.uk/PPS2/projects/vun/MHC_master.htm
        9. http://www.med.unibs.it/~marchesi/pps97/course/section9/9_term.html
        10. http://bioinfo.ssu.ac.kr/bbs/zboard.php?id=link_new&page=1&category=&sn=off&ss=on&sc=on&keyword=&prev_no=&sn1=&divpage=1
        11. http://www.ctwatch.org/quarterly/print.php?p=83
        12. http://genome.gsc.riken.go.jp/hgmis/posters/chromosome/pdb.html
        13. http://home.cc.umanitoba.ca/~psgendb/GDE/dataset/dataset.html
        14. http://chemweb.calpoly.edu/llindert/313-structure-tutorial.html
        15. http://florey.biosci.uq.edu.au/Subjects/BC327/Material/
        16. http://jmol.sourceforge.net/
        17. http://www.umass.edu/microbio/rasmol/
        18. http://mbt.sdsc.edu/
        19. http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml
        20. http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dtut.shtml
        21. http://www.bioinformaticscourses.com/ISB/sp2003/2TSC/
        22. http://biol.lf1.cuni.cz/ucebnice/pohlavi.htm
        23. http://jmol.sourceforge.net/screenshots/
        24. http://lectures.molgen.mpg.de/Algorithmische_Bioinformatik_WS0405/material/Steinke_lecture_19_1.pdf