5. The need for bioinformaticians. The number of entries in databases of gene sequences is increasing exponentially. Bioinformaticians are needed to understand and use this information. [Figure: GenBank growth, 1982 to 1999]
20. Phylogeny inference: Analysis of sequences allows evolutionary relationships to be determined. [Figure: phylogenetic tree of E. coli, C. botulinum, C. cadavers, C. butyricum, B. subtilis and B. cereus, constructed using the Phylip package]
24. RNA structure prediction: Structural features of RNA can be predicted. [Figure: predicted RNA secondary structure, drawn base by base]
25. Protein structure: the 3-D structure of proteins is used to understand protein function and to design new drugs.
26. Gene sequencing: Automated chemical sequencing methods allow rapid generation of large data banks of gene sequences.
As a result, the last few years have seen an explosion in the field of bioinformatics, a new field of study which combines methods from computer science and information technology to analyze biological information. In its purest definition, bioinformatics is the application of information technology to biology.
Rather than sequencing isolated genes, more and more research groups and companies are now focussing on sequencing whole genomes from organisms of medical, commercial or scientific importance. The first bacterial genome to be completely sequenced was that of Haemophilus influenzae, in 1995. In 1996, the first complete eukaryotic genome, that of baker’s yeast (Saccharomyces cerevisiae), was published. New complete genomes are now being published every month, and human genome projects, both publicly and privately funded, are well on the way to completion.
The new genome technologies coupled with bioinformatics promise a revolution in almost all fields of the life sciences and in society. For example, just in the medical sciences: In the pharmaceutical industry, these methods have been embraced as a shortcut to the discovery of better drugs. Knowledge of a protein’s structure, for instance, can considerably shorten the time taken to develop specific inhibitors of this protein for therapeutic use. The study of how genome variation affects drug effectiveness (pharmacogenomics) is still in its infancy, but promises to deliver more effective and specific therapeutic drugs which are tailored to the individual’s genetic make-up. A knowledge of the genome also facilitates the targeting of genetic diseases by drug or gene therapy. Genome analysis also provides the framework for the study of gene and protein expression using DNA microarray technology or 2-dimensional gel electrophoresis, with broad-ranging applications. These techniques can be applied not only in the medical sciences, but also in agriculture, biotechnology and other fields.
The last 10 years have seen recombinant DNA techniques pervade the whole of biology and biology-related fields. The use of plasmids, restriction enzymes, DNA sequencing methods and, more recently, PCR has allowed the cloning and characterization of many genes and of their protein products. The growth in DNA sequence data available to researchers is phenomenal. For example, GenBank, a major database where molecular biologists store the DNA sequences they obtain and make them available, doubles in size approximately every 14 months. At the beginning of 1999, GenBank contained over 3 million sequence records and was growing at a rate in excess of a million nucleotides deposited per day. GenBank is shown here as an example, but other sequence databases grow at similar rates. Source: GenBank release notes, National Center for Biotechnology Information (http://ncbi.nlm.nih.gov/)
As the application of information technology to biology, bioinformatics pervades the whole of biology, including genetics, biochemistry, ecology and medicine. However, much of the publicity and emphasis which bioinformatics has received in the last few years has been on DNA and protein sequence analysis. Given the large amount of sequence data available and the rate at which it is growing, this is where the need for computer analysis has been felt the most. DNA and protein sequences are particularly amenable to computer analysis, since they can be represented by strings of letters, which computers handle very well. A DNA sequence is a string written in an alphabet of 4 letters (A, C, G and T), and a protein sequence is a string written in an alphabet of 20 letters, each of which represents an amino acid.
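Because a sequence is just a string, many basic analyses reduce to ordinary string operations. The following is a minimal sketch of that idea; the function names and the example sequence are invented for illustration and are not part of the lecture.

```python
# Illustrative sketch: treating a DNA sequence as a plain string.
# The example sequence is made up, not taken from any database.
DNA_ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_dna(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the letters A, C, G, T."""
    return bool(seq) and set(seq.upper()) <= DNA_ALPHABET


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse, to read the opposite strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases in a non-empty sequence that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


example = "ATGCGTAC"
print(is_dna(example), reverse_complement(example), gc_content(example))
```

Protein sequences can be handled the same way, with a 20-letter alphabet in place of the 4-letter one.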
The next part of the lecture uses flowcharts to outline a range of procedures commonly used in computer-assisted biomolecular sequence analysis. This rather complicated flowchart summarizes this whole section of the lecture. The flowchart will be divided into four sections: (1) sequence entry (getting the sequence into the computer); (2) nucleotide sequence analysis; (3) protein sequence analysis; (4) multiple sequence analysis (working with multiple sequence alignments). Each step of the flowchart will be examined in turn.
Multiple sequence alignments can therefore be used as input to create phylogenetic trees representing possible evolutionary relationships. The principle is that the more closely related two species are, the more similar their homologous sequences will be (in general - there are many exceptions). For example, according to the above tree, B. subtilis and B. cereus are more closely related to each other than to C. botulinum, C. cadavers, C. butyricum or E. coli. This tree was created from an alignment of the 16S ribosomal RNA sequences from the various bacteria. Further reading: molecular phylogeny is a very large field in itself, with a lot of associated literature. A good introduction to the field can be found in: Swofford, Olsen, Waddell and Hillis (1996) “Phylogenetic inference” in Molecular Systematics (2nd ed), DM Hillis, C Moritz and BK Mable eds. Sinauer Associates, Inc., Sunderland MA, USA
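The "more related means more similar" principle can be made concrete by computing pairwise distances from an alignment, which is the usual first step of distance-based tree building. The sketch below uses the simple proportion of differing sites (p-distance). The three aligned sequences are hypothetical toy data, not real 16S rRNA, and this is not how the Phylip package is implemented.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned, ungapped columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(alignment: dict) -> dict:
    """Pairwise p-distances for every pair of names in the alignment."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(alignment, 2)
    }


# Hypothetical toy alignment (invented for illustration).
toy_alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAACCTTT",
}

table = distance_table(toy_alignment)
closest = min(table, key=table.get)
print(table)
print("closest pair:", closest)
```

A tree-building method would then group the closest pair first, much as B. subtilis and B. cereus are grouped in the tree on the slide.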
PCR planning programs let the user specify criteria such as primer length, melting temperature, GC content etc...
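As a rough sketch of how such criteria might be checked, the code below tests a candidate primer against length, GC content and melting temperature limits. The melting temperature uses the Wallace rule of thumb (2 °C per A or T, 4 °C per G or C), which is only a crude estimate for short oligonucleotides; real PCR planning programs use more refined models. The threshold values and the primer are hypothetical.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of bases in a non-empty primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primer_ok(primer: str,
              min_len: int = 18, max_len: int = 25,    # hypothetical limits
              min_tm: int = 50, max_tm: int = 65,      # hypothetical limits
              min_gc: float = 0.4, max_gc: float = 0.6) -> bool:
    """Check a primer against user-specified length, Tm and GC criteria."""
    return (min_len <= len(primer) <= max_len
            and min_tm <= wallace_tm(primer) <= max_tm
            and min_gc <= gc_fraction(primer) <= max_gc)


candidate = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
print(len(candidate), wallace_tm(candidate), gc_fraction(candidate),
      primer_ok(candidate))
```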
This type of display is produced by the program mapplot, part of the GCG package. It lists the restriction enzymes which cut a particular sequence (together with their recognition sequences) and creates a graphical representation of the sequence with the cutting sites marked along a line representing the sequence. This type of image is useful for finding suitable restriction enzymes for subcloning a particular sequence fragment, or for producing a distinctive restriction pattern for in vitro diagnostic procedures. Labels in the figure: enzyme name, recognition sequence, cutting sites.
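The search behind such a display can be sketched as a simple scan for each enzyme's recognition sequence. The code below is an illustration only and is not how mapplot works internally. The three recognition sequences are the standard ones for these enzymes; the input sequence is made up. Positions reported are where each recognition site starts (1-based), not the exact position of the cut within the site.

```python
# Recognition sequences for three common restriction enzymes.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(seq: str, enzymes: dict = ENZYMES) -> dict:
    """Map each enzyme that cuts seq to the 1-based start of every site."""
    seq = seq.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)        # convert to 1-based
            start = seq.find(site, start + 1)  # allow overlapping sites
        if positions:
            hits[name] = positions
    return hits


toy_sequence = "TTGAATTCAAGGATCCTTGAATTCAA"  # invented for illustration
sites = find_sites(toy_sequence)
for enzyme, positions in sites.items():
    print(enzyme, ENZYMES[enzyme], positions)
```

Enzymes that do not cut the sequence (HindIII here) are left out of the result, which is often exactly the information needed when choosing an enzyme for subcloning.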
This transfer RNA cloverleaf structure was predicted for a tRNA sequence using Michael Zuker’s program mfold, which has been incorporated in the GCG package. Further reading: M. Zuker, D.H. Mathews & D.H. Turner (1999) “Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide” in RNA Biochemistry and Biotechnology, J. Barciszewski & B.F.C. Clark, eds., NATO ASI Series, Kluwer Academic Publishers. Also available online at http://www.ibc.wustl.edu/~zuker/seqanal/
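To give a feel for how secondary structure can be predicted from sequence alone, the sketch below uses a much simpler technique than mfold: Nussinov-style dynamic programming, which just maximizes the number of nested base pairs. mfold instead minimizes a thermodynamic free energy, so this is a stand-in for illustration, not the program's algorithm. The minimum hairpin loop size of 3 and the test sequence are assumptions made for the example.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least min_loop
    unpaired bases enclosed by every hairpin (Nussinov-style recursion)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is left unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


toy_rna = "GGGAAAUCC"  # made-up hairpin: a stem closing an AAA loop
print(max_base_pairs(toy_rna))
```

Recovering the actual pairs (rather than just their number) would require a traceback through the table, omitted here for brevity.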
There are several approaches to building a 3-dimensional model for a protein. Homology modeling uses sequence similarity to map a sequence onto the known structure of a similar sequence (for example, using BLAST to search the PDB database). Profiling involves converting known structures into 3D profiles, where the residue preference for each position is classified according to secondary structure (helix, strand, coil) and hydrophobicity/accessibility (exposed, partially exposed, buried). The query sequence can then be mapped onto a library of 3D profiles and the best matching profiles are selected. Threading also involves mapping a sequence onto a library of structures, but only structural information is used: rather than residue-preference profiles, pseudo-potential energy functions are used to evaluate residue-residue interactions. The query sequence is “threaded” through the various potential structures in the library, and the folds yielding the lowest interaction energy when the sequence is mapped onto them are selected. For example, a fold which brings two residues of opposite charge close together will be considered a better fit than a fold which brings together two residues of the same charge, or two large residues which would cause a steric clash. (Slide and notes courtesy of Dr Shoba Ranganathan, Australian Genomic Information Centre)
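The threading idea can be illustrated with a deliberately tiny scoring scheme. In the sketch below, each candidate fold is reduced to a list of residue positions that end up in contact, and the pseudo-energy only looks at charge: opposite charges in contact lower the energy, like charges raise it. The energy values, the fold library and the sequence are all hypothetical, and steric clashes are ignored; real threading potentials are far richer.

```python
# Toy charges: Asp/Glu negative, Lys/Arg positive, everything else neutral
# (a simplification made for this example).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def contact_energy(a: str, b: str) -> float:
    """-1 for opposite charges in contact, +1 for like charges, else 0."""
    return float(CHARGE.get(a, 0) * CHARGE.get(b, 0))


def thread(seq: str, folds: dict) -> tuple:
    """Score seq on every fold and return (best_fold_name, all_scores).

    Each fold is a list of (i, j) index pairs that are in contact when
    the sequence is mapped onto that fold.
    """
    scores = {
        name: sum(contact_energy(seq[i], seq[j]) for i, j in contacts)
        for name, contacts in folds.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical fold library and query sequence.
fold_library = {
    "fold_1": [(0, 2), (2, 4)],  # brings K-E and E-K together
    "fold_2": [(0, 4), (1, 3)],  # brings K-K and A-A together
}
query = "KAEAK"

best_fold, scores = thread(query, fold_library)
print(best_fold, scores)
```

The fold with the lowest total pseudo-energy is selected, mirroring the opposite-charge example in the notes above.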
Large-scale sequencing projects make use of automated sequencing machines connected to a computer. Because the sequencing machines are typically limited to reads of 300-600 nucleotides, it is often necessary to break down large sequences into fragments, sequence these fragments, then reconstruct the original complete sequence by searching for regions in common between the gel readings, using specialized software. This picture shows some windows from gap4, a sequencing project management program which is part of the Staden package. This type of software helps in the management of sequencing projects not only by assembling gel readings but also by searching for and removing vector sequences, repeat sequences and poor-quality sequence regions, which can cause problems when assembling the fragments. Further reading: Staden, R., Beal, K.F. and Bonfield, J.K. (1998) The Staden Package, Computer Methods in Molecular Biology, eds. Stephen Misener and Steve Krawetz. The Humana Press Inc., Totowa, NJ 07512. Also available at: http://www.mrc-lmb.cam.ac.uk/pubseq/methods_in_mol_biol/index.html
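The reconstruction step can be sketched as a greedy overlap merge: repeatedly find the two readings with the longest suffix-to-prefix overlap and join them. This is a simplified illustration, not the method used by gap4. It assumes error-free readings from a single strand, does not handle readings wholly contained in others, and can be misled by repeats; the reads and the minimum overlap of 3 are invented for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b
    (0 if shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Merge reads pairwise by largest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join; return separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]
    return contigs


# Made-up readings covering a short made-up sequence.
toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
assembled = greedy_assemble(toy_reads)
print(assembled)
```

Vector removal and quality trimming, mentioned above, would be applied to the readings before a step like this.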