Project report-on-bio-informatics


Published in: Health & Medicine, Technology
  • As a result, the last few years have seen an explosion in the field of bioinformatics, a new field of study which combines methods from computer science and information technology to analyze biological information. In its purest definition, bioinformatics is the application of information technology to biology.
  • Rather than sequencing isolated genes, more and more research groups and companies are now focussing on sequencing whole genomes from organisms of medical, commercial or scientific importance. The first bacterial genome to be completely sequenced was that of Haemophilus influenzae, in 1995. In 1996, the first complete eukaryotic genome, that of baker’s yeast (Saccharomyces cerevisiae), was published. New complete genomes are now being published every month, and human genome projects, both publicly and privately funded, are well on the way to completion.
  • The new genome technologies coupled with bioinformatics promise a revolution in almost all fields of the life sciences and in society. To take examples from the medical sciences alone: in the pharmaceutical industry, these methods have been embraced as a shortcut to the discovery of better drugs. Knowledge of a protein’s structure, for instance, can considerably shorten the time taken to develop specific inhibitors of that protein for therapeutic use. The study of how genome variation affects drug effectiveness (pharmacogenomics) is still in its infancy, but promises to deliver more effective and specific therapeutic drugs tailored to the individual’s genetic make-up. Knowledge of the genome also facilitates the targeting of genetic diseases by drug or gene therapy. Genome analysis also provides the framework for the study of gene and protein expression using DNA microarray technology or 2-dimensional gel electrophoresis, with broad-ranging applications. These techniques can be applied not only in the medical sciences, but also in agriculture, biotechnology, etc.
  • The last 10 years have seen recombinant DNA techniques pervade the whole of biology and biology-related fields. The use of plasmids, restriction enzymes, DNA sequencing methods and, more recently, PCR has allowed the cloning and characterization of many genes and of their protein products. The growth in DNA sequence data available to researchers is phenomenal. For example, GenBank, a major database where molecular biologists deposit the DNA sequences they obtain and make them available, doubles in size approximately every 14 months. At the beginning of 1999, GenBank contained over 3 million sequence records and was growing at a rate in excess of a million nucleotides deposited per day. GenBank is shown here as an example, but other sequence databases would be expected to grow at similar rates. Source: GenBank release notes, National Center for Biotechnology Information (
  • As the application of information technology to biology, bioinformatics pervades the whole of biology, including genetics, biochemistry, ecology and medicine. However, much of the publicity and emphasis which bioinformatics has received in the last few years has been on DNA and protein sequence analysis. Given the large amount of sequence data available and the rate at which it is growing, this is where the need for computer analysis has been felt the most. DNA and protein sequences are particularly amenable to computer analysis, since they can be represented as strings of letters, which computers handle very well. A DNA sequence is a string over an alphabet of 4 letters (A, C, G and T), and a protein sequence is a string over an alphabet of 20 letters, each of which represents an amino acid.
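To make the "sequences are strings" point concrete, here is a minimal Python sketch. The function names are illustrative assumptions, not part of any package mentioned in the lecture. It checks a sequence against its alphabet and computes base composition and GC content.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")
# One-letter codes of the 20 standard amino acids.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence: str, alphabet: set) -> bool:
    """Return True if every character of the sequence belongs to the alphabet."""
    return set(sequence.upper()) <= alphabet


def base_composition(dna: str) -> dict:
    """Count how often each of the four bases occurs."""
    counts = Counter(dna.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


fragment = "caagtcttct"  # a short DNA fragment used only for illustration
print(is_valid(fragment, DNA_ALPHABET))
print(base_composition(fragment))
print(gc_content(fragment))
```

Because both kinds of sequence are plain strings, the same `is_valid` helper works for proteins simply by passing `PROTEIN_ALPHABET`.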
  • The next part of the lecture uses flowcharts to outline a range of procedures commonly used in computer-assisted biomolecular sequence analysis. This rather complicated flowchart summarizes this whole section of the lecture. The flowchart will be divided into four sections: (1) sequence entry (getting the sequence into the computer); (2) nucleotide sequence analysis; (3) protein sequence analysis; (4) multiple sequence analysis (working with multiple sequence alignments). Each step of the flowchart will be examined in turn.
  •   1 caagtcttct ttctccaagg aggatatgaa gcgttttcgg cttcctgccc tgagctgtgc
     61 agcaaacagt ccacccccat ggggctcagc ctcccgctga gtactagtgt gcctgacagt
    121 gcagaatccg gatgcagctc ctgtagcacc cctctctacg accagggggg cccagtggag
    181 atcctgtcct tcctgtacct gggcagtgct taccatgctt cccggaaaga tatgctcgac
    241 gccttgggta tcactgcttt gatcaacgtc tcggccaatt gtcctaacaa ctttgagggt
    301 cactaccagt acaagagcat ccctgtggag gacaaccaca aggcagacat cagctcctgg
    361 ttcaacgagg cgattgactt tatagactcc atcaaggatg ctggaggaag ggtgtttgtg
    421 cactgccagg ccggcatctc caggtcagcc accatctgcc ttgcttacct catgaggact
    481 aaccgagtga agctggacga ggcctttgag tttgtgaagc a
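A numbered listing like the one above has to be turned into a plain string before it can be analysed. The sketch below shows one way to do that in Python. It assumes the listing contains only position numbers, whitespace and the letters a, c, g, t or n, and the function name is hypothetical.

```python
import re


def parse_numbered_sequence(text: str) -> str:
    """Strip position numbers and whitespace from a numbered sequence listing.

    Only runs of nucleotide letters (a, c, g, t, n) are kept, so the
    leading coordinates and the spaces between 10-base blocks are dropped.
    """
    return "".join(re.findall(r"[acgtnACGTN]+", text)).lower()


# The first 20 bases of each of the first two lines of the listing.
listing = """
  1 caagtcttct ttctccaagg
 61 agcaaacagt ccacccccat
"""
sequence = parse_numbered_sequence(listing)
print(sequence)
print(len(sequence))
```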
  • Multiple sequence alignments can therefore be used as input to create phylogenetic trees representing possible evolutionary relationships. The principle is that the more closely related two species are, the more similar their homologous sequences will be (in general; there are many exceptions). For example, according to the above tree, B. subtilis and B. cereus are more closely related to each other than to C. botulinum, C. cadaveris, C. butyricum or E. coli. This tree was created from an alignment of the 16S ribosomal RNA sequences from the various bacteria. Further reading: molecular phylogeny is a very large field in itself, with a lot of associated literature. A good introduction to the field can be found in: Swofford, Olsen, Waddell and Hillis (1996) “Phylogenetic inference” in Molecular Systematics (2nd ed.), DM Hillis, C Moritz and BK Mable eds., Sinauer Associates, Inc., Sunderland MA, USA.
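The principle that closer relatives have more similar sequences can be illustrated with the simplest distance measure, the proportion of differing sites (p-distance) between aligned sequences. The Python sketch below uses a made-up toy alignment, not real 16S rRNA data, and is not the method used by any particular phylogeny package. Tree-building programs typically start from a distance matrix of this kind or from the alignment itself.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0


def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances for every pair of named sequences."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (names and sequences are invented).
toy = {
    "sp_A": "ACGTACGTAC",
    "sp_B": "ACGTACGTAT",
    "sp_C": "ACGAACCTTT",
}
for pair, dist in distance_matrix(toy).items():
    print(pair, dist)
```

In this toy example sp_A and sp_B have the smallest distance, so a distance-based method would join them first.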
  • PCR planning programs let the user specify criteria such as primer length, melting temperature, GC content, etc.
  • This type of display is produced by the program mapplot, part of the GCG package. It lists the restriction enzymes which cut a particular sequence (together with their recognition sequences) and creates a graphical representation of the sequence with the cutting sites marked along a line representing the sequence. This type of image is useful for finding suitable restriction enzymes for subcloning a particular sequence fragment, or for producing a distinctive restriction pattern for in vitro diagnostic procedures. Labels on the figure: enzyme name, recognition sequence, cutting sites.
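The core of a restriction map is a pattern search: each recognition sequence, possibly containing ambiguity codes such as r (A or G) or y (C or T), is located in the target sequence. The following Python sketch shows that search using regular expressions. It is not how mapplot is implemented, it reports site positions rather than exact cut positions, and it scans only the given strand (sufficient for palindromic sites such as those of EcoRI, AluI and ApoI).

```python
import re

# IUPAC nucleotide codes needed for the recognition sequences used here.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "D": "[AGT]", "H": "[ACT]", "N": "[ACGT]",
}


def site_positions(sequence: str, recognition: str) -> list:
    """Return 1-based start positions of every occurrence of a recognition site.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = "".join(IUPAC[base] for base in recognition.upper())
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", sequence.upper())]


# Hypothetical test sequence; the recognition sequences are standard ones.
dna = "ttGAATTCaaAGCTggAAATTTcc"
enzymes = {"EcoRI": "GAATTC", "AluI": "AGCT", "ApoI": "RAATTY"}
for name, site in enzymes.items():
    print(name, site, site_positions(dna, site))
```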
  • This transfer RNA cloverleaf structure was predicted for a tRNA sequence using Michael Zuker’s program mfold, which has been incorporated in the GCG package. Further reading: M. Zuker, D.H. Mathews & D.H. Turner (1999) “Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide”. In RNA Biochemistry and Biotechnology, J. Barciszewski & B.F.C. Clark, eds., NATO ASI Series, Kluwer Academic Publishers. Also available online at
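mfold predicts structure by minimizing free energy, which is too involved for a short example. As a deliberately simplified stand-in, the Python sketch below uses the classic Nussinov-style dynamic programming approach, which merely maximizes the number of nested base pairs. The minimum hairpin loop size of 3 and the inclusion of G-U wobble pairs are assumptions of this sketch.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov-style recursion).

    best[i][j] holds the best pair count for the subsequence rna[i..j].
    A pair (k, j) is allowed only if at least `min_loop` unpaired bases
    lie between k and j.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is left unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

A real predictor would additionally score stacking, loops and bulges with measured thermodynamic parameters and would report the structure itself, not only the pair count.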
  • There are several approaches to building a three-dimensional model for a protein. Homology modeling uses sequence similarity to map a sequence onto the known structure of a similar sequence (found, for example, by using BLAST to search the PDB database). Profiling involves converting known structures into 3D profiles in which the residue preference for each position is classified according to secondary structure (helix, strand, coil) and hydrophobicity/accessibility (exposed, partially exposed, buried). The query sequence can then be mapped onto a library of 3D profiles and the best matching profiles are selected. Threading also involves mapping a sequence onto a library of structures, but it relies only on structural information rather than on sequence similarity: pseudo-potential energy functions are used to evaluate residue-residue interactions. The query sequence is “threaded” through the various potential structures in the library, and the folds yielding the lowest interaction energy when the sequence is mapped onto them are selected. For example, a fold which brings two residues of opposite charge close together will be considered a better fit than a fold which brings together two residues of the same charge, or two large residues which would cause a steric clash. (Slide and notes courtesy of Dr Shoba Ranganathan, Australian Genomic Information Centre)
  • Large-scale sequencing projects make use of automated sequencing machines connected to a computer. Because the sequencing machines are typically limited to reads of 300-600 nucleotides, it is often necessary to break down large sequences into fragments, sequence these fragments, then reconstruct the original complete sequence by searching for regions in common between the gel readings, using specialized software. This picture shows some windows from gap4, a sequencing project management program which is part of the Staden package. This type of software helps in the management of sequencing projects not only by assembling gel readings but also by finding and removing vector sequences, repeat sequences and poor-quality sequence regions, which can cause problems when assembling the fragments. Further reading: Staden, R., Beal, K.F. and Bonfield, J.K. (1998) The Staden Package, Computer Methods in Molecular Biology, Eds Stephen Misener and Steve Krawetz. The Humana Press Inc., Totowa, NJ 07512. Also available at:
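The idea of rebuilding a sequence from overlapping readings can be sketched with a naive greedy merge: repeatedly join the two reads with the longest exact suffix/prefix overlap. This Python example is only an illustration under strong assumptions (error-free reads, all on the same strand, no repeats). It is not the algorithm used by gap4, and the reads are invented.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap: int = 3) -> list:
    """Merge reads greedily by largest overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # remaining reads do not overlap; return them as separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Hypothetical reads covering the made-up sequence ATGGCGTACGTTAGC.
print(greedy_assemble(["GTTAGC", "ATGGCGTAC", "CGTACGTTA"]))
```

Real assemblers must also tolerate sequencing errors, consider reverse-complemented reads and handle repeats, which is why vector, repeat and low-quality regions are screened out beforehand.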

    1. 1. Bioinformatics – A Brief Overview
    2. 2. What is bioinformatics?
        • Application of information technology to the storage, management and analysis of biological information
        • Facilitated by the use of computers
    3. 3. Publicly available genomes (April 1998)
        • COMPLETE/PUBLIC
            • Aquifex aeolicus
            • Pyrococcus horikoshii
            • Bacillus subtilis
            • Treponema pallidum
            • Borrelia burgdorferi
            • Helicobacter pylori
            • Escherichia coli
            • Mycoplasma pneumoniae
            • Saccharomyces cerevisiae
            • Mycoplasma genitalium
            • Haemophilus influenzae
        • COMPLETE/PENDING PUBLICATION
            • Rickettsia prowazekii
            • Pseudomonas aeruginosa
            • Pyrococcus abyssi
            • Bacillus sp. C-125
            • Ureaplasma urealyticum
            • Pyrobaculum aerophilum
        • ALMOST/PUBLIC
            • Pyrococcus furiosus
            • Mycobacterium tuberculosis H37Rv
            • Mycobacterium tuberculosis CSU93
            • Neisseria gonorrhoeae
            • Neisseria meningitidis
            • Streptococcus pyogenes
    4. 4. Promises of genomics and bioinformatics
        • Medicine
            • Knowledge of protein structure facilitates drug design
            • Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up
            • Genome analysis allows the targeting of genetic diseases
            • The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated
        • The same techniques can be applied to biotechnology, crop and livestock improvement, etc.
    5. 5. The need for bioinformaticians. The number of entries in databases of gene sequences is increasing exponentially. Bioinformaticians are needed to understand and use this information. [Figure: GenBank growth, 1982 to 1999]
    6. 6. What can be done using bioinformatics?
        • Sequence analysis
            • Geneticists/molecular biologists analyse genome sequence information to understand disease processes
        • Molecular modeling
            • Crystallographers/biochemists design drugs using computer-aided tools
        • Phylogeny/evolution
            • Geneticists obtain information about the evolution of organisms by looking for similarities in gene sequences
        • Ecology and population studies
            • Bioinformatics is used to handle large amounts of data obtained in population studies
        • Medical informatics
            • Personalised medicine
    7. 7. NCBI (National Center for Biotechnology Information)
        • Entrez Protein
        • DNA
        • EMBL, DDBJ, GenBank
        • SRS GENOME
        • PubMed Annotation
        • Medline
        • PIR
        • Swiss-Prot
        • PDB
    8. 8. What can be discovered about a gene by a database search?
        • A little or a lot, depending on the gene
            • Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
            • Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
            • Structural information: associated protein structures, fold types, structural domains
            • Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
            • Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
    9. 9. Databases
        • Three types of databases
            • Primary: sequence database
            • Secondary: annotation
            • Tertiary: structure database
        • Two other types
            • DNA databases: GenBank, DDBJ, EMBL
            • Protein databases: PIR, Swiss-Prot, MIPS
    10. 10. Biological databanks and databases
        • Very fast growth of biological data
        • Diversity of biological data:
            • primary sequences
            • 3D structures
            • functional data
        • Database entry usually required for publication
            • Sequences
            • Structures
        • Database entry may replace primary publication
            • genomic approaches
    11. 11. PubMed
    12. 13. Sequence analysis: overview
        • Sequence entry
            • Manual sequence entry
            • Sequence database browsing
            • Sequencing project management
        • Nucleotide sequence analysis
            • Nucleotide sequence file
            • Search databases for similar sequences
            • Sequence comparison
            • Design further experiments
                • Restriction mapping
                • PCR planning
            • Search for protein coding regions (branches: coding, non-coding)
            • Translate into protein
            • Search for known motifs
            • RNA structure prediction
        • Protein sequence analysis
            • Protein sequence file
            • Search databases for similar sequences
            • Sequence comparison
            • Search for known motifs
            • Predict secondary structure
            • Predict tertiary structure
        • Multiple sequence analysis
            • Create a multiple sequence alignment
            • Edit the alignment
            • Format the alignment for publication
            • Molecular phylogeny
            • Protein family analysis
    13. 14. Sequence comparison
        • Pairwise sequence alignment
            • BLAST: BlastP, BlastN, nBlastP
        • Multiple sequence alignment
            • ClustalW, ClustalX
        • User interface
            • BioEdit
            • Biology Workbench
            • CLC Workbench
    14. 15. Click on:
    15. 16. Database Search
    16. 18. Multiple Sequence Alignment: Approaches
        • Optimal global alignments: dynamic programming
            • Generalization of Needleman-Wunsch
            • Find the alignment that maximizes a score function
            • Computationally expensive: time grows as the product of the sequence lengths
        • Global progressive alignments: match closely related sequences first using a guide tree
        • Global iterative alignments: multiple re-building attempts to find the best alignment
        • Local alignments
            • Profiles, Blocks, Patterns
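The dynamic programming approach is easiest to see in the pairwise case that the multiple-alignment method generalizes. The Python sketch below implements Needleman-Wunsch global alignment for two sequences with a simple scoring scheme. The match, mismatch and gap values are arbitrary assumptions, and real programs use substitution matrices and affine gap penalties. The two nested loops fill a table of size len(a) × len(b), which is the "product of sequence lengths" cost noted above.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

Extending the same table to N sequences gives an N-dimensional table, which is why progressive and iterative heuristics are used in practice.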
    17. 19. CLUSTALW MSA
    18. 20. Phylogeny inference: Analysis of sequences allows evolutionary relationships to be determined. [Figure: phylogenetic tree constructed using the Phylip package; taxa shown: E. coli, C. botulinum, C. cadaveris, C. butyricum, B. subtilis, B. cereus]
    19. 21. Gene prediction software
        • Similarity-based or comparative
            • BLAST
            • SGP2 (extension of GeneID)
        • Ab initio = “from the beginning”
            • GeneID
            • GENSCAN
            • GeneMark
        • Combined “evidence-based”
            • GeneSeqer (Brendel et al., ISU)
        • BEST: GENSCAN, GeneMark.hmm, GeneSeqer, but this depends on the organism and the specific task
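The programs listed above use statistical models far beyond what fits in a short example. The simplest signal an ab initio method can exploit is an open reading frame (ORF): a start codon followed in the same frame by a stop codon. The Python sketch below is a naive ORF scan of the forward strand only. It is an illustration of that one signal, not the method of GeneID, GENSCAN or GeneMark, and the minimum length threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Find ORFs (ATG ... stop) in the three forward reading frames.

    Returns a list of (start, end, sequence) tuples with 0-based start
    and exclusive end; the stop codon is included in the sequence.
    `min_codons` counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF.
print(find_orfs("ccATGAAACCCGGGTAAtt"))
```

A fuller version would also scan the reverse complement and, for eukaryotes, would have to deal with introns, which is where the statistical gene models come in.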
    20. 22. PCR Primer Design:
        • Oligonucleotides for use in the polymerase chain reaction can be designed using computer-based programs
        OPTIMAL primer length --> 20
        MINIMUM primer length --> 18
        MAXIMUM primer length --> 22
        OPTIMAL primer melting temperature --> 60.000
        MINIMUM acceptable melting temp --> 57.000
        MAXIMUM acceptable melting temp --> 63.000
        MINIMUM acceptable primer GC% --> 20.000
        MAXIMUM acceptable primer GC% --> 80.000
        Salt concentration (mM) --> 50.000
        DNA concentration (nM) --> 50.000
        MAX no. unknown bases (Ns) allowed --> 0
        MAX acceptable self-complementarity --> 12
        MAXIMUM 3' end self-complementarity --> 8
        GC clamp how many 3' bases --> 0
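A few of the criteria above can be checked with very little code. The Python sketch below filters a candidate primer by length, GC percentage and melting temperature, using the limits from the slide as defaults. The melting temperature is estimated with the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is an assumption of this sketch. A program that asks for salt and DNA concentrations, as above, presumably uses a more detailed model, and self-complementarity checks are omitted.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def primer_ok(primer: str, min_len: int = 18, max_len: int = 22,
              min_tm: float = 57.0, max_tm: float = 63.0,
              min_gc: float = 20.0, max_gc: float = 80.0,
              max_unknown: int = 0) -> bool:
    """Check a primer against length, Tm, GC% and unknown-base limits."""
    if not min_len <= len(primer) <= max_len:
        return False
    if primer.upper().count("N") > max_unknown:
        return False
    if not min_gc <= gc_percent(primer) <= max_gc:
        return False
    return min_tm <= wallace_tm(primer) <= max_tm


# Hypothetical candidate primers.
for candidate in ["ATGCATGCATGCATGCATGC", "ATATATATATATATATATAT", "GCGCGCGCGCGCGCGCGCGC"]:
    print(candidate, gc_percent(candidate), wallace_tm(candidate), primer_ok(candidate))
```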
    21. 23. Restriction mapping: Genes can be analysed to detect gene sequences that can be cleaved with restriction enzymes
        Enzyme      Sites  Recognition sequence
        AceIII      1      CAGCTCnnnnnnn’nnn...
        AluI        2      AG’CT
        AlwI        1      GGATCnnnn’n_
        ApoI        2      r’AATT_y
        BanII       1      G_rGCy’C
        BfaI        2      C’TA_G
        BfiI        1      ACTGGG
        BsaXI       1      ACnnnnnCTCC
        BsgI        1      GTGCAGnnnnnnnnnnn...
        BsiHKAI     1      G_wGCw’C
        Bsp1286I    1      G_dGCh’C
        BsrI        2      ACTG_Gn’
        BsrFI       1      r’CCGG_y
        CjeI        2      CCAnnnnnnGTnnnnnn...
        CviJI       4      rG’Cy
        CviRI       1      TG’CA
        DdeI        2      C’TnA_G
        DpnI        2      GA’TC
        EcoRI       1      G’AATT_C
        HinfI       2      G’AnT_C
        MaeIII      1      ’GTnAC_
        MnlI        1      CCTCnnnnnn_n’
        MseI        2      T’TA_A
        MspI        1      C’CG_G
        NdeI        1      CA’TA_TG
        Sau3AI      2      ’GATC_
        SstI        1      G_AGCT’C
        TfiI        2      G’AwT_C
        Tsp45I      1      ’GTsAC_
        Tsp509I     3      ’AATT_
        TspRI       1      CAGTGnn’
        [Figure: cutting sites plotted along the sequence, scale 50 to 250]
    22. 24. RNA structure prediction: Structural features of RNA can be predicted. [Figure: predicted cloverleaf secondary structure of a tRNA sequence]
    23. 25. Protein structure: the 3-D structure of proteins is used to understand protein function and design new drugs
    24. 26. Gene sequencing: Automated chemical sequencing methods allow rapid generation of large data banks of gene sequences
    25. 27. Structural Bioinformatics
    26. 28. Structural Bioinformatics
        • Prediction of structure from sequence
            • secondary structure
            • homology modelling, threading
            • ab initio 3D prediction
        • Analysis of 3D structure
            • structure comparison/alignment
            • prediction of function from structure
            • molecular mechanics/molecular dynamics
            • prediction of molecular interactions, docking
        • Structure databases (RCSB)
    27. 30. Bioinformatics key areas: organisation of knowledge (sequences, structures, functional data)
        • e.g. homology searches
    28. 31. Molecular modeling
        • Homology model
        • Comparative modeling
        • MODELLER
        • Swiss-PdbViewer
        • GenTHREADER
        • MOLMOD
    29. 32. Molecular visualization
        • RasMol
        • Cn3D
        • Jmol
        • PyMOL
    31. 34. Tertiary structure prediction: CPHmodels
    32. 35. Active Site Prediction