An Introduction To


Published in: Technology

  1. 1. <ul><li>An introduction to </li></ul><ul><li>BIOinformatics AlgoRITHMS </li></ul><ul><li>Dr. S. Parthasarathy </li></ul><ul><li>National Institute of Technology </li></ul><ul><li>Tiruchirappalli – 620 015 </li></ul><ul><li>(E-mail: </li></ul>
  2. 2. Plan <ul><li>Introduction </li></ul><ul><li>Overview of Bioinformatics </li></ul><ul><li>Bioinformatics Algorithms </li></ul><ul><ul><li>Pairwise Sequence Alignment </li></ul></ul><ul><ul><li>Database Search </li></ul></ul><ul><li>PickFold – Sequence to Protein Fold </li></ul><ul><li>Future Perspective of Biological Crisis Management – Bioinformatics point of view </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction Biological Data <ul><li>On 26 June 2000: announcement of the completion of the draft of the ‘Human Genome’ </li></ul><ul><li>The human genome contains \(3.2 \times 10^{9}\) bps </li></ul><ul><li>Units of (genome) sequence length </li></ul><ul><ul><li>bps (<b>b</b>ase <b>p</b>air<b>s</b>) </li></ul></ul><ul><ul><li>Mbps (<b>M</b>ega <b>b</b>ase <b>p</b>air<b>s</b>) = \(10^{6}\) bps </li></ul></ul><ul><ul><li>Gbps (<b>G</b>iga <b>b</b>ase <b>p</b>air<b>s</b>) = \(10^{9}\) bps </li></ul></ul><ul><ul><li>huge (<b>hu</b>man <b>g</b>enome <b>e</b>quivalent) = 3.2 Gbps </li></ul></ul>
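The unit definitions on this slide can be turned into a small conversion helper. The sketch below is only illustrative: the constant and function names are assumptions, and the 16 Gbps figure is the GenBank size quoted on the "Biological Data explosion" slide.

```python
# Unit definitions from the slide, expressed in base pairs (bps).
BPS_PER_MBPS = 10**6
BPS_PER_GBPS = 10**9
BPS_PER_HUGE = 3_200_000_000  # one human genome equivalent = 3.2 Gbps


def to_huge(length_bps: float) -> float:
    """Convert a sequence length in base pairs to human genome equivalents."""
    return length_bps / BPS_PER_HUGE


# GenBank size quoted in this talk: 16 Gbps.
genbank_bps = 16 * BPS_PER_GBPS
print(f"GenBank holds about {to_huge(genbank_bps):.1f} huge")
```

Under these definitions, the quoted GenBank size corresponds to five human genome equivalents.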
  4. 4. Biological Data Explosion
  5. 5. Biological Data explosion <ul><li>GenBank, NCBI, USA --- 16 Gbps </li></ul><ul><ul><li>GenBank, National Center for Biotechnology Information, USA </li></ul></ul><ul><li>PDB, RCSB, USA --- 16,000 structures </li></ul><ul><ul><li>PDB, Research Collaboratory for Structural Bioinformatics, USA </li></ul></ul><ul><li>QUALITY - HIGH </li></ul><ul><ul><li>Experimental error in modern genomic sequencing is extremely low </li></ul></ul><ul><li>QUANTITY - HUGE </li></ul><ul><ul><li>With genomic sequencing and recombinant DNA technology, the size of sequence databases is increasing very rapidly. </li></ul></ul>
  6. 6. Bioinformatics - Definition <ul><li>Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. </li></ul><ul><li>We use computer programs to make inferences from biological data, to make connections among them and to derive useful and interesting predictions. </li></ul><ul><li>The marriage of biology and computer science has created a new field called ‘Bioinformatics’. </li></ul><ul><li>Example of such a method (the alignment recurrence covered later in this talk): \( F(i,j) = \max \{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\} \) </li></ul>
  7. 7. Bioinformatic Goals <ul><li>To understand integrative aspects of the biology of organisms, viewed as coherent complex structures </li></ul><ul><li>To interrelate sequences, 3-D structures, interactions and functions of proteins, nucleic acids and their complexes </li></ul><ul><li>To study the evolution of biological systems </li></ul><ul><li>To support applications in agricultural, pharmaceutical and other scientific fields </li></ul>
  8. 8. Biological Systems Overview <ul><li>BIOSPHERE </li></ul><ul><li>SPECIES </li></ul><ul><li>ECOSYSTEMS </li></ul><ul><li>ORGANISMS </li></ul><ul><li>CELLS </li></ul>
  9. 9. Biology Basic Definitions <ul><li>Cell - The building block of living organisms </li></ul><ul><ul><li>Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material is borne on a number of chromosomes consisting of DNA and protein </li></ul></ul><ul><li>Chromosome </li></ul><ul><ul><li>The physical basis of heredity. Deeply staining rod-like structures present within the nuclei of eukaryotes </li></ul></ul><ul><ul><ul><li>Contain DNA and protein arranged in a compact manner </li></ul></ul></ul><ul><ul><ul><li>Replicate identically during cell division </li></ul></ul></ul><ul><ul><ul><li>The same number of chromosomes is present in the cells of a particular species (e.g. human: 22 autosomes plus the X and Y sex chromosomes) </li></ul></ul></ul>
  10. 10. Genome Basic Definitions <ul><li>Gene </li></ul><ul><ul><li>One of the units of inherited material carried on chromosomes. Genes are arranged in a linear fashion along the DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus. </li></ul></ul><ul><li>Genome </li></ul><ul><ul><li>A set of chromosomes inherited from one parent </li></ul></ul><ul><li>DNA (deoxyribonucleic acid) </li></ul><ul><ul><li>made up of FOUR bases </li></ul></ul><ul><ul><li>a t g c – adenine, thymine, guanine, cytosine </li></ul></ul><ul><li>Proteins </li></ul><ul><ul><li>made up of TWENTY different amino acids </li></ul></ul><ul><ul><li>A T G C … – Alanine, Threonine, Glycine, Cysteine, … </li></ul></ul>
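The slide shows the same four letters in both alphabets, which is why a program often cannot tell DNA from protein by letters alone. A minimal sketch of that check follows; the function name `possible_alphabets` is an assumption made for illustration, and the amino acid string is the standard one-letter code set.

```python
DNA_BASES = set("ATGC")                    # adenine, thymine, guanine, cytosine
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the twenty standard amino acids


def possible_alphabets(sequence: str) -> list[str]:
    """Return every alphabet whose letters cover the whole sequence."""
    letters = set(sequence.upper())
    matches = []
    if letters <= DNA_BASES:
        matches.append("DNA")
    if letters <= AMINO_ACIDS:
        matches.append("protein")
    return matches


print(possible_alphabets("atgc"))   # letters are valid in both alphabets
print(possible_alphabets("MKTWY"))  # contains letters that are not DNA bases
```

Because every DNA base letter is also an amino acid code, a short sequence such as `atgc` is ambiguous without extra context.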
  11. 11. Bioinformatics Tasks <ul><li>Sequence Analysis </li></ul><ul><ul><li>Similarity & Homology </li></ul></ul><ul><ul><ul><li>pairwise local/global alignment </li></ul></ul></ul><ul><ul><ul><ul><li>Scoring Matrices - PAM, BLOSUM </li></ul></ul></ul></ul><ul><ul><li>Database Search </li></ul></ul><ul><ul><ul><li>BLAST, FASTA, GCG </li></ul></ul></ul><ul><ul><li>Multiple alignment </li></ul></ul><ul><ul><ul><li>ClustalW, PRINTS, BLOCKS </li></ul></ul></ul><ul><ul><li>Secondary Structure Prediction </li></ul></ul><ul><ul><ul><li>Proteins – α-Helix, β-Sheet, Turn or coil </li></ul></ul></ul><ul><ul><ul><li>Protein Folding </li></ul></ul></ul>
  12. 12. Bioinformatics Tasks <ul><li>Structure analysis </li></ul><ul><ul><li>X-ray crystallography – 3-dimensional coordinates – Structure </li></ul></ul><ul><ul><ul><li>PDB – Protein Data Bank </li></ul></ul></ul><ul><ul><ul><li>RasMol – Molecular Viewing Software </li></ul></ul></ul><ul><li>Protein Structure Databases </li></ul><ul><ul><li>SCOP - Structural Classification Of Proteins </li></ul></ul><ul><ul><li>CATH - Class, Architecture, Topology, Homologous superfamily </li></ul></ul><ul><ul><li>FSSP - Fold classification based on Structure-Structure alignment of Proteins – obtained by DALI (Distance-matrix ALIgnment) </li></ul></ul>
  13. 13. Bioinformatics Tasks <ul><li>Protein Engineering </li></ul><ul><ul><li>Mutations </li></ul></ul><ul><ul><ul><li>Alter a particular amino acid/base for a desired effect </li></ul></ul></ul><ul><ul><li>Site-directed mutagenesis </li></ul></ul><ul><ul><ul><li>Identify the potential sites where alterations can be made </li></ul></ul></ul><ul><li>DNA Bending </li></ul><ul><ul><li>Application to Genomes </li></ul></ul>
  14. 14. Sequence similarity, homology and alignments <ul><li>Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. There exists significant similarity between a new sequence and already known sequences. </li></ul><ul><li>Similarity – Measurement of resemblance and differences, independent of the source of resemblance. </li></ul><ul><li>Homology – The sequences and the organisms in which they occur are descended from a common ancestor. </li></ul><ul><li>If two related sequences are homologous, then we can transfer information about structure and/or function, by homology. </li></ul>
  15. 15. Sequence Comparison <ul><li>Issues </li></ul><ul><ul><li>Types of alignment </li></ul></ul><ul><ul><ul><li>Global – end to end matching (Needleman-Wunsch) </li></ul></ul></ul><ul><ul><ul><li>Local – portions or subsequences matching (Smith-Waterman) </li></ul></ul></ul><ul><ul><li>Scoring system used to rank alignments </li></ul></ul><ul><ul><ul><li>PAM & BLOSUM matrices </li></ul></ul></ul><ul><ul><li>Algorithms used to find optimal (or good) scoring alignments </li></ul></ul><ul><ul><ul><li>Heuristic </li></ul></ul></ul><ul><ul><ul><li>Dynamic Programming </li></ul></ul></ul><ul><ul><ul><li>Hidden Markov Model (HMM) </li></ul></ul></ul><ul><ul><li>Statistical methods used to evaluate the significance of an alignment score </li></ul></ul><ul><ul><ul><li>Z-score, E-value, etc. </li></ul></ul></ul>
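  16. 16. Substitution Matrices <ul><li>PAM (Point Accepted Mutation) </li></ul><ul><li>BLOSUM (BLOcks SUbstitution Matrix) </li></ul><table><tr><th>Relationship</th><th>BLOSUM</th><th>PAM</th></tr><tr><td>Close</td><td>90</td><td>40</td></tr><tr><td>Default</td><td>62</td><td>250</td></tr><tr><td>Distant</td><td>30</td><td>500</td></tr></table>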
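The pairing of matrix numbers with evolutionary distance on this slide can be written as a simple lookup. This is a sketch that assumes the slide's numbers read as BLOSUM 90/62/30 and PAM 40/250/500 for close/default/distant relationships; the names `RECOMMENDED` and `matrix_name` are hypothetical.

```python
# Matrix numbers listed on the slide for each evolutionary distance
# (assumed reading of the slide's table).
RECOMMENDED = {
    "close":   {"BLOSUM": 90, "PAM": 40},
    "default": {"BLOSUM": 62, "PAM": 250},
    "distant": {"BLOSUM": 30, "PAM": 500},
}


def matrix_name(family: str, relationship: str = "default") -> str:
    """Build a matrix name such as 'BLOSUM62' for the given relationship."""
    return f"{family}{RECOMMENDED[relationship][family]}"


print(matrix_name("BLOSUM"))
print(matrix_name("PAM", "distant"))
```

Note that the two families run in opposite directions: higher BLOSUM numbers suit closer sequences, while higher PAM numbers suit more distant ones.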
  17. 17. Types of Algorithms <ul><li>Heuristic </li></ul><ul><ul><li>A heuristic is an algorithm that yields reasonable results even though it is not provably optimal and may lack any performance guarantee. </li></ul></ul><ul><ul><li>In most cases heuristic methods are very fast, but they make additional assumptions and will miss the best match for some sequence pairs. </li></ul></ul><ul><li>Dynamic Programming </li></ul><ul><ul><li>Finds optimal alignments for an additive alignment score by building the solution up from the solutions of smaller subproblems (discussed in the following slides). </li></ul></ul><ul><ul><li>This type of algorithm is guaranteed to find the optimal scoring alignment or set of alignments. </li></ul></ul><ul><li>HMM - Based on probability theory; very versatile. </li></ul>
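To make the speed-versus-sensitivity trade-off concrete, here is a toy heuristic that scores two sequences by the number of short words they share. It is an illustrative assumption, not the method used by BLAST or FASTA, and the names `kmers` and `shared_kmer_score` are hypothetical.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k words in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_score(query: str, target: str, k: int = 3) -> int:
    """Count distinct length-k words present in both sequences."""
    return len(kmers(query, k) & kmers(target, k))


# Fast to compute and works well when long exact stretches are shared.
print(shared_kmer_score("GATTACA", "GATTTACA"))

# These two sequences agree at 6 of 9 positions, yet every 3-letter word
# contains a mismatch, so the heuristic sees no similarity at all.
print(shared_kmer_score("AAGAAGAAG", "AACAACAAC"))
```

The second call shows the kind of pair a word-based heuristic can miss, which a dynamic programming alignment would still score correctly.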
  18. 18. Global Alignment Needleman-Wunsch Algorithm <ul><li>Formula </li></ul><ul><li>\[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{(D)} \\ F(i-1,j) - d & \text{(H)} \\ F(i,j-1) - d & \text{(V)} \end{cases} \] </li></ul><ul><li>[Figure: cell F(i,j) is computed from its three neighbours F(i-1,j-1) (D), F(i-1,j) (H) and F(i,j-1) (V)] </li></ul>
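The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a simple match/mismatch score stands in for s(x_i, y_j) instead of a PAM or BLOSUM matrix, the default values of `match`, `mismatch` and `d` are arbitrary, and the pointer labels D, H and V follow the slide.

```python
def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1, d: int = 2):
    """Global alignment with a linear gap penalty d.

    Returns (score, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]

    # Boundary cells: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "H"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "V"

    # Fill the matrix with the three-way maximum from the slide.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "D"),  # align x_i with y_j
                (F[i - 1][j] - d, "H"),      # x_i against a gap
                (F[i][j - 1] - d, "V"),      # y_j against a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right cell to the top-left cell.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "D":
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "H":
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


score, ax, ay = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(ax)
print(ay)
```

When several candidates tie, this sketch keeps the first one (diagonal before gaps), so it reports one optimal alignment rather than the full set.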
  19. 19. Global Alignment Needleman-Wunsch Algorithm <ul><li>Gap penalties </li></ul><ul><ul><li>Linear score: \( f(g) = -gd \) </li></ul></ul><ul><ul><li>Affine score: \( f(g) = -d - (g-1)e \) </li></ul></ul><ul><ul><ul><li>d = gap open penalty, e = gap extend penalty </li></ul></ul></ul><ul><ul><ul><li>g = gap length </li></ul></ul></ul><ul><li>Trace back </li></ul><ul><ul><li>Start from the value in the bottom-right corner and trace back to the top-left corner (i.e. the alignment always runs end to end). </li></ul></ul><ul><li>Algorithm complexity </li></ul><ul><ul><li>It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences. </li></ul></ul>
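The two gap penalty formulas are easy to compare numerically. In the sketch below the function names and the example values d = 10 and e = 1 are assumptions chosen only for illustration.

```python
def linear_gap(g: int, d: int) -> int:
    """Linear gap score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g: int, d: int, e: int) -> int:
    """Affine gap score f(g) = -d - (g-1)*e: open once, then extend."""
    return -d - (g - 1) * e


# Hypothetical penalties: gap open d = 10, gap extend e = 1.
for g in (1, 2, 5):
    print(g, linear_gap(g, 10), affine_gap(g, 10, 1))
```

Both schemes agree for a gap of length 1. For longer gaps the affine score is less severe when e is smaller than d, so one long gap is penalized less than several short ones.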
  20. 20. Local Alignment Smith-Waterman Algorithm <ul><li>Same as the global alignment algorithm, with TWO differences. </li></ul><ul><li>F(i,j) takes the value 0 (zero) if all other options have a value less than 0. </li></ul><ul><li>The alignment can end anywhere in the matrix. </li></ul><ul><ul><li>Take the highest value of F(i,j) over the whole matrix and start the trace back from there; the trace back stops when it reaches a cell with value 0. </li></ul></ul>
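The two differences from global alignment can be seen directly in code. This is a minimal sketch assuming a simple match/mismatch score in place of a substitution matrix and a linear gap penalty; the default parameter values are arbitrary illustrations.

```python
def smith_waterman(x: str, y: str, match: int = 2, mismatch: int = -1, d: int = 2):
    """Local alignment with a linear gap penalty d.

    Returns (score, aligned_x, aligned_y) for one best local alignment.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Difference 1: a cell never drops below zero.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: trace back from the highest cell until a zero is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

In the example the flanking regions do not match, so only the shared core of the two sequences is reported.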
  21. 21. Sequence Database Search <ul><li>Heuristic sequence database searching packages </li></ul><ul><ul><li>BLAST & FASTA </li></ul></ul><ul><li>Significance of Score </li></ul><ul><ul><li>Z-score = (score – mean)/std. dev </li></ul></ul><ul><ul><ul><li>Measures how unusual our original match is. </li></ul></ul></ul><ul><ul><ul><li>Z ≥ 5 are significant. </li></ul></ul></ul><ul><ul><li>P-value measures the probability that the alignment is no better than random. (Z and P depend on the distribution of the scores) </li></ul></ul><ul><ul><ul><li>P ≤ \(10^{-100}\): exact match. </li></ul></ul></ul><ul><ul><li>E-value is the expected number of sequences that give the same Z-score or better. (E = P × size of the database) </li></ul></ul><ul><ul><ul><li>E ≤ 0.02: sequences probably homologous </li></ul></ul></ul>
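The Z-score and E-value formulas on this slide can be computed as follows. The sketch makes several assumptions: the background scores are hypothetical numbers standing in for scores against unrelated or shuffled sequences, the sample standard deviation is used for "std. dev", and the P-value and database size in the example are invented for illustration.

```python
from statistics import mean, stdev


def z_score(score: float, background_scores: list[float]) -> float:
    """Z = (score - mean) / std. dev of the background score distribution."""
    return (score - mean(background_scores)) / stdev(background_scores)


def e_value(p_value: float, database_size: int) -> float:
    """E = P x size of the database."""
    return p_value * database_size


# Hypothetical background scores and a hypothetical hit score of 40.
background = [20, 22, 18, 21, 19]
z = z_score(40, background)
print(f"Z = {z:.2f}, significant: {z >= 5}")

# Hypothetical P-value and database size.
e = e_value(1e-6, 100_000)
print(f"E = {e:.3g}")
```

The example also shows why the E-value matters: a P-value that looks small can still give an E-value above the 0.02 guideline once it is multiplied by a large database size.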
  22. 22. Web based server development <ul><li>Design the web page to collect the input data </li></ul><ul><li>Use a cgi-bin program or Perl script to parse the submitted data </li></ul><ul><li>Invoke the corresponding program to get the appropriate results </li></ul><ul><li>Send the results either by e-mail or directly to the web page </li></ul>
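The parse, invoke and respond steps listed above can be sketched as a single request handler. The slide names cgi-bin or Perl; Python is used here only to stay consistent with the other examples. The form field name `sequence`, the handler `handle_submission` and the stand-in analysis `gc_content` are all hypothetical.

```python
from html import escape
from urllib.parse import parse_qs


def gc_content(sequence: str) -> float:
    """Hypothetical stand-in for the analysis program: fraction of G and C."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def handle_submission(query_string: str) -> str:
    """Parse submitted form data, run the analysis, return an HTML fragment."""
    fields = parse_qs(query_string)
    sequence = fields.get("sequence", [""])[0].strip().upper()
    if not sequence:
        return "<p>No sequence submitted.</p>"
    result = gc_content(sequence)
    # Escape user input before echoing it back into the page.
    return f"<p>Sequence: {escape(sequence)}</p><p>GC content: {result:.1%}</p>"


print(handle_submission("sequence=atgc"))
```

A real server would wrap this handler in a web framework or CGI entry point and could send the same result by e-mail instead of returning it to the page.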
  23. 23. PickFold <ul><li>Predict a fold for an amino acid sequence </li></ul><ul><li>To develop a fold recognition technique that is sensitive in detecting folds of sequences in the twilight zone (sequences sharing less than 25% identity). </li></ul>
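The 25% identity threshold for the twilight zone can be checked on an existing pairwise alignment. The sketch below is not part of PickFold; it assumes identity is measured over aligned columns that contain no gap, which is one of several conventions in use, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)


def in_twilight_zone(aligned_a: str, aligned_b: str, threshold: float = 25.0) -> bool:
    """True when the pair shares less than `threshold` percent identity."""
    return percent_identity(aligned_a, aligned_b) < threshold


print(percent_identity("ACDE", "ACDF"))
print(in_twilight_zone("ACDEFGHI", "AKLMNPQR"))
```

Pairs flagged this way are the ones where plain sequence alignment is unreliable and a fold recognition method becomes useful.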
  24. 24. Workflow
  25. 25. PickFold <ul><li>Sequence to Protein Fold </li></ul><ul><li>Follows … </li></ul>
  26. 26. Biological Crisis Management Future Perspective of Biological Crisis Management Follows …
  27. 27. Applications of Bioinformatics <ul><li>Agricultural </li></ul><ul><ul><li>Genetically Modified Plants, Vegetables </li></ul></ul><ul><ul><li>GM Food </li></ul></ul><ul><li>Pharmaceutical </li></ul><ul><ul><li>Molecular Modelling based Drug Discovery </li></ul></ul><ul><li>Medical </li></ul><ul><ul><li>Gene Therapy </li></ul></ul>
  28. 28. Bioinformatics Skills <ul><li>Algorithm development </li></ul><ul><ul><li>Coding – Testing – Documentation </li></ul></ul><ul><ul><ul><li>Programming skills in C, C++, Java, … </li></ul></ul></ul><ul><ul><ul><li>Data structures, sorting, searching, statistics & probability </li></ul></ul></ul><ul><li>Database Management </li></ul><ul><ul><li>Creation, compilation, updating & web based search </li></ul></ul><ul><ul><ul><li>CGI scripts, JavaScript, Perl, JDBC, ASP, ... </li></ul></ul></ul><ul><li>Graphics </li></ul><ul><ul><li>2D & 3D graphics - GUI </li></ul></ul><ul><li>Web page design & Automatic Web servers </li></ul><ul><ul><li>Java Applets, JavaScript, Java Servlets, RMI, … </li></ul></ul><ul><li>Commercial Products - Package/ Tools – Sales !! </li></ul>
  29. 29. Important Bioinformatics Resources <ul><li>NCBI, NIH - </li></ul><ul><li>EMBL, EBI - </li></ul><ul><li>ExPASy, Swiss - </li></ul><ul><li>DDBJ - </li></ul><ul><li>PDB - </li></ul><ul><li>GCG - </li></ul>
  30. 30. BIOINFORMATICS JOBS <ul><li>Bioinformatics Scientist / Analyst </li></ul><ul><li>Bio-programmer </li></ul><ul><li>Bioinformatics software engineer </li></ul><ul><li>Web Developer </li></ul><ul><li>Network Programmer </li></ul><ul><li>Database Programmer </li></ul><ul><li>System Engineer / Analyst </li></ul>
  31. 31. BT versus IT <ul><li>Bioinformatics, including Biotechnology (BT), requires a lot of Information Technology (IT) skills for genome annotation projects </li></ul><ul><li>Bioinformatics is also a promising area for IT professionals </li></ul><ul><li>Genome projects will be the next huge task for IT industries (like the Y2K problem in the past) </li></ul><ul><li>BT will take on IT in the near future … </li></ul>
  32. 32. Conclusions <ul><li>Developing Web based Bioinformatics tools </li></ul><ul><ul><li>Develop/modify useful algorithms </li></ul></ul><ul><ul><li>Write the source code </li></ul></ul><ul><ul><li>Create/maintain a Web based server </li></ul></ul><ul><li>Using existing Web based tools efficiently </li></ul><ul><li>Bio-ethics & Bio-safety </li></ul><ul><ul><li>Never develop or use a bioinformatics tool that is harmful to the environment or society </li></ul></ul>
  33. 33. References (latest) <ul><li>N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, Ane Books, New Delhi (2005). </li></ul><ul><li>A. M. Lesk, Introduction to Bioinformatics, Oxford University Press, New Delhi (2003). </li></ul><ul><li>D. Higgins and W. Taylor (Eds), Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, New Delhi (2000). </li></ul><ul><li>R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, Cambridge, UK (1998). </li></ul><ul><li>A. Baxevanis and B. F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley-Interscience, Hoboken, NJ (1998). </li></ul><ul><li>M. S. Waterman, Introduction to Computational Biology, Chapman & Hall (1995). </li></ul><ul><li>J. A. Glasel and M. P. Deutscher (Eds), Introduction to Biophysical Methods for Protein and Nucleic Acid Research, Academic Press, New York (1995). </li></ul>
  34. 34. Lecture Notes <ul><li>Available at ICGEBnet </li></ul><ul><ul><li>Distant homology </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul><ul><ul><li>Biorithms </li></ul></ul><ul><ul><ul><li> </li></ul></ul></ul>
  35. 35. <ul><li>Thank You </li></ul>