Bioinformatics A Biased Overview


Lecture given to graduate microbiology students on examples of work in bioinformatics. Date:

Published in: Technology, Education
  • 2D hyperbolic view of the phylogenetic tree, colored by the origin of the sequences (red, ocean data set from CVI; blue, NCBI NR). Alignment performed by MUSCLE on sequences identified in a joined ocean80_nr80 database by PDB-BLAST search. Visualization by the HyperTree program from Sugen.
  • Tuberculosis, which is caused by the bacterial pathogen Mycobacterium tuberculosis, is a leading cause of mortality among the infectious diseases. The World Health Organization (WHO) has estimated that almost one-third of the world's population, around 2 billion people, is infected with the disease. Every year, more than 8 million people develop an active form of the disease, which claims the lives of nearly 2 million. This translates to over 4,900 deaths per day, and more than 95% of these are in developing countries. Despite the current global situation, antitubercular drugs have remained largely unchanged over the last four decades. The widespread use of these agents has provided a strong selective pressure on M. tuberculosis, thus encouraging the emergence of resistant strains. Multidrug resistant (MDR) tuberculosis is defined as resistance to the first-line drugs isoniazid and rifampin. The effective treatment of MDR tuberculosis necessitates long-term use of second-line drug combinations, an unfortunate consequence of which is the emergence of further drug resistance. Enter extensively drug resistant (XDR) tuberculosis: M. tuberculosis strains that are resistant to both isoniazid and rifampin, as well as key second-line drugs. Since the only remaining drug classes exhibit such low potency and high toxicity, XDR tuberculosis is extremely difficult to treat. The rise of XDR tuberculosis around the world poses a great threat to human health, making the development of new antitubercular agents an urgent priority. Very few Mtb proteins have been explored as drug targets.
  • The TB proteome contains 3,996 proteins. There are 749 solved structures in the PDB, representing a total of 284 proteins (7.1% coverage). ModBase contains homology models for the entire TB proteome; 1,446 ‘high quality’ homology models were added to the data set, increasing structural coverage to 43.3%. Model selection: retained only those models with a model score of > 0.7 and a ModPipe quality score of > 1.1 (2,818 models). There were multiple models per protein, so for each TB protein the model with the best model score was chosen and, if these were equal, the model with the best ModPipe quality score (1,703 models). However, 251 (+6) models were removed since they correspond to TB proteins that already have solved structures, leaving 1,446 models. The model score is a score for the reliability of a model, derived from statistical potentials (F. Melo, R. Sanchez, A. Sali, 2001). A model is predicted to be good when the model score is higher than a pre-specified cutoff (0.7). A reliable model has a probability of the correct fold that is larger than 95%. A fold is correct when at least 30% of its Cα atoms superpose within 3.5 Å of their correct positions. The ModPipe Protein Quality Score (MPQS) is a composite score comprising sequence identity to the template, coverage, and the three individual scores E-value, z-DOPE and GA341. We consider an MPQS of > 1.1 as reliable.
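The model selection rule in the note above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Model` record, the `select_models` function, and the protein names are all hypothetical, and only the two cutoffs (model score > 0.7, ModPipe quality score > 1.1) and the tie-break order come from the note.

```python
from typing import NamedTuple


class Model(NamedTuple):
    """Hypothetical record for one homology model (not a ModBase data type)."""
    protein: str
    model_score: float
    mpqs: float  # ModPipe Protein Quality Score


def select_models(models, solved, min_score=0.7, min_mpqs=1.1):
    """Keep one model per protein, following the rule described in the note.

    1. Drop models that do not exceed both cutoffs.
    2. Drop models for proteins that already have a solved structure.
    3. Per protein, keep the best model score; break ties on the quality score.
    """
    best = {}
    for m in models:
        if m.model_score <= min_score or m.mpqs <= min_mpqs:
            continue
        if m.protein in solved:
            continue
        current = best.get(m.protein)
        # Tuple comparison orders by model score first, then by quality score.
        if current is None or (m.model_score, m.mpqs) > (current.model_score, current.mpqs):
            best[m.protein] = m
    return best


# Made-up example records, for illustration only.
models = [
    Model("protA", 0.95, 1.4),
    Model("protA", 0.95, 1.6),  # same model score, better quality score
    Model("protB", 0.65, 1.5),  # fails the model score cutoff
    Model("protC", 0.90, 1.0),  # fails the quality score cutoff
    Model("protD", 0.99, 1.8),  # already has a solved structure
    Model("protE", 0.80, 1.2),
]
selected = select_models(models, solved={"protD"})
print(sorted(selected))
```

Removing proteins with solved structures before or after picking the best model gives the same final set; the note applies it last, which is why it reports the intermediate count of 1,703.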
  • (nutraceuticals excluded)
  • Multi-target therapy may be more effective than single-target therapy for treating infectious diseases. Most of the proteins listed are potential novel drug targets for the development of efficient anti-tuberculosis chemotherapeutics. GSMN-TB: Genome Scale Metabolic Reaction Network of M.tb (http://sysbio/): 849 reactions, 739 metabolites, 726 genes. The model can be optimized for in vivo growth. One can carry out multiple gene inhibition and compute the maximal theoretical growth rate (if it is close to zero, that combination of genes is essential for growth).
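The idea of testing gene combinations for essentiality can be illustrated on a toy network. The sketch below is a deliberate simplification and an assumption throughout: it replaces the growth-rate optimization of a genome-scale model with a yes/no reachability check (can biomass still be produced from the nutrient?), and the four genes and metabolites are invented, not taken from GSMN-TB.

```python
from itertools import combinations

# Hypothetical toy network: each gene catalyses one reaction that converts
# one metabolite into another. None of these names come from GSMN-TB.
REACTIONS = {
    "geneA": ("nutrient", "m1"),
    "geneB": ("m1", "biomass"),
    "geneC": ("nutrient", "m2"),
    "geneD": ("m2", "biomass"),
}


def can_grow(inhibited, source="nutrient", target="biomass"):
    """Return True if the target metabolite is still producible from the source."""
    reachable = {source}
    changed = True
    while changed:
        changed = False
        for gene, (substrate, product) in REACTIONS.items():
            if gene in inhibited:
                continue
            if substrate in reachable and product not in reachable:
                reachable.add(product)
                changed = True
    return target in reachable


def essential_combinations(size):
    """List gene sets of the given size whose joint inhibition blocks growth."""
    return [set(c) for c in combinations(sorted(REACTIONS), size) if not can_grow(set(c))]


print(essential_combinations(1))  # single inhibitions
print(essential_combinations(2))  # pairwise inhibitions
```

In this toy network no single gene is essential because two parallel routes lead to biomass, but any pair that cuts both routes is. That is the kind of combination a multi-target strategy looks for; a real analysis would compute a quantitative growth rate rather than reachability.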

    1. 1. A {Biased} Overview of Bioinformatics with Examples Drawn from Our Own Work Philip E. Bourne Professor of Pharmacology UCSD [email_address] Bioinformatics - Overview
    2. 2. There Are Multiple Types of Informatics in the Life Sciences. Pharmacy Informatics: drug dosing, pharmacokinetics, pharmacy information systems. Biomedical Informatics: EHR, decision support systems, hospital information systems. Bioinformatics: algorithms, genomics, proteomics, biological networks, systems biology. Note: These are only representative examples
    3. 3. There Are Multiple Types of Informatics in the Life Sciences. Pharmacy Informatics, Biomedical Informatics, Bioinformatics: controlled vocabularies, ontologies, literature searching, data management, pharmacogenomics, personalized medicine. Note: These are only representative examples
    4. 4. Bioinformatics In One Slide [Figure: Biological Experiment → Data → Information → Knowledge → Discovery, via Collect, Characterize, Compare, Model, Infer. Scales of complexity: Sequence, Structure, Assembly, Sub-cellular, Cellular, Organ, Higher-life. Computing power and sequencing data plotted against year (1990 to 2005), with milestones including ESTs, Yeast Genome, E. coli Genome, C. elegans Genome, Human Genome Project, Human Genome, 1 Small Genome/Mo., Gene Chips, Virus Structure, Ribosome, Metabolic Pathway of E. coli, Genetic Circuits, Neuronal Modeling, Cardiac Modeling, Brain Mapping, GWAS. People per web site: Virtual Communities, Blogs, Facebook. The Omics Revolution.]
    5. 5. Bioinformatics – One Definition <ul><li>The integration of biological data in digital form from different sources and possibly different scales (complexity), usually collected by others, and subsequently analyzed to offer new biological insights </li></ul>
    6. 6. Biological Scales (Complexity): Genomics, Proteomics, Protein-protein interactions, Biological Networks, Systems Biology. We will look at an example of how bioinformatics is used at each scale
    7. 7. Some Thoughts on Genomic Data <ul><li>It's scary </li></ul><ul><li>It's time to consider cost vs benefit </li></ul><ul><li>Reductionism is not a dirty word </li></ul><ul><li>We need to do more with the long tail </li></ul>On the Future of Genomic Data. Science 11 February 2011: vol. 331 no. 6018 728-729
    8. 8. Bioinformatics & Metagenomics <ul><li>New type of genomics </li></ul><ul><li>New data (and lots of it) and new types of data </li></ul><ul><ul><li>17M new (predicted) proteins! 4-5x growth in just a few months, and much more coming </li></ul></ul><ul><ul><li>New challenges and exacerbation of old challenges </li></ul></ul>Bioinformatics at Different Scales - Genomics
    9. 9. Metagenomics: Early Results <ul><li>More than 99.5% of DNA in every environment studied represents unknown organisms </li></ul><ul><li>Most genes represent distant homologs of known genes, but there are thousands of new families </li></ul><ul><li>Environments being studied: </li></ul><ul><ul><li>Water (ocean, lakes) </li></ul></ul><ul><ul><li>Air </li></ul></ul><ul><ul><li>Soil </li></ul></ul><ul><ul><li>Human body (gut, oral cavity, human microbiome) </li></ul></ul>Bioinformatics at Different Scales - Genomics
    10. 10. Metagenomics New Discoveries: Environmental (red) vs. Currently Known PTPases (blue) [Figure: tree with a branch labeled Higher eukaryotes and clusters labeled 1 to 4] Bioinformatics at Different Scales - Genomics
    11. 11. Proteomics
    12. 12. It's Not Just About Numbers, It's About Complexity [Figure: number of released entries by year, courtesy of the RCSB Protein Data Bank] Bioinformatics at Different Scales - Proteomics
    13. 13. Determining 3D Structures – The Impact of Bioinformatics. Basic Steps: Target Selection; <ul><li>Crystallomics </li></ul><ul><li>Isolation, </li></ul><ul><li>Expression, </li></ul><ul><li>Purification, </li></ul><ul><li>Crystallization </li></ul>Data Collection; Structure Solution; Structure Refinement; Functional Annotation; Publish. Annotations on the steps: structural biology moves from being functionally driven to genomically driven; fill in protein fold space; robotics; -ve data; software engineering; functional prediction; not necessarily. Bioinformatics at Different Scales - Proteomics
    14. 14. Bioinformatics at Different Scales - Proteomics
    15. 15. Nature’s Reductionism: There are ~20^300 possible proteins >>>> all the atoms in the Universe; ~20M protein sequences from UniProt/TrEMBL; ~75,000 protein structures; these yield ~1500 folds, ~2000 superfamilies, ~4000 families (SCOP 1.75). Using Protein Structure to Study Evolution
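The size comparison on this slide is easy to check with logarithms. The sketch below assumes a chain length of 300 residues and 20 amino acid types (read off the slide's 20^300) and uses the commonly quoted order-of-magnitude estimate of about 10^80 atoms in the observable universe, which is an outside assumption and not a figure from the slide.

```python
import math

RESIDUE_TYPES = 20    # standard amino acids
CHAIN_LENGTH = 300    # residues per protein, as implied by 20^300
LOG10_ATOMS_IN_UNIVERSE = 80  # assumption: common order-of-magnitude estimate

# log10(20^300) = 300 * log10(20); working in log space avoids huge integers.
log10_sequences = CHAIN_LENGTH * math.log10(RESIDUE_TYPES)

# ~20M known sequences, expressed on the same log scale.
log10_known = math.log10(20_000_000)

print(f"possible sequences ~ 10^{log10_sequences:.0f}")
print(f"known sequences    ~ 10^{log10_known:.1f}")
print(f"excess over atoms  ~ 10^{log10_sequences - LOG10_ATOMS_IN_UNIVERSE:.0f}")
```

The gap between the space of possible sequences and the roughly 1500 observed folds is the "reductionism" the slide title refers to.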
    16. 16. Structure Provides an Evolutionary Fingerprint: Distribution among the three kingdoms as taken from SUPERFAMILY <ul><li>Superfamily distributions would seem to be related to the complexity of life </li></ul>[Figure: diagram of superfamily counts across the three kingdoms, labeled as any genome / all genomes: 153/14, 9/1, 21/2, 310/0, 645/49, 29/0, 68/0] Using Protein Structure to Study Evolution
    17. 17. Method – Distance Determination: Presence/Absence Data Matrix → Distance Matrix. Using Protein Structure to Study Evolution
        Presence/absence data matrix (SCOP SUPERFAMILY fold superfamilies (FSF) by organism):
        FSF      | C. intestinalis | C. briggsae | F. rubripes
        a.1.1    | 1               | 1           | 1
        a.1.2    | 1               | 1           | 1
        a.10.1   | 0               | 0           | 1
        a.100.1  | 1               | 1           | 1
        a.101.1  | 0               | 0           | 0
        a.102.1  | 0               | 1           | 1
        a.102.2  | 1               | 1           | 1
        Distance matrix:
                        | C. intestinalis | C. briggsae | F. rubripes
        C. intestinalis | 0               | 101         | 109
        C. briggsae     |                 | 0           | 144
        F. rubripes     |                 |             | 0
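The step from a presence/absence matrix to a distance matrix can be sketched as follows. The seven rows are the excerpt shown on this slide; the distance used here (the number of superfamilies present in one organism but not the other, i.e. a Hamming distance) is an assumption, since the slide does not state the exact measure used in the study.

```python
ORGANISMS = ["C. intestinalis", "C. briggsae", "F. rubripes"]

# Excerpt of the presence/absence matrix from the slide:
# one row per fold superfamily (FSF), one column per organism.
PRESENCE = {
    "a.1.1":   (1, 1, 1),
    "a.1.2":   (1, 1, 1),
    "a.10.1":  (0, 0, 1),
    "a.100.1": (1, 1, 1),
    "a.101.1": (0, 0, 0),
    "a.102.1": (0, 1, 1),
    "a.102.2": (1, 1, 1),
}


def distance_matrix(presence, n_organisms):
    """Count, for each pair of organisms, the superfamilies on which they differ."""
    dist = [[0] * n_organisms for _ in range(n_organisms)]
    for row in presence.values():
        for i in range(n_organisms):
            for j in range(i + 1, n_organisms):
                if row[i] != row[j]:
                    dist[i][j] += 1
                    dist[j][i] += 1
    return dist


matrix = distance_matrix(PRESENCE, len(ORGANISMS))
for name, row in zip(ORGANISMS, matrix):
    print(f"{name:16s}", row)
```

On this seven-row excerpt the distances are tiny; the values 101, 109 and 144 on the slide come from the full set of superfamilies. A tree-building method would then take the full distance matrix as input.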
    18. 18. If Structure is so Conserved is it a Useful Tool in the Study of Evolution? The Answer Would Appear to be Yes <ul><li>It is possible to generate a reasonable tree of life from merely the presence or absence of superfamilies (FSFs) within a given proteome </li></ul>Using Protein Structure to Study Evolution Yang, Doolittle & Bourne (2005) PNAS 102(2) 373-8
    19. 19. The Influence of Environment on Life. Chris Dupont, Scripps Institution of Oceanography, UCSD. Dupont, Yang, Palenik, Bourne. 2006 PNAS 103(47) 17822-17827 Using Protein Structure to Study Evolution
    20. 20. Consider the Distribution of Disulfide Bonds among Folds <ul><li>Disulfides are only stable under oxidizing conditions </li></ul><ul><li>Oxygen content gradually accumulated during the earth’s evolution </li></ul><ul><li>The divergence of the three kingdoms occurred 1.8-2.2 billion years ago </li></ul><ul><li>Oxygen began to accumulate ~2.0 billion years ago </li></ul><ul><li>Logical deduction – disulfides more prevalent in folds (organisms) that evolved later </li></ul><ul><li>This would seem to hold true </li></ul><ul><li>Can we take this further? </li></ul>Using Protein Structure to Study Evolution
    21. 21. Evolution of the Earth <ul><li>4.5 billion years of change </li></ul><ul><li>300 ± 50 K </li></ul><ul><li>1-5 atmospheres </li></ul><ul><li>Constant photoenergy </li></ul><ul><li>Chemical and geological changes </li></ul><ul><li>Life has evolved in this time </li></ul><ul><li>The ocean was the “cradle” for 90% of evolution </li></ul>Using Protein Structure to Study Evolution
    22. 22. Theoretical Levels of Trace Metals and Oxygen in the Deep Ocean Through Earth’s History <ul><li>Whether the deep ocean became oxic or euxinic following the rise in atmospheric oxygen (~2.3 Gya) is debated, therefore both are shown (oxic ocean: solid lines; euxinic ocean: dashed lines). </li></ul><ul><li>The phylogenetic tree symbols at the top of the figure show one idea as to the theoretical periods of diversification for each Superkingdom. </li></ul>Replotted from Saito et al, 2003 Inorganica Chimica Acta 356: 308-318 Using Protein Structure to Study Evolution
    23. 23. The Gaia Hypothesis <ul><li>Gaia - a complex entity involving the Earth's biosphere, atmosphere, oceans, and soil; the totality constituting a feedback system which seeks an optimal physical and chemical environment for life on this planet. </li></ul>James Lovelock. Gaia (pronounced /'geɪ.ə/ or /'gaɪ.ə/), "land" or "earth", from the Greek Γαῖα, is a Greek goddess personifying the Earth. Using Protein Structure to Study Evolution
    24. 24. The Question <ul><li>Have the emergent properties of an organism as judged by its protein content been influenced by the environment? </li></ul><ul><li>Will do this by consideration of the metallomes of a broad range of species </li></ul><ul><li>The metallomes can only be deduced by consideration of the protein structures to which the metal is covalently bound </li></ul><ul><li>Will hypothesize that these emergent properties in turn influenced the environment </li></ul>Using Protein Structure to Study Evolution
    25. 25. Making the Metallome of Each Species – Can Only be Done from Structure and Requires Human Effort <ul><li>Start with SCOP </li></ul><ul><li>Each {super}family level assignment was checked manually for metal binding </li></ul><ul><li>All the structures representing the family had to bind the metal for it to be considered unambiguous </li></ul><ul><li>The literature was consulted to resolve ambiguities </li></ul><ul><li>Superfamily database used to map to proteomes </li></ul><ul><li>23 Archaea, 233 Bacteria, 57 Eukaryota </li></ul><ul><li>Cu, Ni, Mo ignored (<0.3% of proteome) </li></ul>Using Protein Structure to Study Evolution
    26. 26. Levels of Ambiguity <ul><li>An ambiguous superfamily binds different metals or has members that are not known to bind metals </li></ul><ul><li>Ditto families </li></ul><ul><li>Approx 50% of superfamilies and 10% of families are ambiguous </li></ul><ul><li>Only unambiguous families used in this study </li></ul>Using Protein Structure to Study Evolution
    27. 27. Superfamily Distribution As Well As Overall Content Has Changed Using Protein Structure to Study Evolution
    28. 28. Metal Binding Proteins are Not Consistent Across Superkingdoms Since these data are derived from current species they are independent of evolutionary events such as duplication, gene loss, horizontal transfer and endosymbiosis Using Protein Structure to Study Evolution
    29. 29. Power Laws: Fundamental Constants in the Evolution of Proteomes <ul><li>A slope of 1 indicates that a group of structural domains is in equilibrium with genome growth, while a slope > 1 indicates that the group of domains is being preferentially duplicated (or retained in the case of genome reductions). </li></ul>van Nimwegen E (2006) in: Koonin EV, Wolf YI, Karev GP, (Ed.). Power laws, scale-free networks, and genome biology Using Protein Structure to Study Evolution
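The slope referred to on this slide is the exponent of a power law, which becomes a straight line on log-log axes. The following sketch estimates it by ordinary least squares on log-transformed values; the genome sizes and domain counts are synthetic numbers made up for illustration, and the function name is hypothetical.

```python
import math


def power_law_slope(genome_sizes, domain_counts):
    """Least-squares slope of log10(domain count) against log10(genome size)."""
    xs = [math.log10(x) for x in genome_sizes]
    ys = [math.log10(y) for y in domain_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data, not measurements: number of genes in four imaginary genomes.
sizes = [1000, 2000, 4000, 8000]

# A domain group that grows in proportion to genome size (exponent 1).
in_equilibrium = [0.01 * s for s in sizes]

# A domain group that is preferentially duplicated (exponent 1.5).
preferential = [0.001 * s ** 1.5 for s in sizes]

print(round(power_law_slope(sizes, in_equilibrium), 3))
print(round(power_law_slope(sizes, preferential), 3))
```

With real proteomes the points scatter around the line, so the fitted slope carries an uncertainty that should be reported alongside it when comparing superkingdoms.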
    30. 30. Why are the Power Laws Different for Each Superkingdom? <ul><li>Power laws are likely influenced by selective pressure. Qualitatively, the differences in the power law slopes describing Eukarya and Prokarya are correlated to the shifts in trace metal geochemistry that occur with the rise in oceanic oxygen </li></ul><ul><li>We hypothesize that proteomes contain an imprint of the environment at the time of the last common ancestor in each Superkingdom </li></ul><ul><li>This suggests that Eukarya evolved in an oxic environment, whereas the Prokarya evolved in anoxic environments </li></ul>Using Protein Structure to Study Evolution
    31. 31. Do the Metallomes Contain Further Support for this Hypothesis? Using Protein Structure to Study Evolution
    32. 32. e- Transfer Proteins: Same Broad Function, Same Metal, Different Chemistry Induced by the Environment? <ul><li>Fe-S clusters </li></ul><ul><ul><li>Fe bound by S </li></ul></ul><ul><ul><li>Cluster held in place by Cys </li></ul></ul><ul><ul><li>Generally negative reduction potentials </li></ul></ul><ul><ul><li>Very susceptible to oxidation </li></ul></ul><ul><li>Cytochromes </li></ul><ul><ul><li>Fe bound by heme (and amino acids) </li></ul></ul><ul><ul><li>Generally positive reduction potentials </li></ul></ul><ul><ul><li>Less susceptible to oxidation </li></ul></ul>Using Protein Structure to Study Evolution
    33. 33. Hypothesis <ul><li>Emergence of cyanobacteria changed oxygen concentrations </li></ul><ul><li>This impacted relative metal ion concentrations in the ocean </li></ul><ul><li>Organisms evolved to use these metals in new ways to evolve new biological processes, e.g., complex signaling </li></ul><ul><li>This in turn further impacted the environment </li></ul><ul><li>Only protein structures could reveal such dependencies </li></ul>Using Protein Structure to Study Evolution
    34. 34. Bioinformatics in the Context of Drug Discovery
    35. 35. Our Motivation <ul><li>Tykerb – Breast cancer </li></ul><ul><li>Gleevec – Leukemia, GI cancers </li></ul><ul><li>Nexavar – Kidney and liver cancer </li></ul><ul><li>Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive </li></ul>Collins and Workman 2006 Nature Chemical Biology 2 689-700 Motivators
    36. 36. A Reverse Engineering Approach to Drug Discovery Across Gene Families: (1) Characterize the ligand binding site of the primary target (Geometric Potential); (2) Identify off-targets by ligand binding site similarity (sequence order independent profile-profile alignment); (3) Extract known drugs or inhibitors of the primary and/or off-targets; (4) Search for similar small molecules; (5) Dock molecules to both primary and off-targets; (6) Statistical analysis of docking score correlations … Computational Methodology. Xie and Bourne 2009 Bioinformatics 25(12) 305-312
    37. 37. The Problem with Tuberculosis <ul><li>One third of global population infected </li></ul><ul><li>1.7 million deaths per year </li></ul><ul><li>95% of deaths in developing countries </li></ul><ul><li>Anti-TB drugs hardly changed in 40 years </li></ul><ul><li>MDR-TB and XDR-TB pose a threat to human health worldwide </li></ul><ul><li>Development of novel, effective and inexpensive drugs is an urgent priority </li></ul>Repositioning - The TB Story
    38. 38. The TB-Drugome <ul><li>1. Determine the TB structural proteome </li></ul><ul><li>2. Determine all known drug binding sites from the PDB </li></ul><ul><li>3. Determine which of the sites found in step 2 exist in step 1 </li></ul><ul><li>4. Call the result the TB-drugome </li></ul>A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
    39. 39. 1. Determine the TB Structural Proteome <ul><li>High quality homology models from ModBase increase structural coverage from 7.1% to 43.3% </li></ul>[Figure: TB proteome of 3,996 proteins: 284 solved structures, 1,446 homology models, 2,266 remaining] A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
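The coverage percentages on this slide follow directly from the three counts it shows. A short check, using only those counts (the variable and function names are illustrative):

```python
PROTEOME_SIZE = 3996      # proteins in the TB proteome
SOLVED_STRUCTURES = 284   # proteins with a solved structure in the PDB
HOMOLOGY_MODELS = 1446    # proteins covered only by a high quality model


def coverage_percent(covered, total):
    """Structural coverage as a percentage of the proteome."""
    return 100 * covered / total


before = coverage_percent(SOLVED_STRUCTURES, PROTEOME_SIZE)
after = coverage_percent(SOLVED_STRUCTURES + HOMOLOGY_MODELS, PROTEOME_SIZE)
uncovered = PROTEOME_SIZE - SOLVED_STRUCTURES - HOMOLOGY_MODELS

print(f"coverage from solved structures: {before:.1f}%")
print(f"coverage with homology models:   {after:.1f}%")
print(f"proteins without any structure:  {uncovered}")
```

The remainder matches the fourth number on the slide, which suggests that the figure partitions the proteome into solved, modeled, and uncovered proteins.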
    40. 40. 2. Determine all Known Drug Binding Sites in the PDB <ul><li>Searched the PDB for protein crystal structures bound with FDA-approved drugs </li></ul><ul><li>268 drugs bound in a total of 931 binding sites </li></ul>[Figure: number of drug binding sites per drug; labeled drugs: methotrexate, chenodiol, alitretinoin, conjugated estrogens, darunavir, acarbose] A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
    41. 41. Map 2 onto 1 – The TB-Drugome Similarities between the binding sites of M.tb proteins (blue), and binding sites containing approved drugs (red).
    42. 42. From a Drug Repositioning Perspective <ul><li>Similarities between drug binding sites and TB proteins are found for 61/268 drugs </li></ul><ul><li>41 of these drugs could potentially inhibit more than one TB protein </li></ul>[Figure: number of potential TB targets per drug; labeled drugs: raloxifene, alitretinoin, conjugated estrogens & methotrexate, ritonavir, testosterone, levothyroxine, chenodiol] A Multi-target/drug Strategy Kinnings et al 2010 PLoS Comp Biol 6(11): e1000976
    43. 43. Top 5 Most Highly Connected Drugs
        Drug | Intended targets | Indications | No. of connections | TB proteins
        levothyroxine | transthyretin, thyroid hormone receptor α & β-1, thyroxine-binding globulin, mu-crystallin homolog, serum albumin | hypothyroidism, goiter, chronic lymphocytic thyroiditis, myxedema coma, stupor | 14 | adenylyl cyclase, argR, bioD, CRP/FNR trans. reg., ethR, glbN, glbO, kasB, lrpA, nusA, prrA, secA1, thyX, trans. reg. protein
        alitretinoin | retinoic acid receptor RXR-α, β & γ, retinoic acid receptor α, β & γ-1&2, cellular retinoic acid-binding protein 1&2 | cutaneous lesions in patients with Kaposi's sarcoma | 13 | adenylyl cyclase, aroG, bioD, bpoC, CRP/FNR trans. reg., cyp125, embR, glbN, inhA, lppX, nusA, pknE, purN
        conjugated estrogens | estrogen receptor | menopausal vasomotor symptoms, osteoporosis, hypoestrogenism, primary ovarian failure | 10 | acetylglutamate kinase, adenylyl cyclase, bphD, CRP/FNR trans. reg., cyp121, cysM, inhA, mscL, pknB, sigC
        methotrexate | dihydrofolate reductase, serum albumin | gestational choriocarcinoma, chorioadenoma destruens, hydatidiform mole, severe psoriasis, rheumatoid arthritis | 10 | acetylglutamate kinase, aroF, cmaA2, CRP/FNR trans. reg., cyp121, cyp51, lpd, mmaA4, panC, usp
        raloxifene | estrogen receptor, estrogen receptor β | osteoporosis in post-menopausal women | 9 | adenylyl cyclase, CRP/FNR trans. reg., deoD, inhA, pknB, pknE, Rv1347c, secA1, sigC
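The table on this slide is a small drug-to-protein network, and two simple questions can be asked of it: how many TB proteins does each drug connect to, and which TB proteins are connected to several drugs. The sketch below encodes only the five rows listed on the slide; the dictionary layout and the cutoff of three drugs are illustrative choices, not part of the published analysis.

```python
from collections import Counter

# Drug -> TB proteins with similar binding sites, copied from the slide's table.
DRUGOME = {
    "levothyroxine": {
        "adenylyl cyclase", "argR", "bioD", "CRP/FNR trans. reg.", "ethR",
        "glbN", "glbO", "kasB", "lrpA", "nusA", "prrA", "secA1", "thyX",
        "trans. reg. protein",
    },
    "alitretinoin": {
        "adenylyl cyclase", "aroG", "bioD", "bpoC", "CRP/FNR trans. reg.",
        "cyp125", "embR", "glbN", "inhA", "lppX", "nusA", "pknE", "purN",
    },
    "conjugated estrogens": {
        "acetylglutamate kinase", "adenylyl cyclase", "bphD",
        "CRP/FNR trans. reg.", "cyp121", "cysM", "inhA", "mscL", "pknB", "sigC",
    },
    "methotrexate": {
        "acetylglutamate kinase", "aroF", "cmaA2", "CRP/FNR trans. reg.",
        "cyp121", "cyp51", "lpd", "mmaA4", "panC", "usp",
    },
    "raloxifene": {
        "adenylyl cyclase", "CRP/FNR trans. reg.", "deoD", "inhA", "pknB",
        "pknE", "Rv1347c", "secA1", "sigC",
    },
}

# Number of TB proteins connected to each drug.
connections = {drug: len(proteins) for drug, proteins in DRUGOME.items()}

# Number of these five drugs connected to each TB protein.
drug_hits = Counter(p for proteins in DRUGOME.values() for p in proteins)

# Proteins linked to at least three of the five drugs (illustrative cutoff).
shared = [(protein, n) for protein, n in drug_hits.most_common() if n >= 3]

print(connections)
print(shared)
```

Proteins that recur across drugs are natural starting points for a multi-target or repositioning discussion, although a shared binding-site similarity is a prediction and not evidence of inhibition.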
    44. 44. Systems Biology & Drug Discovery. Chang et al. 2010 PLoS Comp. Biol. 6(9): e1000938
    45. 45. Bioinformatics & Patient Care
    46. 46. 7. Social Change Josh Sommer and Chordoma Disease
    47. 47. 5. Personalized Medicine
    48. 48. Additional Reading <ul><li> </li></ul>
    49. 49. Questions? [email_address]
    50. 50. 9 Translational Medicine