
Using Supercomputers and Gene Sequencers to Discover Your Inner Microbiome


Keynote Talk
International Conference on Computational Science
San Diego, CA
June 6, 2016

Published in: Data & Analytics

  1. 1. “Using Supercomputers and Gene Sequencers to Discover Your Inner Microbiome.” Keynote Talk, International Conference on Computational Science, San Diego, CA, June 6, 2016. Dr. Larry Smarr, Director, California Institute for Telecommunications and Information Technology; Harry E. Gruber Professor, Dept. of Computer Science and Engineering, Jacobs School of Engineering, UCSD
  2. 2. Abstract: The human body is host to 100 trillion microorganisms, ten times the number of DNA-bearing cells in the human body, and these microbes contain 300 times the number of DNA genes that our human DNA does. The microbial component of our "superorganism" comprises hundreds of species with immense biodiversity. To put a more personal face on the "patient of the future," I have been collecting massive amounts of data from my own body over the last seven years, which reveal detailed examples of the episodic evolution of this coupled immune-microbial system. Collaborating with the UC San Diego Knight Lab, we have genetically sequenced a time series of my gut microbiome, as well as single moments from 50 patients with autoimmune disease. An elaborate software pipeline, running on high performance computers, reveals the details of the microbial ecology and its genetic components, in health as well as in disease. Not only can we compare a person with a disease to a healthy population, but we can also follow the dynamics of the diseased patient. We can look forward to revolutionary changes in medical practice over the next decade.
  3. 3. Forty Years of Computing Gravitational Waves From Colliding Black Holes. 1977: L. Smarr and K. Eppley, Gravitational Radiation Computed from an Axisymmetric Black Hole Collision. 2016: LIGO Consortium, Spiral Black Hole Collision.
  4. 4. Complexity of Computing First Gut Microbiome Dynamics Versus First Dynamics of Colliding Black Holes
      • My 1975 PhD Dissertation
        – Solving Einstein’s Equations of General Relativity for Colliding Black Holes and Gravitational Waves
        – CDC 6600, Megaflop/s
        – Hundreds of Hours
      • Rob Knight and Smarr Gut Microbiome Map
        – Mapping From Illumina Sequencing to Taxonomy and Gene Abundance Dynamics
        – Comet, Petaflop/s
        – Comet Core is 40,000x CDC 6600 Speed
        – Million Core-Hours
        – 10,000x Supercomputer Time
      • Gut Microbiome Takes ~½ Billion Times the Compute Power of Early Solutions of Dynamic General Relativity
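The "~½ billion" figure on this slide follows from multiplying the two ratios it quotes. The sketch below only reproduces that arithmetic; the variable names are illustrative, and the inputs are the slide's order-of-magnitude figures rather than measurements.

```python
# Order-of-magnitude figures quoted on the slide (not measurements).
per_core_speedup = 40_000   # one Comet core vs. the CDC 6600
time_ratio = 10_000         # ~1 million core-hours vs. hundreds of hours

# Total compute ratio is the product of the speed and time ratios.
total_ratio = per_core_speedup * time_ratio

print(f"{total_ratio:,}")   # 400,000,000, i.e. roughly half a billion
```
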
  5. 5. As a Model for the Precision Medicine Initiative, I Have Tracked My Internal Biomarkers To Understand My Body’s Dynamics. Images: My Quarterly Blood Draw; Calit2 64 Megapixel VROOM
  6. 6. Only One of My Blood Measurements Was Far Out of Range, Indicating Chronic Inflammation. C-Reactive Protein (CRP) is a Blood Biomarker for Detecting Presence of Inflammation. Normal Range <1 mg/L; 27x Upper Limit. Episodic Peaks in Inflammation Followed by Spontaneous Drops
  7. 7. Adding Stool Tests Revealed Oscillatory Behavior in an Immune Variable Which is Antibacterial. Lactoferrin is a Protein Shed from Neutrophils, an Antibacterial that Sequesters Iron. Normal Range <7.3 µg/mL; 124x Upper Limit for Healthy. Typical Lactoferrin Value for Active Inflammatory Bowel Disease (IBD)
  8. 8. To Understand the Interaction of Genetics and the Immune System We Must Consider the Human Microbiome. Your Body Has 10 Times As Many Microbe Cells As DNA-Bearing Human Cells. Your Microbiome is Your “Near-Body” Environment and its Cells Contain 100x as Many DNA Genes As Your Human DNA-Bearing Cells. Inclusion of the “Dark Matter” of the Body Will Radically Alter Medicine
  9. 9. Most of Evolutionary Time Was in the Microbial World. Tree of Life Derived from 16S rRNA Sequences (figure label: “You Are Here”). Source: Carl Woese, et al.
  10. 10. The Cost of Sequencing DNA Has Fallen Over 100,000x in the Last Ten Years. This Has Enabled Sequencing of Both Human and Microbial Genomes
  11. 11. Interest in the Human Microbiome Has Moved Quickly From Frontier Science to Public Awareness. Dates shown: June 2012; June 8, 2012; June 14, 2012; August 18, 2012
  12. 12. President Obama Announces National Microbiome Initiative May 13, 2016
  13. 13. Mapping Out the Dynamics of Autoimmune Microbiome Ecology Couples Next Generation Genome Sequencers to Big Data Supercomputers
      • Inflammatory Bowel Disease (IBD) Patients
        – 5 Ileal Crohn’s Patients, 3 Points in Time
        – 2 Ulcerative Colitis Patients, 6 Points in Time
      • “Healthy” Individuals: 250 Subjects, 1 Point in Time
      • Larry Smarr (Colonic Crohn’s): 7 Points in Time
      • Each Sample Has 100-200 Million Illumina Short Reads (100 bases)
      • Total of 27 Billion Reads, or 2.7 Trillion Bases
      Source: Jerry Sheehan, Calit2; Weizhong Li, Sitao Wu, CRBS, UCSD
  14. 14. Computational NextGen Sequencing Pipeline: From Sequence to Taxonomy and Function. PI: Weizhong Li, CRBS, UCSD; NIH R01HG005978 (2010-2013, $1.1M)
  15. 15. We Used SDSC’s Gordon Data-Intensive Supercomputer to Completely Analyze a Subset of These Gut Microbiomes
      • ~180,000 Core-Hours on Gordon
        – KEGG Protein Family Annotation: 90,000 Core-Hours
        – Mapping: 36,000 Core-Hours
        – Duplicates Removal: 18,000 Core-Hours
        – Assembly: 18,000 Core-Hours
        – Other: 18,000 Core-Hours
        – Used 16 Cores/Node and up to 50 Nodes
      • Gordon RAM Required
        – 64GB RAM for Reference DB
        – 192GB RAM for Assembly
      • Gordon Disk Required
        – Ultra-Fast Disk Holds Ref DB for All Nodes
        – 8TB for All Subjects
      Enabled by a Grant of Time on Gordon from SDSC Director Mike Norman. Source: Weizhong Li, UCSD
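As a quick consistency check on the Gordon budget above, the per-step figures can be summed and expressed as shares of the total. The sketch below uses only numbers from the slide; the wall-clock figure is an idealized lower bound that assumes all 16 x 50 cores stay busy with perfect scaling, which the slide does not claim.

```python
# Core-hours per pipeline step, as listed on the slide.
gordon_core_hours = {
    "KEGG protein family annotation": 90_000,
    "Mapping": 36_000,
    "Duplicates removal": 18_000,
    "Assembly": 18_000,
    "Other": 18_000,
}

total_core_hours = sum(gordon_core_hours.values())

# Fraction of the total budget spent in each step.
shares = {step: hours / total_core_hours
          for step, hours in gordon_core_hours.items()}

# Assumption: perfect scaling on the maximum allocation (16 cores x 50 nodes).
max_cores = 16 * 50
min_wall_clock_hours = total_core_hours / max_cores

for step, share in shares.items():
    print(f"{step}: {share:.0%}")
print(f"Total: {total_core_hours:,} core-hours")
print(f"Idealized lower bound: {min_wall_clock_hours:.0f} wall-clock hours")
```

The step totals add up to the ~180,000 core-hours quoted, with annotation accounting for half of the budget.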
  16. 16. We Used Dell’s HPC Cloud to Extend Our Taxonomic Analysis to All of Our Human Gut Microbiomes
      • Dell’s Sanger Cluster
        – 32 Nodes, 512 Cores
        – 48GB RAM per Node
      • We Processed the Taxonomic Relative Abundance
        – Used ~35,000 Core-Hours on Dell’s Sanger
      • Produced Relative Abundance of ~10,000 Bacteria, Archaea, Viruses in ~300 People
        – ~3 Million Spreadsheet Cells
      Source: Weizhong Li, UCSD. Enabled by a Grant of Time From Dell/R Systems
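Relative abundance, as used on this slide, is each taxon's share of the reads assigned in one sample. The following is a minimal sketch of that normalization, not the pipeline the authors ran; the function name and the read counts are hypothetical. The last lines reproduce the slide's table-size arithmetic.

```python
def relative_abundance(read_counts):
    """Convert per-taxon read counts for one sample into fractions summing to 1."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample has no assigned reads")
    return {taxon: count / total for taxon, count in read_counts.items()}

# Hypothetical read counts for a single sample (illustration only).
sample_counts = {"Bacteroidetes": 600, "Firmicutes": 300, "Proteobacteria": 100}
sample_fractions = relative_abundance(sample_counts)
print(sample_fractions)

# Table size quoted on the slide: ~10,000 taxa across ~300 people.
spreadsheet_cells = 10_000 * 300
print(f"{spreadsheet_cells:,} cells")
```
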
  17. 17. We Found Major State Shifts in Microbial Ecology Phyla Between Healthy and Three Forms of IBD. Chart: Most Common Microbial Phyla. Panels: Average HE; Average Ulcerative Colitis; Average LS (Colonic Crohn’s Disease); Average Ileal Crohn’s Disease. Annotations: Collapse of Bacteroidetes; Explosion of Actinobacteria; Explosion of Proteobacteria; Hybrid of UC and CD; High Level of Archaea
  18. 18. Building a UC San Diego High Performance Cyberinfrastructure to Support Distributed Microbiome Analysis. Diagram components: FIONA (12 Cores/GPU, 128 GB RAM, 3.5 TB SSD, 48TB Disk, 10Gbps NIC); Knight Lab; Gordon; Prism@UCSD; Data Oasis (7.5PB, 200GB/s); Knight 1024 Cluster in SDSC Co-Lo; CHERuB; Emperor & Other Vis Tools; 64Mpixel Data Analysis Wall; PRP/. Link speeds shown: 10Gbps, 40Gbps, 100Gbps, 120Gbps, 1.3Tbps
  19. 19. We Use OpenOrd on Calit2’s 64M Pixel Tiled Wall to Explore Clustering of Patients and Microbe Species. Groups shown: Ileal Crohn’s; Healthy; Ulcerative Colitis. Source: Philip Weber, QI, UCSD
  20. 20. UCSD is Becoming a National Leader in Human Microbiome Research
  21. 21. How stable is our microbiome over time (if we’re healthy)?
  22. 22. Body sites shown: Mouth; Stool; Vagina; Skin. Source: Rob Knight, UCSD
  23. 23. Source: Rob Knight, UCSD
  24. 24. Can our microbiome ecology shift over time (if we’re not healthy)?
  25. 25. Larry’s 40 Stool Samples Over 3.5 Years to Rob’s lab on April 30, 2015
  26. 26. Larry Smarr Gut Microbiome Ecology Shifted After Drug Therapy Between Two Time-Stable Equilibriums Correlated to Physical Symptoms
      • Drug Therapy: Lialda & Uceris, 12/1/13 to 1/1/14
      • Frequent IBD Symptoms, Weight Loss: 7/1/12 to 12/1/14 (Blue Balls on Diagram to the Right)
      • Few IBD Symptoms, Weight Gain: 1/1/14 to 8/1/16 (Red Balls on Diagram to the Right)
      • Principal Coordinate Analysis of Microbiome Ecology: PCoA by Justine Debelius and Jose Navas, Knight Lab, UCSD
      • Weekly Weight: Weight Data from Larry Smarr, Calit2, UCSD
  27. 27. Each Microbe Contains a Few Thousand Genes on Its DNA. E. coli Contains ~5000 Genes on its Circular Chromosome, Which is 1000x the Length of the Cell! Several Million Genes Can Occur in the Human Gut Microbiome
  28. 28. In a “Healthy” Gut Microbiome: Large Taxonomy Variation, Low Protein Family Variation. Over 200 People. Source: Nature, 486, 207-212 (2012)
  29. 29. We Computed the Relative Abundance of 10,000 KEGG Orthologous Protein Families in Health and Disease States. KEGG: Kyoto Encyclopedia of Genes and Genomes
  30. 30. Using PCA on the 10,000 KEGG Protein Families We Can Discover Over- and Under-Abundant Genes in Health and Disease. Source: Bryn Taylor, Justine Debelius, Rob Knight, Mehrdad Yazdani, Larry Smarr, UCSD; Weizhong Li, JCVI
  31. 31. Using Kolmogorov-Smirnov Test and Random Forest Machine Learning, We Can Classify Over- and Under-Abundant Protein Families. Note: Orders of Magnitude Increase or Decrease in Protein Families Between Health and Disease. Next Step: Which Proteins (Functions) are Altered? Source: Bryn Taylor, Justine Debelius, Rob Knight, Mehrdad Yazdani, Larry Smarr, UCSD; Weizhong Li, JCVI
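To illustrate the Kolmogorov-Smirnov step named on this slide, the sketch below computes the two-sample KS statistic (the largest gap between two empirical distribution functions) for one protein family. It is a minimal pure-Python illustration, not the authors' analysis code: the function name and the abundance values are hypothetical, no p-value is computed, and the random forest step is not shown.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    n_a, n_b = len(a_sorted), len(b_sorted)
    largest_gap = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / n_a
        ecdf_b = bisect_right(b_sorted, x) / n_b
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap

# Hypothetical relative abundances of one protein family (illustration only).
healthy = [1.0e-4, 1.2e-4, 0.9e-4, 1.1e-4]
disease = [1.0e-3, 1.4e-3, 0.8e-3, 1.2e-3]

# Non-overlapping samples give the maximum possible statistic of 1.0.
print(ks_statistic(healthy, disease))
```

A statistic near 1 means the two groups barely overlap for that family; a statistic near 0 means their distributions are nearly indistinguishable.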
  32. 32. To Expand the IBD Project, the Knight/Smarr Labs Were Awarded ~1 Million Core-Hours on SDSC’s Comet Supercomputer
      • 8x Compute Resources Over Prior Study
      • Smarr Gut Microbiome Time Series
        – From 7 Samples Over 1.5 Years
        – To 50 Samples Over 4 Years
      • IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100 Patients
        – 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
        – 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD Patients
      • New Software Suite from Knight Lab
        – Re-annotation of Reference Genomes, Functional / Taxonomic Variations
        – Novel Compute-Intensive Assembly Algorithms from Pavel Pevzner
  33. 33. We Used SDSC’s Comet to Uniformly Compute Protein-Coding Genes, RNAs, & CRISPR Annotations
      • We Downloaded from NCBI Over 60,000 Bacterial and Archaea Genomes
        – Required 5 Core-Hours Per Genome
        – 300,000 Core-Hours to Complete
        – Ran 24 Cores in Parallel
        – Over 400 Days Wall-Clock Time
      • Requires a Variety of Software Programs
        – Prodigal for Gene Prediction
        – Diamond for Protein Homolog Search Against UniRef db
        – Infernal for ncRNA Prediction
        – RNAMMER for rRNA Prediction
        – Aragorn for tRNA Prediction
      • Will Make These Results a New Community Database
        – Knight Lab, Calit2, SDSC
      Source: Zhenjiang (Zech) Xu, Knight Lab, UCSD
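The core-hour and wall-clock figures on this slide can be cross-checked with a short calculation. The sketch below assumes exactly 60,000 genomes and perfect parallel efficiency on 24 cores; both are simplifications, since the slide says "over 60,000" and gives only a lower bound on wall-clock time.

```python
genomes = 60_000              # slide says "over 60,000"
core_hours_per_genome = 5
cores_in_parallel = 24

total_core_hours = genomes * core_hours_per_genome

# Assumption: all 24 cores stay busy for the whole run.
wall_clock_hours = total_core_hours / cores_in_parallel
wall_clock_days = wall_clock_hours / 24

print(f"{total_core_hours:,} core-hours")
print(f"{wall_clock_hours:,.0f} wall-clock hours, about {wall_clock_days:.0f} days")
```

Under these assumptions the run takes roughly 520 days, which is consistent with the slide's "over 400 days".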
  34. 34. Next Large Supercomputer Project: Addressing the Challenges of Metagenomic Assembly
      • Differences Between Closely Related Strains
      • Varying Coverage Depth Across Individual Genomes
      • Inter-Species Repeats (Ribosomal Genes, HGTs, etc.)
      • Huge Size and Complexity of Datasets
      Nagarajan and Pop, Nature Reviews Genetics 2013
      metaSPAdes: a new versatile assembler for metagenomic data. Sergey Nurk (1), Dmitry Meleshko (1), Anton Korobeynikov (1) and Pavel Pevzner (1,2). (1) Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia; (2) University of California San Diego, La Jolla, USA
  35. 35. Massive Research is Underway to Discover A Wide Range of New Techniques for Manipulating Your Microbiome
  36. 36. Genetic Sequencing of Humans and Their Microbes Is a Huge Growth Area and the Future Foundation of Medicine. Source: @EricTopol, Twitter, 9/27/2014
  37. 37. Thanks to Our Great Team!
      • Calit2@UCSD Future Patient Team: Jerry Sheehan, Tom DeFanti, Joe Keefe, John Graham, Kevin Patrick, Mehrdad Yazdani, Jurgen Schulze, Andrew Prudhomme, Philip Weber, Fred Raab, Ernesto Ramirez
      • UCSD CSE Department: Pavel Pevzner
      • JCVI Team: Karen Nelson, Shibu Yooseph, Manolito Torralba
      • Ayasdi: Devi Ramanan, Pek Lum
      • UCSD Metagenomics Team: Weizhong Li, Sitao Wu
      • SDSC Team: Michael Norman, Mahidhar Tatineni, Robert Sinkovits, Ilkay Altintas
      • UCSD Health Sciences Team: David Brenner
      • Rob Knight Lab: Justine Debelius, Jose Navas, Bryn Taylor, Gail Ackermann, Greg Humphrey
      • William J. Sandborn Lab: Elisabeth Evans, John Chang
      • Dell/R Systems: Brian Kucic, John Thompson, Thomas Hill