Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data

357 views

Published on

DNA Day, Department of Computer Science and Engineering, University of California, San Diego ,September 16, 2015

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data

  1. 1. “Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data” DNA Day Department of Computer Science and Engineering University of California, San Diego September 16, 2015 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD http://lsmarr.calit2.net 1
  2. 2. The National Science Foundation Has Funded Over 100 Campuses to Build “Big Data Freeways” 134 awards, 128 projects - All but 4 states - 120+ institutions
  3. 3. Creating a “Big Data” Plane on Campus: NSF Funded Prism@UCSD and CHeruB Prism@UCSD, Phil Papadopoulos, SDSC, Calit2, PI CHERuB, Mike Norman, SDSC PI CHERuB
  4. 4. SDSC Big Data Compute/Storage Facility - Interconnected at Over 1 Tbps 128 COMET VM SC 2 PF 128 Gordon Big Data SC Oasis Data Store • 128 Source: Philip Papadopoulos, SDSC/Calit2 Arista Router Can Switch 576 10Gps Light Paths 6000 TB > 800 Gbps # of Parallel 10Gbps Optical Light Paths 128 x 10Gbps = 1.3TbpsSDSC Supercomputers
  5. 5. Prism@UCSD Will Link Computational Mass Spectrometry and Genome Sequencing Cores to the Big Data Freeway ProteoSAFe: Compute-intensive discovery MS at the click of a button MassIVE: repository and identification platform for all MS data in the world Source: proteomics.ucsd.edu
  6. 6. IDI Enhanced Cyberinfrastructure Supporting Knight Lab FIONA 12 Cores/GPU 128 GB RAM 3.5 TB SSD 48TB Disk 10Gbps NIC Knight Lab 10Gbps Gordon Prism@UCSD Data Oasis 7.5PB, 100GB/s Knight 1024 Cluster In SDSC Co-Lo CHERuB 100Gbps Emperor & Other Vis Tools 64Mpixel Data Analysis Wall 120Gbps 40Gbps
  7. 7. The Pacific Wave Platform Creates a Regional Science-Driven “Big Data Freeway System” Source: John Hess, CENIC Funded by NSF $5M Oct 2015-2020 Flash Disk to Flash Disk File Transfer Rate
  8. 8. Coupling Supercomputing to Illumina Metagenomics Sequencing 5 Ileal Crohn’s Patients, 3 Points in Time 2 Ulcerative Colitis Patients, 6 Points in Time “Healthy” Individuals Source: Jerry Sheehan, Calit2 Weizhong Li, Sitao Wu, CRBS, UCSD Total of 27 Billion Reads Or 2.7 Trillion Bases Inflammatory Bowel Disease (IBD) Patients 250 Subjects 1 Point in Time 7 Points in Time Each Sample Has 100-200 Million Illumina Short Reads (100 bases) Larry Smarr (Colonic Crohn’s)
  9. 9. We Created a Reference Database Of Known Gut Genomes • NCBI April 2013 – 2471 Complete + 5543 Draft Bacteria & Archaea Genomes – 2399 Complete Virus Genomes – 26 Complete Fungi Genomes – 309 HMP Eukaryote Reference Genomes • Total 10,741 genomes, ~30 GB of sequences Now to Align Our 27 Billion Reads Against the Reference Database Source: Weizhong Li, Sitao Wu, CRBS, UCSD
  10. 10. Computational NextGen Sequencing Pipeline: From Sequence to Taxonomy and Function PI: (Weizhong Li, CRBS, UCSD): NIH R01HG005978 (2010-2013, $1.1M)
  11. 11. To Map Out the Dynamics of Autoimmune Microbiome Ecology Couples Next Generation Genome Sequencers to Big Data Supercomputers Source: Weizhong Li, UCSD Our Team Used 25 CPU-years to Compute Comparative Gut Microbiomes Starting From 2.7 Trillion DNA Bases of My Samples and Healthy and IBD Controls Illumina HiSeq 2000 at JCVI SDSC Gordon Data Supercomputer
  12. 12. Next Step Programmability, Scalability and Reproducibility using bioKepler www.kepler-project.org www.biokepler.org National Resources (Gordon) (Comet) (Stampede)(Lonestar) Cloud Resources Optimized Local Cluster Resources Source: Ilkay Altintas, SDSC
  13. 13. We Found Major State Shifts in Microbial Ecology Phyla Between Healthy and Two Forms of IBD Most Common Microbial Phyla Average HE Average Ulcerative Colitis Average LS Average Crohn’s Disease Collapse of Bacteroidetes Explosion of Actinobacteria Explosion of Proteobacteria Hybrid of UC and CD High Level of Archaea
  14. 14. Our Relative Abundance Results Across ~300 People Reveal Potential Diagnostic Species UC 100x Healthy UC 100x CD We Produced Similar Results for ~2500 Microbial Species Healthy 100x CD
  15. 15. Dell Analytics Separates The 4 Patient Types in Our Data Using Our Microbiome Species Data Source: Thomas Hill, Ph.D. Executive Director Analytics Dell | Information Management Group, Dell Software Healthy Ulcerative Colitis Colonic Crohn’s Ileal Crohn’s
  16. 16. I Built on Dell Analytics to Show Dynamic Evolution of My Microbiome Toward and Away from Healthy State – Colonic Crohn’s Healthy Ileal Crohn’s Seven Time Samples Over 1.5 Years Colonic Crohn’s
  17. 17. Time Series Reveals Oscillations in Immune Biomarkers Associated with Time Progression of Autoimmune Disease Immune & Inflammation Variables Weekly Symptoms Pharma Therapies Stool Samples 2009 20142013201220112010 2015
  18. 18. UC San Diego Will Be Carrying Out a Major Clinical Study of IBD Using These Techniques Inflammatory Bowel Disease Biobank For Healthy and Disease Patients Drs. William J. Sandborn, John Chang, & Brigid Boland UCSD School of Medicine, Division of Gastroenterology Over 200 Enrolled Announced November 7, 2014
  19. 19. Next Step Knight/Smarr Lab Collaboration • Smarr Gut Microbiome Time Series – From 7 to 50 Times Over Four Years • Healthy Human Microbiome – Use 255+ Raw Reads from NIH Human Microbiome Project • IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100 – 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank – 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD patients, • Illumina Reagent Grant Key – Enables Deep Metagenomic (and 16S) Sequencing at IGM of Smarr + Sandborn Samples • New Software Suite from Knight Lab – Major Re-annotation of Reference Genomes, Functional and Taxonomic Variations – Novel Assembly Algorithms from Pavel Pevzner-Very Computationally Intensive – See Talk Later This Morning • Supercomputer Grant On SDSC Comet (Awarded from XSEDE) – From 25 Gordon to 100 Comet Core-Years – Each Comet Core 40GF Peak=2x Gordon Core: 8X Increase in Compute

×