
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

My slides for geXC 2016 (http://www.gexc2016.com/) today.


  1. A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data. İlkay Altıntaş, Ph.D. -- Chief Data Science Officer, San Diego Supercomputer Center; Founder and Director, Workflows for Data Science Center of Excellence
  2. SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education. • Established as a national supercomputer resource center in 1985 by NSF • A world leader in HPC, data-intensive computing, and scientific data management • Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”. (Timeline: 1985 to today, marking two discoveries in drug design from 1987 and 1991.)
  3. SDSC continues to be a leader in scientific computing and big data! Gordon: First Flash-based Supercomputer for Data-intensive Apps. Comet: Serving the Long Tail of Science -- 27 standard racks = 1,944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD ~ 2 Pflop/s • 36 GPU nodes • 4 Large Memory nodes • 7 PB Lustre storage • High performance virtualization. (Image: Ross Walker Group)
  4. SDSC Data Science Office (DSO) -- Expertise, Systems and Training for Data Science Applications. SDSC DSO is a collaborative virtual organization at SDSC for collective, lasting innovation in data science research, development and education, building on SDSC expertise and strengths in big data platforms, training, and industry applications.
  5. Life Sciences is an ongoing strategic application thrust at SDSC…
  6. Genomic Analysis is a Big Data and Big Compute Problem. Big data and computing at scale enable dynamic data-driven applications: computer-aided drug discovery, personalized precision medicine, vaccine development, metagenomics… Requires: • Data management • Data-driven methods • Scalable tools for dynamic coordination and resource optimization • Skilled interdisciplinary workforce • Team work and process management
  7. A new era of data science! Needs and trends -- goals for the Big Data era: • More data-driven • More dynamic • More process-driven • More collaborative • More accountable • More reproducible • More interactive • More heterogeneous
  8. Genomic Data Management and Processing in the Big Data Era has Unique Challenges! Volume demands scalable batch processing; velocity demands stream processing; variety demands extensible data storage, access and integration.
  9. These challenges push for new tools to tackle them: HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink. Lower levels: storage and scheduling. Higher levels: interactivity.
  10. The stack: COORDINATION AND WORKFLOW MANAGEMENT over DATA INTEGRATION AND PROCESSING over DATA MANAGEMENT AND STORAGE. How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?
  11. Layer 1: Data Management and Storage -- the bottom layer of the coordination / integration / storage stack.
  12. Layer 2: Data Integration and Processing -- HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink, plus application-specific libraries. A sketch of this layer in action follows below.
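  (A minimal sketch of the integration layer in action, not from the slides: a single PySpark aggregation over a hypothetical HDFS dataset; the path and column names are invented for illustration.)

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      # Start a Spark session; in practice this would run on a YARN cluster.
      spark = SparkSession.builder.appName("integration-layer").getOrCreate()

      # Hypothetical per-sample gene read counts stored in HDFS.
      reads = spark.read.csv("hdfs:///data/read_counts.csv",
                             header=True, inferSchema=True)

      # Aggregate read counts per gene across all samples.
      (reads.groupBy("gene")
            .agg(F.sum("count").alias("total_reads"))
            .orderBy(F.desc("total_reads"))
            .show(10))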
  13. Most of the time, more than one analysis needs to take place… and each analysis has multiple steps to integrate!
  14. Pipelining is a way to put the steps together; a toy sketch of the idea follows below. Sources: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform; https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink; https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html; http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
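  (A minimal sketch of pipelining using only the Python standard library; the steps and data are toy stand-ins, not any of the pipelines pictured in the sources above.)

      from functools import reduce

      # Three independent, individually testable steps.
      def acquire(_):       return ["ACGT", "acg-t", "TTGA"]
      def clean(seqs):      return [s.upper().replace("-", "") for s in seqs]
      def gc_content(seqs): return [sum(c in "GC" for c in s) / len(s) for s in seqs]

      def pipeline(*steps):
          """Compose steps left-to-right into a single callable."""
          return lambda x: reduce(lambda acc, step: step(acc), steps, x)

      run = pipeline(acquire, clean, gc_content)
      print(run(None))  # [0.5, 0.5, 0.25]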
  15. Layer 3: Coordination and Workflow Management -- the top layer of the stack.
  16. COORDINATION AND WORKFLOW MANAGEMENT: ACQUIRE → PREPARE → ANALYZE → REPORT → ACT … (kepler-project.org). A rough sketch of this staged dataflow follows below.
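  (Kepler composes workflows from actors connected by data channels. A rough Python analogy, not Kepler's actual API: each stage is a generator that streams records to the next, mirroring the ACQUIRE → PREPARE → ANALYZE → REPORT chain; the records are invented.)

      def acquire():
          for record in ["sample_a,12", "sample_b,7", "sample_c,31"]:
              yield record

      def prepare(stream):
          for record in stream:
              name, value = record.split(",")
              yield name, int(value)

      def analyze(stream):
          for name, value in stream:
              yield name, value > 10   # flag high-count samples

      def report(stream):
          for name, flagged in stream:
              print(f"{name}: {'follow up' if flagged else 'ok'}")

      # Wire the actors together and run the dataflow end to end.
      report(analyze(prepare(acquire())))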
  17. Workflows for Data Science Center of Excellence at SDSC (WorDS.sdsc.edu). Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do. Focus on the question, not the technology! • Access and query data • Scale computational analysis • Increase reuse • Save time, energy and money • Formalize and standardize. Projects: Real-Time Hazards Management (wifire.ucsd.edu); Data-Parallel Bioinformatics (bioKepler.org); Scalable Automated Molecular Dynamics and Drug Discovery (nbcr.ucsd.edu).
  18. bioKepler: A Kepler Module for Bio Big Data Analysis. Data-Parallel Bioinformatics (bioKepler.org)
  19. Example from 2013: Inflammatory Bowel Disease (IBD). Metagenomic sequencing, JCVI-produced (Illumina HiSeq 2000 at JCVI): ~150 billion DNA bases from seven of LS's stool samples over 1.5 years, plus ~3 trillion DNA bases from the NIH Human Microbiome Program database (255 healthy people, 21 with IBD). Supercomputing (W. Li, JCVI/HLI/UCSD) on the SDSC Gordon data supercomputer: ~20 CPU-years on Gordon and ~4 CPU-years on Dell's HPC cloud produced relative abundances of ~10K bacteria, archaea and viruses in ~300 people -- ~3 million filled spreadsheet cells. Source: Larry Smarr, Calit2.
  20. Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler. Candidate platforms: national resources (Gordon, Comet, Stampede, Lonestar), cloud resources, and local cluster resources. Uses existing genomics tools and computing systems! Computing is just one part of it… new methods are needed; a toy sketch of the selection problem follows below.
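  (A toy sketch of the resource-selection problem, not bioKepler's actual optimization method: choose the execution site with the lowest estimated turnaround for a job. Site names echo the slide; all numbers are made up.)

      # Estimated queue wait (hours) and relative node speed per site; invented.
      SITES = {
          "gordon": {"queue_wait_h": 2.0, "node_speedup": 1.0},
          "comet":  {"queue_wait_h": 6.0, "node_speedup": 1.4},
          "cloud":  {"queue_wait_h": 0.1, "node_speedup": 0.8},
          "local":  {"queue_wait_h": 0.0, "node_speedup": 0.5},
      }

      def best_site(job_hours):
          """Pick the site minimizing queue wait plus scaled run time."""
          def turnaround(site):
              s = SITES[site]
              return s["queue_wait_h"] + job_hours / s["node_speedup"]
          return min(SITES, key=turnaround)

      print(best_site(1.0))   # short job: "cloud" (negligible queue wait)
      print(best_site(24.0))  # long job: "comet" (faster nodes amortize the wait)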
  21. Needs of a Dynamic Ecosystem of Genomic Discovery: • Exploratory methods to see temporal changes and patterns in sequence data • Efficient updates to analysis as quickly as new sequence data gets generated • Regular reruns of annotations as reference databases evolve • Integration of genomic data with other types of data, e.g., image, environmental, social graphs • Dynamic ability to check quality and provenance of data and analysis • Transparent support for computing platforms designed for genomic discovery and pattern analysis • Workflow coordination and system integration • People and culture to make it happen collaboratively!
  22. Examples from 2016: Apache Big Data Technologies in Life Sciences. • Lightning-fast genomics with ADAM. Goal: study genetic variations in populations at scale (e.g., 1000 Genomes Project). Technology stack: Apache Avro (data serialization, schema definition), Apache Parquet (compact columnar storage), Apache Spark (distributed parallel processing), Spark MLlib (machine learning, clustering). Source: AMPLab, UC Berkeley (http://bdgenomics.org/). • Compressive structural bioinformatics using MMTF. Goal: 100x+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB). Technology stack: MMTF (Macromolecular Transmission Format, compact storage in Hadoop sequence files), Apache Spark (in-memory, parallel distributed workflows using compressed data), Spark ML (clustering). Source: SDSC, UC San Diego (http://mmtf.rcsb.org/). A hedged sketch of this style of analysis follows below.
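  (A hedged sketch of this style of analysis in PySpark, not ADAM's actual API: read a hypothetical Parquet table of per-sample variant features and cluster samples with MLlib's KMeans. The path and column names are invented.)

      from pyspark.sql import SparkSession
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.clustering import KMeans

      spark = SparkSession.builder.appName("genotype-clustering").getOrCreate()

      # Hypothetical columnar (Parquet) table of per-sample variant summaries.
      df = spark.read.parquet("hdfs:///data/sample_variant_features.parquet")

      # Assemble numeric columns into the feature vector MLlib expects.
      assembler = VectorAssembler(
          inputCols=["het_rate", "hom_alt_rate", "ti_tv_ratio"],  # invented names
          outputCol="features")
      features = assembler.transform(df)

      # Cluster samples into population-like groups.
      model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
      model.transform(features).select("sample_id", "prediction").show()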
  23. NBCR Example: Distilling Medical Image Data for Biomedical Action (nbcr.ucsd.edu). Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data.
  24. Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps. Spatial scales: Å; nm – μm; 0.1 mm – mm; cm. Temporal scales: fs – μs; μs – ms; ms – s; s – lifespan. Levels of organization: molecular & macromolecular, sub-cellular, cell, tissue, organ. Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration.
  25. A challenge: data integration -- bridging across diverse scales of biological organization to understand emergent behavior and the molecular mechanisms underlying biological function & disease.
  26. Integrated Multi-Scale Modeling Toolkits in NBCR: battling complexity while facilitating collaboration and increasing reproducibility. Cyberinfrastructure innovation based on user needs: domain-specific tools, workflows, data and computing infrastructure. Components for multi-scale modeling: a handful of customizable and extensible tools, workflows, user interfaces and publishable research objects (NBCR products, drawing on workflows, scientific tools and past experiments). Capabilities: • UI generation • Logical workflow generation • Uncertainty quantification • Workflow execution • Provenance tracking • System integration. A sketch of one capability, provenance tracking, follows below.
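  (A minimal sketch of the provenance-tracking idea from the list above, using only the Python standard library; real systems such as Kepler record far richer provenance, and the step below is a toy.)

      import functools, json, time, uuid

      PROVENANCE_LOG = []  # a real system would persist this, not keep it in memory

      def track_provenance(func):
          """Record each step's inputs, outputs and timing."""
          @functools.wraps(func)
          def wrapper(*args, **kwargs):
              record = {"id": str(uuid.uuid4()), "step": func.__name__,
                        "inputs": repr((args, kwargs)), "start": time.time()}
              result = func(*args, **kwargs)
              record.update(end=time.time(), output=repr(result))
              PROVENANCE_LOG.append(record)
              return result
          return wrapper

      @track_provenance
      def prepare(values):
          return sorted(values)

      prepare([3, 1, 2])
      print(json.dumps(PROVENANCE_LOG, indent=2))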
  27. Figure: cell-proliferation assays for a cancer cell line carrying the p53-R175H mutant versus cells without p53, in plain medium and across reactivation compounds (Prima-1, stictic acid, and screened candidates such as 35ZWF, 25KKL, 22LSV, 32CTM, 26RQZ, 27WT9, and others). 15 new reactivation compounds identified; reactivation compounds kill cells with the p53 cancer mutant. BENEFITS: • Increase reuse • Reproducibility • Scale execution, problem & solution • Compare methods • Train students
  28. AMBER GPU MD Workbench: Minimization Actor → Equilibration Actor. A sketch of chaining these stages follows below.
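  (A minimal sketch of chaining the two stages outside Kepler, assuming AMBER's pmemd.cuda is on the PATH; the input, topology and coordinate file names are placeholders.)

      import subprocess

      def run_stage(mdin, outfile, coords, restart):
          """Run one pmemd.cuda stage, stopping the chain on failure."""
          subprocess.run(
              ["pmemd.cuda", "-O", "-i", mdin, "-o", outfile,
               "-p", "system.prmtop", "-c", coords, "-r", restart],
              check=True)

      run_stage("min.in", "min.out", "system.inpcrd", "min.rst")    # minimization
      run_stage("equil.in", "equil.out", "min.rst", "equil.rst")    # equilibration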
  29. LEADERSHIP TEAM: Rommie Amaro (PI, UCSD) -- computational chemistry, biophysics; Andrew McCammon (UCSD) -- computational chemistry, biophysics, chemical physics; Mark Ellisman (UCSD) -- molecular & cellular biology; Andrew McCulloch (UCSD) -- bioengineering, biophysics; Michel Sanner (TSRI) -- drug discovery & molecular visualization; Phil Papadopoulos (UCSD/SDSC) -- computer engineering, cyberinfrastructure technology; İlkay Altıntaş (UCSD/SDSC) -- workflows, provenance; Michael Holst (UCSD) -- math, physics; Arthur Olson (TSRI) -- computational chemistry, drug discovery, visualization.
  30. Training at the interface. Challenge: how do we build the next generation of interdisciplinary scientists? Data-to-Structural-Models; Simulation-Based Drug Discovery.
  31. Biomedical Big Data Training Collaboratory (http://biobigdata.ucsd.edu) • The BBDTC website is up and evolving! • The BBDTC contains seven full, open biomedical training courses • A four-course biomedical big data series is planned for Winter 2017
  32. Working with Industry Partners at SDSC
  33. SDSC Provides a Range of Strategies for Engaging with Industry: • Sponsored research agreements • Service agreements for use of systems & consulting • Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies) • Training programs in Data Science & Analytics • Industry Partners Program for “jump starting” collaborations. Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sectors.
  34. Example of Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study. • Janssen was interested in correlating genomic profiles with response to the TNFα inhibitor golimumab • Sequenced 438 patients (full genome) • SDSC assisted with re-alignment and variant calling using new and improved algorithms • The analysis had to finish in a reasonable timeframe (a few weeks). A hedged sketch of such a per-sample pipeline follows below.
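  (The slide does not name the tools SDSC used; as a hedged sketch of a per-sample re-alignment and variant-calling pipeline, the common open-source bwa, samtools and bcftools stand in here, with placeholder paths and a pre-indexed reference assumed.)

      import subprocess

      def call_variants(sample):
          bam = f"{sample}.sorted.bam"
          # Re-align paired-end reads against the reference, then sort.
          align = subprocess.Popen(
              ["bwa", "mem", "ref.fa", f"{sample}_1.fq", f"{sample}_2.fq"],
              stdout=subprocess.PIPE)
          subprocess.run(["samtools", "sort", "-o", bam, "-"],
                         stdin=align.stdout, check=True)
          align.wait()
          # Call variants from the sorted alignments.
          pileup = subprocess.Popen(
              ["bcftools", "mpileup", "-f", "ref.fa", bam],
              stdout=subprocess.PIPE)
          subprocess.run(["bcftools", "call", "-mv", "-o", f"{sample}.vcf"],
                         stdin=pileup.stdout, check=True)
          pileup.wait()

      call_variants("patient_001")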
  35. Questions? İlkay Altıntaş, Ph.D. Email: ialtintas@ucsd.edu
