
ASU


Published in: Technology

  1. 1. Agenda: • Research Computing @ Arizona State University • Program, Vision and Mission • Emphasis on Open Source • Evolution in Genomic Analysis (HPC > MRv2 > Spark) J.A. Etchings RC@ASU Innovation
  2. 2. Arizona State University has become the foundational model for the “New American University”, a new paradigm for the public research university that transforms higher education. ASU is committed to Excellence, Access and Impact in everything that it does.
  3. 3. Open-source Data Driven Infrastructure (Google technology → open-source counterpart: function) • GFS → HDFS: distributed file system • MapReduce → MapReduce: batch distributed data processing • Bigtable → HBase: distributed DB/key-value store • Protobuf/Stubby → Thrift & Avro: data serialization/RPC • Pregel → Giraph: distributed graph processing • Dremel/F1 → Impala: scalable interactive SQL (MPP) • FlumeJava → Crunch: abstracted data pipelines on Hadoop • In memory → Spark: in-memory computation (data intensive)
  4. 4. TransCORE Framework Knowledge Engine Context Ontologies Data Elements Information Models Middleware Transact Clinical Research Life Science Research Qualitative Research Analytic In-Memory Analysis Genomic Data Machine Learning Meta-Data Management Data Resources Open Big Data File System Relational Key/Value HPC Parallel HPC SMA Transactional Data Reservoir Big Data Scratch Space Internet 2 / SDN Connectivity
  5. 5. The entire human genome of a single man: 3 billion letters, 262,000 printed pages, 3.3 GB (@rikisabatini, #TED2016)
  6. 6. Clarification & Limitations: • Yes, we can sequence a genome for $1000 – Unfortunately, this does not include analysis • The haploid genome has about 3 billion base pairs, so a diploid genome carries about 6 billion – Half come from mom and half from dad, and assembling those haplotypes, especially determining which SNPs are on the same haplotype, is going to be instrumental in future medical advances • Other limitations: – Batch effects (in physical sequencing, in sequencing technology) – Different software, different versions of software, and different infrastructure (Standardization Gap) – Batch effects can significantly impede variant discovery (false positives are high)
  7. 7. “NEED TO FOCUS NOT ON BIG DATA, BUT BIG ANSWERS” Harper Reed – CTO Obama for America 2012
  8. 8. Tumors are not composed of identical cells: there is likely extreme intratumor heterogeneity • Macro heterogeneity: > 10% frequency in the tumor • Micro heterogeneity: < 10% frequency in the tumor
  9. 9. Use simulations to ask: • What are the population dynamics of cancer cell populations? • What is the role of genetic drift in cancer initiation and progression? • What is the extent of subclonal variation within a tumor at the time of diagnosis? • Are resistant subclones present in a tumor before the start of therapy?
  10. 10. Model parameters and their values • Probability of division, b_n, which depends on the fitness of each cell • Mean selection coefficient, 𝑠, used to generate the exponential distribution of selection coefficients: 𝑠 = [0.1; 0.01; 0.005] • Average driver mutation rate per cell division, 𝑢: 𝑢 = [10^-8; 10^-7; 10^-6; 10^-5] • Generation time: average division time = 4 days* *S. Jones et al., Comparative lesion sequencing provides insights into tumor evolution. PNAS (2008)
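The parameter values on this slide can be written out as an explicit grid. The following is a minimal sketch, assuming that the exponential distribution is parameterized by its mean and that one generation corresponds to one average division time of 4 days. The names `MEAN_S`, `DRIVER_RATES`, `parameter_grid`, `draw_selection_coefficient`, and `generations_to_years` are illustrative and do not come from the authors' code.

```python
import itertools
import random

MEAN_S = [0.1, 0.01, 0.005]               # mean selection coefficients (slide 10)
DRIVER_RATES = [1e-8, 1e-7, 1e-6, 1e-5]   # driver mutation rates per cell division
DIVISION_TIME_DAYS = 4                    # average division time (Jones et al. 2008)


def parameter_grid():
    """Return every (mean s, u) combination: 3 x 4 = 12 parameter sets."""
    return list(itertools.product(MEAN_S, DRIVER_RATES))


def draw_selection_coefficient(mean_s, rng):
    """Draw one selection coefficient from an exponential with the given mean.

    random.expovariate takes the rate (1 / mean), not the mean itself.
    """
    return rng.expovariate(1.0 / mean_s)


def generations_to_years(generations):
    """Convert a number of generations to years, assuming 4 days per generation."""
    return generations * DIVISION_TIME_DAYS / 365.0


if __name__ == "__main__":
    rng = random.Random(0)
    for mean_s, u in parameter_grid():
        print(mean_s, u, draw_selection_coefficient(mean_s, rng))
```

Under these assumptions, one year corresponds to roughly 91 generations, which is how a generation count from a simulation would map onto a "time to detection" in years.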
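  11. 11. The model: A branching evolutionary process. The process starts in a single cell with one driver mutation. At each generation a cell undergoes: • Death, with probability 1 − b_n, OR • Division, with probability (1 − u) b_n, OR • Division + driver mutation, with probability u b_n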
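The three outcomes above (death, division, division plus a driver mutation) can be sketched as a small per-cell simulation. This is an illustrative sketch and not the authors' MATLAB or Scala code. The slide does not say how b_n depends on fitness, so the rule used here is an assumption: each driver multiplies the death probability by (1 − s), starting from 0.5. It is also an assumption that a mutating division yields one daughter of the parent clone and one daughter founding a new clone. The function name `simulate_tumor` and its parameters are hypothetical.

```python
import random


def simulate_tumor(mean_s, u, detect_size, max_generations, rng):
    """Run one realization of the branching process.

    Returns (reached_detection, generations_elapsed, clone_sizes).
    """

    def add_driver(death_prob):
        # ASSUMPTION (not on the slide): a driver with selection coefficient s
        # multiplies the death probability by (1 - s); s is capped at 1.
        s = min(rng.expovariate(1.0 / mean_s), 1.0)
        return death_prob * (1.0 - s)

    # Each clone is [death probability, number of cells]. The process starts
    # in a single cell that already carries one driver mutation.
    clones = [[add_driver(0.5), 1]]

    for generation in range(1, max_generations + 1):
        new_clones = []
        for clone in clones:
            death_prob, size = clone
            b = 1.0 - death_prob                  # division probability b_n
            next_size = 0
            for _ in range(size):
                r = rng.random()
                if r < 1.0 - b:                   # death: 1 - b_n
                    continue
                elif r < (1.0 - b) + (1.0 - u) * b:   # division: (1 - u) b_n
                    next_size += 2
                else:                             # division + driver: u b_n
                    next_size += 1                # one daughter stays in the clone
                    new_clones.append([add_driver(death_prob), 1])
            clone[1] = next_size

        clones = [c for c in clones + new_clones if c[1] > 0]
        total = sum(c[1] for c in clones)
        if total == 0:
            return False, generation, []          # lineage died out
        if total >= detect_size:
            sizes = sorted((c[1] for c in clones), reverse=True)
            return True, generation, sizes

    sizes = sorted((c[1] for c in clones), reverse=True)
    return False, max_generations, sizes


if __name__ == "__main__":
    rng = random.Random(1)
    runs = [simulate_tumor(0.1, 1e-5, 100, 2000, rng) for _ in range(100)]
    print(sum(1 for reached, _, _ in runs if reached), "of 100 reached 100 cells")
```

The detection threshold in the usage example is deliberately tiny (100 cells): a per-cell Python loop cannot reach 10^9 cells, which is the scaling problem the later slides describe.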
  12. 12. The model: A branching evolutionary process [Timeline: driver mutation arises; years later, a clone develops; years later, neoplastic progression starts]
  13. 13. ≈ 98% of starting mutant clones die out early
      Mean selection coefficient | Driver mutation rate per cell division | Number of realizations | Number of realizations that reached 10^9 cells | Percentage of realizations that reached 10^9 cells (%) | Average time to detection (years)
      0.1 | 10^-8 | 10155 | 162 | 1.6% | 17.50
      0.1 | 10^-7 | 1948 | 112 | 5.7% | 5.21
      0.1 | 10^-6 | 748 | 134 | 17.9% | 1.74
      0.1 | 10^-5 | 748 | 111 | 14.8% | 1.62
      0.01 | 10^-8 | 6867 | 125 | 1.8% | 19.80
      0.01 | 10^-7 | 6866 | 113 | 1.6% | 15.41
      0.01 | 10^-6 | 6866 | 120 | 1.7% | 13.85
      0.01 | 10^-5 | 6865 | 115 | 1.7% | 11.16
      0.005 | 10^-8 | 11951 | 102 | 0.9% | 27.97
      0.005 | 10^-7 | 11751 | 112 | 1.0% | 27.91
      0.005 | 10^-6 | 11750 | 126 | 1.1% | 22.43
      0.005 | 10^-5 | 11750 | 100 | 0.9% | 18.28
      completed | | 88265 | 1432 | 1.6% |
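The percentages and totals in this table can be rechecked with a few lines of arithmetic. This is a small sketch, with the counts copied from the slide. The driver mutation rate attached to each row is an assumption: the rows are taken to be in increasing order of rate within each selection coefficient, which agrees with the panel sample sizes on the "Standard Output" slide. The names `ROWS` and `percent_reached` are illustrative.

```python
# (mean selection coefficient, assumed driver mutation rate,
#  number of realizations, realizations that reached 10^9 cells)
ROWS = [
    (0.1, 1e-8, 10155, 162),
    (0.1, 1e-7, 1948, 112),
    (0.1, 1e-6, 748, 134),
    (0.1, 1e-5, 748, 111),
    (0.01, 1e-8, 6867, 125),
    (0.01, 1e-7, 6866, 113),
    (0.01, 1e-6, 6866, 120),
    (0.01, 1e-5, 6865, 115),
    (0.005, 1e-8, 11951, 102),
    (0.005, 1e-7, 11751, 112),
    (0.005, 1e-6, 11750, 126),
    (0.005, 1e-5, 11750, 100),
]


def percent_reached(realizations, reached):
    """Percentage of realizations that grew to the detection size."""
    return 100.0 * reached / realizations


total_realizations = sum(row[2] for row in ROWS)
total_reached = sum(row[3] for row in ROWS)
overall_percent = percent_reached(total_realizations, total_reached)

if __name__ == "__main__":
    for s, u, n, reached in ROWS:
        print(f"s={s}, u={u:g}: {percent_reached(n, reached):.1f}% reached 10^9 cells")
    print(f"overall: {total_reached} of {total_realizations} "
          f"({overall_percent:.1f}%); {100 - overall_percent:.1f}% died out")
```

The overall figure of about 1.6% reaching detection is where the slide's headline of roughly 98% of starting clones dying out comes from.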
  14. 14. Some tumors develop very quickly (mimics childhood cancers) (same results table as slide 13)
  15. 15. Some tumors take decades to develop (mimics many adult cancers, like melanoma) (same results table as slide 13)
  16. 16. Computationally Intensive • Running until 10^9 cells was not efficient on a laptop • Most tumors die out before reaching a detectable size • Need to (massively) reduce run time while tracking all mutations and subclone sizes
  17. 17. eQTL analysis generates trillions of hypothesis tests • 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values • Tested below on 120 billion associations Example queries: • “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)” – Finishes in several seconds • “Find all cis-eQTLs across the entire genome” – Finishes in a couple of minutes – Limited by disk throughput
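The test count and the first example query can be illustrated on a toy scale. This is a hedged sketch in plain Python rather than the Spark job used in the talk. The record layout `Association`, the function names `hypothesis_count` and `top_eqtls`, and the sample data are all hypothetical.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Association(NamedTuple):
    """One eQTL test result (hypothetical record layout)."""
    snp: str
    gene: str
    tissue: str
    p_value: float


def hypothesis_count(loci: int, phenotypes: int, tissues: int) -> int:
    """Number of p-values when every locus is tested against every phenotype in every tissue."""
    return loci * phenotypes * tissues


def top_eqtls(associations: Iterable[Association],
              genes_of_interest: Iterable[str],
              k: int = 20) -> List[Association]:
    """Return the k most significant (smallest p-value) associations for the given genes."""
    wanted = set(genes_of_interest)
    candidates = (a for a in associations if a.gene in wanted)
    # nsmallest keeps only k items in memory while streaming over the input.
    return heapq.nsmallest(k, candidates, key=lambda a: a.p_value)


if __name__ == "__main__":
    print(hypothesis_count(10**7, 10**4, 10))  # 10^7 loci x 10^4 phenotypes x 10 tissues
    sample = [
        Association("rs1", "GENE_A", "liver", 1e-9),
        Association("rs2", "GENE_B", "liver", 0.2),
        Association("rs3", "GENE_A", "brain", 1e-4),
        Association("rs4", "GENE_C", "brain", 1e-12),
    ]
    for hit in top_eqtls(sample, ["GENE_A", "GENE_C"], k=2):
        print(hit)
```

Taking "10s of tissues" as exactly 10 reproduces the 10^12 figure on the slide. The filter-then-select pattern is the same shape a distributed version would take, with the filter pushed down to each partition.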
  18. 18. [Figure: bar chart of time taken in minutes (axis 0 to 1000) against number of cores (5, 10, 15), with eQTL-Cases and eQTL-Controls bars at each core count. Legend: Map Reduce, HPC, Apache Spark; additional labels: Cloudera, Hortonworks, MapR. Data labels in extraction order: 862, 306, 473, 168, 404, 138; 776, 308, 474, 166, 387, 136; 700, 192, 332, 125, 240, 119]
  19. 19. • Took a day to grow a tumor to 10^7 cells (still 2 orders of magnitude too small) • Converted code from MATLAB to Scala (Spark) • Now takes seconds to simulate a single tumor • Ability to generate tens of thousands of possible tumors and thousands of measurable tumors, and to observe their dynamics
  20. 20. Standard Output [Figure: 12 density plots of subclone size (number of cells, log axis from 10^2 to 10^10), one panel per parameter combination 𝑠 = 0.1, 0.01, 0.005 × μd = 10^-8, 10^-7, 10^-6, 10^-5. Panel sample sizes as extracted: N = 162, 112, 134, 111, 125, 113, 120, 115, 102, 112, 126, 100]
  21. 21. Standard Output [Figure: 6 density plots of resistant subclone size (number of cells, log axis from 10^2 to 10^10), one panel per parameter combination 𝑠 = 0.1, 0.01, 0.005 × μd = 10^-6, 10^-5. Panel sample sizes as extracted: N = 111, 115, 100, 134, 120, 126]
  22. 22. Output to Tableau [Figure: subclonal composition with labels 41% 1 driver mutation, 10% 2 driver mutations, 19% 2 driver mutations]
  23. 23. Minor subclones that harbor mutations resistant to treatment can result in relapse. Response to vemurafenib (a BRAF V600E inhibitor) [Figure panels: 4 months on drug, 6 months on drug]. N. Wagle et al., Journal of Clinical Oncology (2011)
  24. 24. Subclonal variation of simulated tumor-1 at diagnosis. Parameters: 𝑠 = 0.005, u = 10^-5 per cell division, and mean division time = 4 days. [Figure panels: population dynamics of cancer cells (number of cells vs. time in years); subclonal composition (17% 1 driver mutation, 80% 2 driver mutations)]. Subclone with a resistance mutation: N = 2,682 cells (resistant mutation rate value missing)
  25. 25. Subclonal variation of simulated tumor-2 at diagnosis. Parameters: 𝑠 = 0.01, u = 10^-5 per cell division, and mean division time = 4 days. [Figure panels: population dynamics of cancer cells (number of cells vs. time in years); subclonal composition (41% 1 driver mutation, 19% 2 driver mutations, 10% 2 driver mutations)]. Subclone with a resistance mutation: N = 224,502 cells; resistant mutation rate = 10^-8
  26. 26. Conclusions: • These results constitute an argument for the development and application of more sensitive technologies for the detection of rare pre-existing subclones that might plant the seeds for rapid clinical relapse. • Based on the predicted extent of standing subclonal variation, drug-resistant subclones are almost certain to exist before the initiation of treatment. • Greater subclonal diversity in a tumor may predict a higher likelihood of pre-existing resistance to any conceivable targeted therapy. • Subclonal diversity itself may be a marker of the potential to evolve drug resistance, and therefore may be an important prognostic indicator. • Reducing the time to research output with Apache Spark increases the success probability of targeted therapies.
  27. 27. The extent of subclonal variation is predicted by number of distinct dominant clones. Diego Chowell (a,b), James Napier (c), Rohan Gupta (c), Karen S. Anderson (b,d), Carlo C. Maley (b,d,f,1), and Melissa A. Wilson Sayres (b,d,e,1). (a) Mathematical, Computational and Modeling Sciences Center, (b) Biodesign Institute, (c) Research Computing Center, (d) School of Life Sciences, (e) Center for Evolution and Medicine, Arizona State University, Tempe, Arizona 85281, USA; (f) Center for Evolution and Cancer, University of California San Francisco, San Francisco, California 94158, USA. (1) To whom correspondence may be addressed. E-mail: maley@asu.edu or melissa.wilsonsayres@asu.edu (wilsonsayreslab.org | @mwilsonsayres)
