How novel compute technology transforms life science research

Unprecedented data volumes and pressure on turnaround time, driven by commercial applications, require bioinformatics solutions to evolve to meet these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solutions capable of meeting these demands: VariantSpark for genomic variant analysis and GT-Scan2 for genome engineering applications.
VariantSpark classifies 3000 individuals with 80 million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning on genomic data is hence capable of scaling up to population-size cohorts.
GT-Scan2 identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources to keep runtime stable even for queries with several thousand potential target sites.
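
The fan-out pattern described above can be sketched locally. The example below is a minimal illustration, not GT-Scan2's actual method: it finds candidate target sites on the forward strand using the common SpCas9 convention (a 20-nt protospacer followed by an NGG PAM) and scores each site as an independent task, with worker threads standing in for per-site Lambda invocations. The function names and the GC-fraction score are hypothetical placeholders.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# 20-nt protospacer followed by an NGG PAM (SpCas9 convention).
# The lookahead allows overlapping candidates to be reported.
SITE = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")


def find_candidate_sites(sequence):
    """Return (position, protospacer) pairs on the forward strand only."""
    return [(m.start(), m.group(1)) for m in SITE.finditer(sequence.upper())]


def placeholder_score(site):
    """Hypothetical stand-in score: GC fraction of the protospacer."""
    position, protospacer = site
    gc = sum(base in "GC" for base in protospacer) / len(protospacer)
    return position, protospacer, gc


def score_sites(sequence, workers=8):
    """Score every candidate independently, one task per site."""
    sites = find_candidate_sites(sequence)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(placeholder_score, sites))
```

Because each site is scored without reference to any other, the work splits cleanly into many small tasks, which is the property that makes a function-per-task service a plausible fit for queries with thousands of sites.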

Published in: Science

  1. 1. How novel compute technology transforms life science research: From Hadoop Spark to cloud-based micro-services. HEALTH & BIOSECURITY. Dr Denis Bauer | Bioinformatics | @allPowerde. 6 Dec 2016 – Cloudera Public Sector Government Forum, Canberra
  2. 2. Overview Transformational Bioinformatics | Denis C. Bauer | @allPowerde • GT-Scan2: How can genome engineering be made safer? • VariantSpark: How to find disease genes in population-size cohorts? • CSIRO: How to facilitate better collaborations?
  3. 3. Team CSIRO • 5319 talented staff • $1 billion+ budget • Working with 2800+ industry partners • 55 sites across Australia • Top 1% of global research agencies • Each year 6 CSIRO technologies contribute $5 billion to the economy
  4. 4. Big ideas start here: extended wear contacts, polymer banknotes, Relenza flu treatment, fast WLAN (Wireless Local Area Network), Aerogard, Total Wellbeing Diet, RAFT polymerisation, BARLEYmax™, self-twisting yarn, Softly washing liquid, Hendra vaccine, Novacq™ prawn feed. Convenient cardiac rehabilitation • Enhancing relationship between patient and mentor • Digital data collection • Equitable access. World's first clinically validated smartphone-based cardiac rehab: uptake +30% and completion +70%
  5. 5. Preparation for and recovery from a Total Knee Replacement • Remote monitoring by clinician • Physiotherapy • Wearable technology • Gamification
  6. 6. Genomic sequencing is revolutionizing health care today. It offers up to 50% more diagnoses than standard of care and is on average 96% cheaper (Bauer et al. Trends Mol Med. 2014, PMID: 24801560).
  7. 7. Advances in sequencing technology have generated the capacity to sequence the Earth’s Genome in just 10 days. The human genome is 3 billion letters long: need 3 billion samples to robustly analyze.
  8. 8. Genomics projects hence are getting bigger • Human genome: ~1 sample • The HapMap Project: 270 samples (2002) • 1000 Genomes Project: 1097 samples (2012) • The Cancer Genome Atlas: 11,000 samples (2015) • 100,000 Genomes Project: 70,000 individuals by 2017 • ASPREE: 4000 healthy 70+ year olds • Project MinE: 15,000 people with ALS. Single samples are around 200GB in size.
  9. 9. New demands on sequence analysis • The sheer volume of new data necessitates new approaches. Computational genomics must progress from file formats to APIs, from local hardware to the elasticity of the cloud, from a cottage industry of poorly maintained academic software to professional-grade, scalable code, and from one-time evaluation by publication to continuous evaluation by online benchmarks. (Paten et al. The NIH BD2K center for big data in translational genomics. JAMIA 2015)
  10. 10. Elasticity in the Cloud (1): Elastic cloud compute… is like an in-room sound system. Benefits: • Instant availability of an adequately powered system • Images can be shared and everything on them is automatically version controlled
  11. 11. Efficient scalability (2) • Bespoke parallelization, e.g. Churchill • Chromosomal split, e.g. NGSANE • MapReduce, e.g. GATK queue. Kelly et al. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome Biology 2015. Beunder 2010
  12. 12. Population-scale genomic data analysis requires BigData solutions

      | | Desktop compute | High-performance compute cluster | Hadoop/Spark compute cluster |
      |---|---|---|---|
      | Focus | small data | Compute-intensive | Data-intensive |
      | Fault tolerant | No | No | Yes |
      | Node-bound | Yes | Yes | No |
      | Parallelization | 10 CPU | 100 CPU | 1000 CPU |
      | Parallelization procedure | bespoke | bespoke | standardized |

      CSIRO solution
  13. 13. Spark Summit 2016 (June), Frank Austin Nothaft (UC Berkeley): 70TB, 300 individuals. One human genome analyzed (variant called) every 3.2 hours.
  14. 14. Still not fast enough… Clinical genomics facilities expect to deal with >18,000 genomes a year, so a 3.2h turnaround time (TAT) per genome would accumulate roughly 6.5 years of compute. CSIRO, along with other prominent research institutes (MIT, Berkeley), partnered with Cloudera and AWS to investigate • HPC-based solutions • GATKspark (the Spark reimplementation of the accepted gold standard) • ADAM
  15. 15. Setup • Instances – 5 workers – 3 Hadoop schedulers – one Cloudera manager • Why we chose to go with a Cloudera solution – Set-up and deployment are automated, e.g. no manual IP-address matching – No need for admin support, e.g. preconfigured – Set-up is portable to other providers and on-premise
  16. 16. All humans carry between 200 and 800 mutations that disrupt the function of a gene. Which needle is the right one? http://science.sciencemag.org/content/335/6070/823.full https://waynealliance.wordpress.com/2010/06/02/all-needles-no-hay/
  17. 17. [Figure: time in seconds per method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary-conversion, clustering, pre-processing).] VariantSpark can classify 3000 individuals and 80 million variants in under 30 minutes. BMC Genomics 2015, 16:1052, PMID: 26651996 (IF=4)
  18. 18. Setup • Collaboration between CSIRO, NCI and the John Curtin School of Medical Research (JCSMR) • Reuse of the AWS cluster setup on the NCI on-premise cluster – Cluster built in a joint effort by the CSIRO Hadoop administrator and local Cloudera staff – VariantSpark deployed and running within only 3 days • Demonstration of the lower risk for organisations with a proof of concept
  19. 19. Application cases for a VariantSpark cluster • NHMRC Dementia Research Teams Grant led by Ian Blair (MQ): Developing insight into the molecular origins of familial and sporadic frontotemporal dementia and amyotrophic lateral sclerosis. Affected: 900 WGS; normal: 1400 WGS. Identify causative mutations; cluster individuals on disease progression. • Kidney disease, Simon Foote (JCSMR): Uncover genetic cause of early onset kidney failure.
  20. 20. Genome engineering is currently being developed for medical treatments in humans, such as for cancer, blindness and HIV. However, the molecular technology, CRISPR, is not 100% efficient. Aim: Develop a computational guidance framework to enable edits that work the first time, every time.
  21. 21. Achieving the first time, every time 1. Better understanding of the science 2. Higher powered computational tools • Super-computing-scale analysis • Interactive real time analysis (query style research). GT-Scan2: ranked choices
  22. 22. Better Science • We tested GT-Scan2.0 against two publicly available models: • sgRNAscorer (Chari et al 2015, Nature Methods) • WU-CRISPR (Wong et al 2015, Genome Biology) • Tested on 2 independent datasets (>4000 sgRNAs) • Our chromatin-aware model consistently outperformed the other models. [Figure: precision/recall curves and area under the precision/recall curve, Validation Set 1.]
  23. 23. Higher powered instantaneous compute

      | | Desktop compute | High-performance compute | Hadoop/Spark | Microservices |
      |---|---|---|---|---|
      | Focus | small data | Compute-intensive | Data-intensive | Agility |
      | Fault tolerant | No | No | Yes | (Yes) |
      | Node-bound | Yes | Yes | No | No |
      | Parallelization | 10 CPU | 100 CPU | 1000 CPU | 1000 CPU |
      | Parallelization procedure | bespoke | bespoke | standardized | standardized |
      | Overhead in the cloud | NA | spin-up lag | spin-up lag | instantaneously |

      CSIRO solution
  24. 24. International Recognition
  25. 25. Implementation • GT-Scan2.0 is implemented as an AWS Lambda function • Server-less function: does not require users to have high compute power • Scalable: can be easily scaled to whole genome analysis • We also intend to implement a “stand-alone” version • Can be run on local servers • Can incorporate your own ChIP-seq data rather than public data
  26. 26. On-demand instances vs Lambda

      | | Pro | Con |
      |---|---|---|
      | Lambda | Instantaneously available | Rel. small processing power |
      | Spark-cluster | Unlimited processing power | Spin-up time |

      There is a sweet spot beyond which a large number of “nimble” small processors performs worse than a powerful cluster with overhead, especially as spin-up overhead is reduced by managers like Cloudera Director.
  27. 27. Three things to remember • Large volumes of detailed data? VariantSpark, bringing bigLearning to genomics, can classify 3000 individuals and 80 million variants in under 30 minutes using Spark • Parallelizable tasks needing persistent cloud availability? GT-Scan2, computationally guiding genome engineering, uses chromatin information and the latest in cloud compute to improve CRISPR target site identification • CSIRO specializes in using the latest advances in compute technology to push the boundary on bioinformatics problems
  28. 28. Acknowledgements. Transformational Bioinformatics Team and Collaborators: Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson, Adrian White, Mia Champion, Gaetan Burgio, David Levy, Ian Blair, Kelly Williams, Dan Andrews. News, Software, Open Position
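
The turnaround estimate on slide 14 can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and continuous operation; the function names are illustrative and not from the talk. It computes the serial compute time for 18,000 genomes at 3.2 hours each, and the number of such pipelines that would have to run in parallel to keep pace with one year of intake.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, no downtime


def serial_compute_years(genomes_per_year, hours_per_genome):
    """Years of compute if genomes are processed one after another."""
    return genomes_per_year * hours_per_genome / HOURS_PER_YEAR


def pipelines_needed(genomes_per_year, hours_per_genome):
    """Parallel pipelines required to finish a year's intake within a year."""
    return math.ceil(genomes_per_year * hours_per_genome / HOURS_PER_YEAR)


years = serial_compute_years(18_000, 3.2)   # 57,600 h, about 6.6 years
pipelines = pipelines_needed(18_000, 3.2)   # 7
print(f"{years:.2f} years serial, {pipelines} pipelines in parallel")
```

The result of about 6.6 years is consistent with the slide's rounded figure of 6.5 years.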
