
How AI and ‘Big Learning’ transforms life science research - AI Dev Days Closing Keynote - Dr. Denis Bauer

This presentation was given by Dr. Denis Bauer as the closing keynote at the AI Dev Days conference on 9 March 2018 in Bangalore. URL: www.aidevdays.com
--
The talk showcases how genomic research has leapfrogged to the forefront of BigData and Cloud solutions. I outline how Apache Spark helps identify genomic associations in population-scale whole genome sequencing data, as well as how the accuracy of genome editing approaches can be improved with massively parallel serverless cloud functions.

Published in: Software

  1. How novel compute technology transforms life science research. HEALTH AND BIOSECURITY. Denis Bauer, PhD. 9 Mar 2018, AI Dev Day, India 2018
  2. Bioinformatics | Denis C. Bauer | @allPowerde. Overview. GT-Scan2: How can genome engineering be made more effective? VariantSpark: How to find disease genes in population-size cohorts? CSIRO: How to facilitate better collaborations?
  3. Source: PwC
  4. Team CSIRO: 5319 talented staff; $1 billion+ budget; working with over 2800 industry partners; 55 sites across Australia; top 1% of global research agencies; each year 6 CSIRO technologies contribute $5 billion to the economy
  5. Big ideas start here: EXTENDED WEAR CONTACTS; POLYMER BANKNOTES; RELENZA FLU TREATMENT; Fast WLAN (Wireless Local Area Network); AEROGARD; TOTAL WELLBEING DIET; RAFT POLYMERISATION; BARLEYmax™; SELF TWISTING YARN; SOFTLY WASHING LIQUID; HENDRA VACCINE; NOVACQ™ PRAWN FEED
  6. Convenient cardiac rehabilitation: enhancing relationship between patient and mentor; digital data collection; equitable access. World's first clinically validated smartphone-based cardiac rehab: uptake +30% and completion +70%
  7. Overview. GT-Scan2: How can genome engineering be made more effective? VariantSpark: How to find disease genes in population-size cohorts? CSIRO: How to facilitate better collaborations?
  8. By 2025 it is estimated that 50% of the world population will have been sequenced. Source: Frost & Sullivan
  9. Genomics affects looks, disease risk, and behavior. Source: wisegeek
  10. Genomics will outpace other BigData disciplines: Astronomy, Twitter, YouTube, Genomics. Stephens et al., PLOS Biology 2015
  11. Machine learning on 1.7 trillion data points. https://www.projectmine.com/about/ BigData Focus
  12. Finding the disease gene(s) • Spot the variant that is common amongst all affected but absent in all unaffected* (* oversimplified). Figure labels: cases, controls, Gene1, Gene2
  13. Complex diseases are driven by multiple genes • However, individual strong contributors are rare… Need a more sophisticated ML approach, such as Random Forest. Figure labels: cases, controls
  14. Machine learning on 1.7 trillion data points. Figure labels: 80 million features; individuals; genomic profile; disease status; 22,500 samples; disease genes
  15. Population-scale genomic data analysis requires BigData solutions. Comparison of desktop compute / high-performance compute cluster / Hadoop/Spark compute cluster. Focus: small data / compute-intensive / data-intensive. Fault tolerant: no / no / yes. Node-bound: yes / yes / no. Parallelization: 10 CPU / 100 CPU / 1000 CPU. Parallelization procedure: bespoke / bespoke / standardized. CSIRO solution
  16. BMC Genomics 2015, 16:1052, PMID: 26651996 (IF=4). In the top 5% of all research outputs scored by Altmetric. Spark Core; Spark ML; MLlib; Variant Spark; RESEARCH
  17. Parallelization of Random forest to scale with samples • Spark ML’s RF was designed for ‘Big’ low-dimensional data. • The full genome-wide profile does not fit into the executors’ memory, rendering the approach infeasible. “Cursed” BigData, e.g. genomics: a moderate number of samples with many features; the feature set is too large to be handled by a single executor
  18. Parallelization of Random forest to scale with features. Flip the matrix: partition by column (take 2). Firas Abuzaid (Spark Summit 2016), YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
  19. Wide Random-forest scalable with features and samples
  20. Runtime comparison. [Plot: accuracy (out-of-bag error) and speed (training variants per second, million) for VariantSpark, SparkML, Randomforest (R), ranger (R), ranger (C++), H2O]
  21. Machine learning on ‘wide’ data. Figure labels: 80 million features; individuals; genomic profile; disease status; 22,500 samples; disease genes. Time series, concatenated data, sensor data, log files; churn rate, occurrence of failure, attacks; predictive markers
  22. Workflow patterns. Business Problem. Curate Data • Cleaning • Visualization. Build MVP • Scope Tech • Prototypes • Iterate. Prep for Production • Test at scale • Provide endpoint. Execution Options: on premises, Databricks, AWS SageMaker
  23. VariantSpark Demo: https://docs.databricks.com/spark/latest/training/variant-spark.html
  24. Building a community • Build a robust ‘wide’ Machine Learning library for business • Research and potentially cure genetic diseases along the way • Can you help? YES! Lynn Langit. Performance comparison. Python API
  25. Overview. GT-Scan2: How can genome engineering be made more effective? VariantSpark: How to find disease genes in population-size cohorts? CSIRO: How to facilitate better collaborations?
  26. Genome editing can correct genetic diseases, such as hypertrophic cardiomyopathy (Ma et al., Nature 2017*). However, editing does not work every time, e.g. only 7 in 10 embryos were mutation free. Aim: develop a computational guidance framework to enable edits the first time, every time. * Some controversy around the paper
  27. Improve performance. SPEED • Each search can be broken down into parallel tasks, each of which takes seconds. SCALE • Researchers might want to search the target for one gene or 100,000. Agility + Scalability =
  28. One of the first Serverless Applications in Research. Featured in; Used by
  29. Serverless systems are hard to optimize
  30. GTScan2 X-Ray Analysis. [Two plots: runtime (s) per function, by type (base, old; then base, new, old). Functions: getFastaSequence, createJob, targetScan, offtargetScanStarter, offtargetSearch, targetIntersects, targetTranscriptionIntersects, targetWuScorer, targetSgRNAScorer, tuscanRegressor (second plot only), OnTargetScorer, genomeCRISPR]
  31. Results: 4x Faster (80% improvement). [Plot: runtime (s) of old vs new implementation for onTargetScoring and base; annotations: 2 min, 30 sec]
  32. Using hypothesis-driven architecture: Architecture as text; Evolve; Automatic performance measure; Evaluate. James Lewis. https://www.epsagon.com/ 10 DevOps DevOps changes, enabling a faster evidence-based pace of architecture development
  33. Open source tools
  34. Three things to remember • ‘Datafication’ will make data ‘wider’ -> paradigm shift for Machine Learning applications • Serverless architecture can cater for compute-intensive tasks • Business and life-science research are not that different: let’s build a community together!
  35. Transformational Bioinformatics. Section labels: Team; Collaborators; News; Software. Names: Denis Bauer, PhD; Oscar Luo, PhD; Rob Dunne, PhD; Piotr Szul; Aidan O’Brien; Laurence Wilson, PhD; Kaitao Lai, PhD; Arash Bayat; Lynn Langit; Natalie Twine, PhD. Top 10 Australian IT stories of 2017
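
Slide 12 describes the (explicitly oversimplified) idea of spotting a variant carried by all affected individuals and by no unaffected ones. A minimal Python sketch of that filter follows; the variant names and the 0/1 carrier matrix are made-up toy data, not taken from the talk.

```python
from typing import Dict, List


def candidate_variants(genotypes: Dict[str, List[int]], is_case: List[bool]) -> List[str]:
    """Return variants carried by every case and by no control."""
    hits = []
    for variant, carriers in genotypes.items():
        in_all_cases = all(c == 1 for c, case in zip(carriers, is_case) if case)
        in_no_controls = all(c == 0 for c, case in zip(carriers, is_case) if not case)
        if in_all_cases and in_no_controls:
            hits.append(variant)
    return hits


# Toy data (hypothetical): 1 = individual carries the variant, 0 = does not.
is_case = [True, True, True, False, False]
genotypes = {
    "gene1_variant": [1, 1, 1, 0, 0],  # present in all cases, absent in controls
    "gene2_variant": [1, 0, 1, 1, 0],  # mixed, so not a candidate
}
print(candidate_variants(genotypes, is_case))  # ['gene1_variant']
```

As slide 13 notes, complex diseases rarely have a single variant this clean, which is why the talk moves on to Random Forest.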
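
Slides 17 to 19 argue for partitioning the data by column (feature) rather than by row (sample) when there are far more features than samples. The sketch below illustrates the general idea only and is not VariantSpark's or Yggdrasil's implementation: each partition holds a few whole feature columns, finds its own best split, and only the per-partition winners are compared. The Gini criterion, the "value > 0" split rule, and all names and data are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_impurity(column: List[int], labels: List[int]) -> float:
    """Weighted Gini impurity after splitting samples on column value > 0."""
    left = [y for x, y in zip(column, labels) if x == 0]
    right = [y for x, y in zip(column, labels) if x > 0]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def best_in_partition(partition: Dict[str, List[int]], labels: List[int]) -> Tuple[float, str]:
    """A worker sees only its own columns plus the small label vector."""
    return min((split_impurity(col, labels), name) for name, col in partition.items())


def best_split(partitions: List[Dict[str, List[int]]], labels: List[int]) -> str:
    """Combine the per-partition winners into the globally best feature."""
    return min(best_in_partition(p, labels) for p in partitions)[1]


# Toy data (hypothetical): 6 samples, 4 features spread over 2 column partitions.
labels = [1, 1, 1, 0, 0, 0]
partitions = [
    {"v1": [0, 1, 0, 1, 0, 1], "v2": [1, 1, 1, 0, 0, 0]},
    {"v3": [1, 1, 0, 0, 0, 1], "v4": [0, 0, 0, 0, 0, 1]},
]
print(best_split(partitions, labels))  # v2
```

The point of the layout is that no worker ever needs the full feature set in memory: only the label vector and one small result per partition travel between workers.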
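
Slide 27 states that each search can be broken down into parallel tasks that each take seconds. The following sketch shows that fan-out and merge pattern on a toy motif search. It is not GT-Scan2's code: local threads stand in for serverless cloud functions, and the function names, the window size, and the example sequence are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def find_sites(sequence: str, motif: str, start: int, end: int) -> List[int]:
    """One independent task: positions in [start, end) where the motif begins."""
    last_start = len(sequence) - len(motif) + 1
    return [i for i in range(start, min(end, last_start))
            if sequence.startswith(motif, i)]


def parallel_search(sequence: str, motif: str, chunk: int = 8, workers: int = 4) -> List[int]:
    """Fan the search out over fixed-size windows, then merge the results."""
    starts = range(0, len(sequence), chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the window order, so the merged positions stay sorted.
        parts = list(pool.map(lambda s: find_sites(sequence, motif, s, s + chunk), starts))
    return [pos for part in parts for pos in part]


print(parallel_search("ACGGTTAGGCCGGA", "GG", chunk=4))  # [2, 7, 11]
```

Each task reports only matches that start inside its own window but may read past the window's end, so a motif straddling a boundary is found exactly once.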