
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser University


Deep Learning Applications in Genomics

Published in: Science


  1. 1. Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience
  2. 2. GOOGLE CONFIDENTIAL Google Cloud Run your apps on the same system as Google
  3. 3. Google confidential │ Do not distribute Google is good at handling massive volumes of data: 400 hrs of uploads per minute, 500M+ users, 100PB+ search index, 0.25s query response time
  4. 4. Google confidential │ Do not distribute Google can handle massive volumes of genomic data: 400 hrs of uploads per minute (~8 WGS), 500M+ users (>100x US PhDs), 100PB+ search index (~1M WGS), 0.25s query response time (0.25s)
  5. 5. Deep Neural Networks: Algorithms that Learn ● Modernization of artificial neural networks ● Made of simple mathematical units, organized in layers, that together can compute some (arbitrary) function ● More layers = deeper = more general ● Learn from raw, heterogeneous data
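The idea of simple units organized in layers can be illustrated with a minimal sketch. It assumes a two-layer network with hand-picked weights (not learned, and not from the slides) that computes XOR, a function no single linear unit can represent. The names `dense`, `relu`, and `forward` are hypothetical.

```python
def dense(inputs, weights, biases):
    """One layer: every unit computes a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Simple non-linearity applied to each unit's output."""
    return [max(0.0, v) for v in values]


def forward(x):
    """Two stacked layers with hand-picked (assumed) weights that compute XOR."""
    # Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
    hidden = relu(dense(x, [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]))
    # Output layer: h1 - 2 * h2
    return dense(hidden, [[1.0, -2.0]], [0.0])[0]


if __name__ == "__main__":
    for pair in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
        print(pair, forward(pair))
```

In a real deep network the weights would be learned from data rather than set by hand; the sketch only shows how layers of simple units compose into a more expressive function.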
  6. 6. Image understanding is (getting) better than human level. ImageNet Challenge: given an image, predict one of 1000+ classes. Chart axis: % errors. * Human performance based on analysis done by Andrej Karpathy.
  7. 7. ImageNet Challenge: “Given an image, predict one of 1000+ classes” Image credit: 360phot0.blogspot.com
  8. 8. AI & ML: what you need to know. Artificial Intelligence: make intelligent machines; programming a computer to be intelligent is hard. Machine Learning: make machines learn; programming a computer to learn to be intelligent is easier, and progress is measurable.
  9. 9. Google confidential │ Do not distribute Google Genomics August 2015
  10. 10. Google confidential │ Do not distribute Google Genomics is more than infrastructure. General-purpose cloud infrastructure; Genomics-specific features; Genomics API; Virtual Machines & Storage; Data Services & Tools
  11. 11. Google confidential │ Do not distribute BioQuery Analysis Engine Medical Records Genomics Devices Imaging Patient Reports Baseline Study Data Private Data Pharma Health Providers … Google’s vision to tackle complex health data Public Data
  12. 12. Google confidential │ Do not distribute BioQuery Analysis Engine Medical Records Genomics Devices Imaging Patient Reports Baseline Study Data Private Data Pharma Health Providers … Google’s vision to tackle complex health data Public Data
  13. 13. CONFIDENTIAL & PROPRIETARY BASELINE STUDY: BIG DATA ANALYSIS. 3.75 TERABYTES PER HUMAN: 1.00 TB genome; 2.00 TB epigenome; 0.70 TB transcriptome; 0.06 TB metabolome; 0.04 TB proteome; ~1 MB standard lab tests. 5-year longitudinal study. Validate a pipeline to process complex phenotypic, biochemical, and genomic data ● Pilot Study (N=200) ○ Determine optimal biospecimen collection strategy for stable sampling and reproducible assays ○ Determine optimal assay methodology ○ Validate quality control methods ○ Validate device data against surrogate and primary endpoints ● Baseline Study (N=10,000+) ○ 6 cohorts from low to high risk for cardiovascular disease and cancer ○ Characterize human systems biology ○ Define normal values for a given parameter in heterogeneous states ○ Predict meaningful events ○ Validate wearable devices for human monitoring ○ Characterize transitions in disease state
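The storage figures on this slide can be turned into a rough capacity estimate. The sketch below assumes decimal units (1 PB = 1000 TB) and that the per-human figure scales linearly with cohort size; neither assumption is stated on the slide, and the function names are hypothetical. As transcribed, the five -omics components sum to 3.80 TB, slightly above the 3.75 TB headline; the slide does not explain the difference.

```python
from decimal import Decimal

# Per-person data volumes in TB, copied from the slide.
PER_PERSON_TB = {
    "genome": Decimal("1.00"),
    "epigenome": Decimal("2.00"),
    "transcriptome": Decimal("0.70"),
    "metabolome": Decimal("0.06"),
    "proteome": Decimal("0.04"),
}


def total_per_person_tb(components=PER_PERSON_TB):
    """Sum the listed components (the ~1 MB of lab tests is negligible)."""
    return sum(components.values(), Decimal("0"))


def cohort_total_pb(n_participants, per_person_tb):
    """Cohort-wide volume in PB, assuming 1 PB = 1000 TB and linear scaling."""
    return per_person_tb * n_participants / Decimal("1000")


if __name__ == "__main__":
    print("Sum of listed components (TB):", total_per_person_tb())
    headline = Decimal("3.75")  # headline figure from the slide
    print("Pilot, N=200 (PB):", cohort_total_pb(200, headline))
    print("Baseline, N=10,000 (PB):", cohort_total_pb(10000, headline))
```

Under these assumptions the pilot would need under 1 PB and the full baseline cohort tens of petabytes, which is the scale the infrastructure slides are addressing.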
  14. 14. Google confidential │ Do not distribute Knowledge: populations cluster together
  15. 15. Bioinformatics scientist: BigQuery enables fast tertiary analysis
  16. 16. Google Cloud Platform Dataflow + BigQuery. Dataflow: used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration (cloud.google.com/dataflow). BigQuery: run SQL queries against multi-terabyte datasets in seconds (cloud.google.com/bigquery).
  17. 17. Google Cloud Platform Dataflow + BigQuery. Dataflow: used for Extract, Transform, Load (ETL), analytics, real-time computation and process orchestration (cloud.google.com/dataflow). BigQuery: run SQL queries against multi-terabyte datasets in seconds (cloud.google.com/bigquery).
  18. 18. Google Cloud Platform Dataflow + BigQuery
  19. 19. TensorFlow: released in Nov. 2015; #1 repository in the “machine learning” category on GitHub
  20. 20. Style Transfer
  21. 21. Style Transfer
  22. 22. Transfer Learning: Quickly able to Learn New Concepts (“t-rex”, “quidditch”). Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
  23. 23. TensorFlow powered Cucumber Sorter
  24. 24. ⬇40% Data Center cooling energy ⬆15% Power Usage Effectiveness (PUE) Google’s Carbon-Neutral, Self-Optimizing Data Centers The Dalles, Oregon, USA
  25. 25. Google Cloud Platform Verily: Assisting Pathologists in Detecting Cancer with Deep Learning research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html Prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. The model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.
  26. 26. Google Cloud Platform Integration with Geospatial, Management, and Terrestrial Sensor Data anezconsulting.com/precision-agronomy/
  27. 27. Google Cloud Platform Descartes Labs - Google Cloud Customer medium.com/@stevenpbrumby/corn-in-the-usa-d487dce84ee1 Cloud ML Engine TensorFlow
  28. 28. Google Cloud Platform Phenomobile, http://www.mdpi.com/2073-4395/4/3/349/htm See also: http://www.genomes2fields.org/
  29. 29. Google Cloud Platform Temporo-Spatial Imaging of Growing Plants
  30. 30. Genomics & Genetics Problems: How to Start Applying DNNs? Must-haves for deep learning: ● Lots of data: >50k examples, >1M examples ideal ● High-quality input and labels for training ● Label ~ F(data): the function F is unknown, but one certainly exists ● High-quality previous efforts, so we know that DNNs are key ○ i.e. the problem is hard to solve with classical statistical approaches. Example: SNP and indel calling from NGS data
  31. 31. Verily | Confidential & Proprietary Calling genetic variation may seem easy...
  32. 32. Verily | Confidential & Proprietary ... but lots of places in the genome are difficult
  33. 33. Creating a universal SNP and small indel variant caller with deep neural networks Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark DePristo, Verily Life Sciences, October 2016
  34. 34. DNN (Inception V3) Predicts True Genotype from Pileup Images { 0.001, 0.994, 0.005 } { 0.001, 0.990, 0.009 } { 0.000, 0.001, 0.999 } { 0.600, 0.399, 0.001 } Output: Probability of diploid genotype states { HOM_REF, HET, HOM_VAR } Raw pixels Input: Millions of labeled pileup images from gold standard samples
  35. 35. Verily | Confidential & Proprietary Using deep learning for ultra-accurate mutation detection Input: Millions of labeled pileup image stacks from gold standard sample Raw pixels { 0.001, 0.994, 0.005 } { 0.001, 0.990, 0.009 } { 0.000, 0.001, 0.999 } { 0.600, 0.399, 0.001 } Output: Probability distribution over the three diploid genotype states { HOM_REF, HET, HOM_VAR }
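How a three-way probability output might become a genotype call can be sketched as follows. The state order and the four probability triples come from the slide. Picking the most probable state and reporting a Phred-scaled confidence capped at 99 are assumptions for illustration, not a description of DeepVariant's actual post-processing, and `call_genotype` is a hypothetical name.

```python
import math

# State order as given on the slide.
GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")


def call_genotype(probs, max_quality=99):
    """Return (most probable genotype, Phred-scaled confidence).

    The Phred scaling and the cap are illustrative assumptions.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    p_error = 1.0 - probs[best]
    if p_error <= 0.0:
        quality = max_quality
    else:
        quality = min(max_quality, round(-10 * math.log10(p_error)))
    return GENOTYPES[best], quality


if __name__ == "__main__":
    slide_outputs = [
        (0.001, 0.994, 0.005),
        (0.001, 0.990, 0.009),
        (0.000, 0.001, 0.999),
        (0.600, 0.399, 0.001),
    ]
    for probs in slide_outputs:
        print(probs, call_genotype(probs))
```

The last example from the slide shows why a confidence value is useful: HOM_REF is the most probable state, but with a probability of only 0.600 the call is weak and would likely be filtered or flagged downstream.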
  36. 36. Verily | Confidential & Proprietary Example DNA read pileup “images” true snps true indels false variants red = {A,C,G,T}. green = {quality score}. blue = {read strand}. alpha = {matches ref genome}.
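The channel scheme on this slide (red = base, green = quality score, blue = read strand, alpha = matches reference) can be sketched as a per-base encoder. The specific intensity values, the quality cap of 40, and the assumption that reads are already aligned gap-free to the reference window are all hypothetical choices made for illustration; the slide gives only the channel meanings.

```python
# Hypothetical intensity per base; the slide only says red encodes {A,C,G,T}.
BASE_TO_RED = {"A": 250, "C": 180, "G": 100, "T": 30}


def encode_base(base, quality, is_reverse_strand, matches_ref, max_quality=40):
    """Encode one aligned base as an (R, G, B, A) pixel with 0-255 channels."""
    red = BASE_TO_RED[base]
    # Green scales with base quality, capped at max_quality (assumed cap).
    green = round(255 * min(quality, max_quality) / max_quality)
    # Blue distinguishes the two strands (assumed values).
    blue = 70 if is_reverse_strand else 240
    # Bases matching the reference are drawn fainter so mismatches stand out.
    alpha = 51 if matches_ref else 255
    return (red, green, blue, alpha)


def encode_read(bases, qualities, is_reverse_strand, reference):
    """Encode one read as a row of pixels; assumes a gap-free alignment."""
    return [encode_base(b, q, is_reverse_strand, b == r)
            for b, q, r in zip(bases, qualities, reference)]


def encode_pileup(reads, reference):
    """Stack reads into an image: one row per (bases, qualities, strand) read."""
    return [encode_read(bases, quals, rev, reference)
            for bases, quals, rev in reads]


if __name__ == "__main__":
    reference = "ACGA"
    reads = [("ACGT", [40, 10, 0, 60], True), ("ACGA", [30, 30, 30, 30], False)]
    for row in encode_pileup(reads, reference):
        print(row)
```

Each row of the resulting grid corresponds to one read, so a true heterozygous SNP would appear as a column in which roughly half the rows carry a fully opaque, differently colored pixel.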
  37. 37. Verily | Confidential & Proprietary PrecisionFDA: unique opportunity with blinded truth sample NA12878
  38. 38. Verily | Confidential & Proprietary DeepVariant won an award at PrecisionFDA competition 99.85 99.70 98.91 ● Overall F-measure combines SNP and indel performance ● Blinded sample shows no overfitting to NA12878 with Verily’s pipelines
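For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts. The slide does not say how SNP and indel results are combined, so pooling the counts across variant types is an assumption, and the counts in the usage example are made up for illustration, not PrecisionFDA results.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when nothing was called correctly."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def pooled_f_measure(counts_by_type):
    """Pool (TP, FP, FN) counts over variant types, then compute one F-measure.

    Pooling is an assumed way of combining SNP and indel performance.
    """
    tp = sum(c[0] for c in counts_by_type.values())
    fp = sum(c[1] for c in counts_by_type.values())
    fn = sum(c[2] for c in counts_by_type.values())
    return f_measure(tp, fp, fn)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    counts = {"snp": (90, 10, 10), "indel": (10, 0, 0)}
    print("SNP F-measure:", f_measure(*counts["snp"]))
    print("Pooled F-measure:", pooled_f_measure(counts))
```

Because SNPs vastly outnumber indels in a typical genome, a pooled score is dominated by SNP performance, which is one reason results are often reported per variant type as well.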
  39. 39. Google confidential │ Do not distribute Example: GATK Analysis Pipeline. Old way: install applications on host; Makefiles, CWL, WDL (on a virtual machine). New way: deploy containers; Dockerflow: Dataflow + Docker. Benefits: ● Decouple process management from host configuration ● Portable across OS distros and clouds ● Consistent environment from development to production ● Immutable images
  40. 40. To run it: > java -jar target/dockerflow*dependencies.jar --project=YOUR_PROJECT --workflow-file=hello.yaml --workspace=gs://YOUR_BUCKET/YOUR_FOLDER --runner=DataflowPipelineRunner Diagram labels: Sequencer, DNA Reads, PubSub Queue, Genomics API, Your Aligner, Your Variant Caller, Your Other Tool, Variant Calls, BigQuery
  41. 41. Public Datasets Project https://cloud.google.com/bigquery/public-data/ A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free).
  42. 42. Confidential & Proprietary Google Cloud Platform Platinum Genomes 1000 Genomes Medical (Human) Population-scale Genome Projects 1000 Bulls 10K Dog Genomes Veterinary / Agriculture Open Cannabis Project (see deck) 3K Rice Genome To Fields Panzea (1000 Maize) Agriculture Personal Genome Project Human Microbiome Project NCBI GEO Human 100K Cancer Genome Atlas Many Other Interesting Datasets...
