20170406 Genomics@Google - KeyGene - Wageningen

  1. GOOGLE CONFIDENTIAL Google Cloud Platform lets you run your apps on the same system as Google
  2. So you can focus on what matters to your science
  3. Google confidential │ Do not distribute. Google is good at handling massive volumes of data: 400hrs of uploads per minute; 500M+ users; 100PB+ search index; 0.25s query response time
  4. Google can handle large amounts of genomic data: 400hrs of uploads per minute (~8 WGS); 500M+ users (>100x US PhDs); 100PB+ search index (~1M WGS); 0.25s query response time (0.25s)
  5. Google’s vision to tackle complex health data. Diagram labels: BioQuery Analysis Engine; Medical Records; Genomics; Devices; Imaging; Patient Reports; Baseline Study Data; Private Data; Pharma; Health Providers; …; Public Data
  6. Google Genomics is more than infrastructure. Diagram labels: General-purpose cloud infrastructure; Genomics-specific features; Genomics API; Virtual Machines & Storage; Data Services & Tools
  7. Information: principal coordinates analysis (1000 genomes)
  8. Knowledge: populations cluster together
  9. Bioinformatics scientist: BigQuery enables fast tertiary analysis
  10. Compute Transition / Transversion Ratio
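The transition/transversion (Ti/Tv) ratio named on this slide can be sketched in a few lines. This is a minimal illustration, not the query shown in the talk: the function names and the `(ref, alt)` tuple input are assumptions made for the example.

```python
# A transition swaps a purine for a purine (A<->G) or a pyrimidine for a
# pyrimidine (C<->T); every other single-base change is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """Return True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs):
    """Compute Ti/Tv over an iterable of (ref, alt) single-base pairs.

    Hypothetical helper: skips anything that is not a simple SNV and
    returns None when there are no transversions to divide by.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a biallelic single-nucleotide change
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


# Made-up example input: three transitions, two transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
example_ratio = titv_ratio(example_snvs)
```

On the made-up input above, three transitions over two transversions gives a ratio of 1.5.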
  11. Exploring 1000 Genomes Variants: Count Homozygous and Heterozygous SNVs
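Counting homozygous and heterozygous SNVs could look roughly like the sketch below. It assumes VCF-style diploid genotype strings such as `0/1` or `1|1`; the function names and the `(ref, alt, genotype)` tuple layout are hypothetical, and this is not the query used in the talk.

```python
import re
from collections import Counter


def classify_genotype(gt):
    """Classify a VCF-style diploid genotype string.

    Returns "HOM_REF", "HET", "HOM_ALT", or None for missing or
    non-diploid calls.
    """
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or "." in alleles:
        return None
    first, second = alleles
    if first != second:
        return "HET"
    return "HOM_REF" if first == "0" else "HOM_ALT"


def count_snv_zygosity(calls):
    """Count heterozygous and homozygous-alternate SNV calls.

    `calls` is an iterable of (ref, alt, genotype) tuples. Records whose
    ref or alt is longer than one base (indels, multi-allelic sites
    written as "A,T") are skipped in this simplified sketch.
    """
    counts = Counter()
    for ref, alt, gt in calls:
        if len(ref) != 1 or len(alt) != 1:
            continue
        kind = classify_genotype(gt)
        if kind in ("HET", "HOM_ALT"):
            counts[kind] += 1
    return counts


# Made-up calls for one sample.
example_calls = [
    ("A", "G", "0/1"),   # heterozygous SNV
    ("C", "T", "1/1"),   # homozygous-alternate SNV
    ("G", "A", "0|1"),   # heterozygous SNV (phased)
    ("T", "C", "0/0"),   # homozygous reference, not counted
    ("AT", "A", "0/1"),  # deletion, skipped
    ("G", "C", "./."),   # missing genotype, not counted
]
example_counts = count_snv_zygosity(example_calls)
```

For the made-up calls this counts two heterozygous and one homozygous-alternate SNV.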
  12. Source: Greg McInnes, Stanford Center for Genomics and Personalized Medicine
  13. Verily. Intro to machine learning and deep learning. Observation: programming a computer to be clever is harder than programming a computer to learn to be clever.
  14. Machine learning is powerful; features are hard. Diagram labels: Data; Features; Learning algorithm; Predictions. Feature engineering: coming up with features is difficult, time-consuming, and requires expert knowledge. When applying machine learning, we spend a lot of time tuning the features.
  15. The deep learning revolution
      ● Modern reincarnation of neural networks
      ● Collection of simple trainable mathematical units, organized in layers, that collaborate to compute a complicated function
      ● Learns features from raw, heterogeneous data
      ● Loosely inspired by what (little) we know about the brain
  16. TensorFlow powered Cucumber Sorter
  17. Google’s Carbon-Neutral, Self-Optimizing Data Centers (The Dalles, Oregon, USA): ⬇40% Data Center cooling energy; ⬆15% Power Usage Effectiveness (PUE)
  18. Agronometric Integration ● Satellite & UAV Images ● Geological Data ● Meteorological & Sensor Data ● Cultivar Data ● Other GIS Data ● Yield Data
  19. TensorFlow
  20. Public Datasets Project. A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free).
  21. GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
  22. Verily | Confidential & Proprietary. Motivation
      ● Variant calling in next-generation sequencing:
        ○ Well-understood, hard inference problem in genomics.
        ○ Significant statistical modeling component.
        ○ Lots of opportunity for improvements.
      ● DeepVariant:
        ○ Teach deep learning to call variants using aligned NGS reads.
  23. Calling genetic variation may seem easy...
  24. ... but lots of places in the genome are difficult
  25. Creating a universal SNP and small indel variant caller with deep neural networks. Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark DePristo. Verily Life Sciences, October 2016
  26. DNN (Inception V3) Predicts True Genotype from Pileup Images. Input: millions of labeled pileup images from gold standard samples (raw pixels). Output: probability of diploid genotype states { HOM_REF, HET, HOM_VAR }. Example outputs: { 0.001, 0.994, 0.005 }, { 0.001, 0.990, 0.009 }, { 0.000, 0.001, 0.999 }, { 0.600, 0.399, 0.001 }
  27. Using deep learning for ultra-accurate mutation detection. Input: millions of labeled pileup image stacks from gold standard sample (raw pixels). Output: probability distribution over the three diploid genotype states { HOM_REF, HET, HOM_VAR }. Example outputs: { 0.001, 0.994, 0.005 }, { 0.001, 0.990, 0.009 }, { 0.000, 0.001, 0.999 }, { 0.600, 0.399, 0.001 }
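The four output vectors on this slide can be turned into genotype labels by taking the most probable of the three states. Picking the arg-max is an assumption made for illustration; the slides do not say how DeepVariant converts probabilities into final calls, and `call_genotype` is a hypothetical helper.

```python
# Order of the three diploid genotype states as listed on the slide.
GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")


def call_genotype(probs):
    """Return (label, probability) for the most likely genotype state."""
    if len(probs) != len(GENOTYPES):
        raise ValueError("expected one probability per genotype state")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENOTYPES[best], probs[best]


# The four example probability vectors shown on the slide.
slide_outputs = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]

calls = [call_genotype(p) for p in slide_outputs]
labels = [label for label, _ in calls]
```

Under that assumption the four vectors map to HET, HET, HOM_VAR, and HOM_REF. The last one, with 0.600 against 0.399, is the kind of low-confidence call a real pipeline would likely flag or filter.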
  28. Example DNA read pileup “images”: true SNPs, true indels, false variants. Channel encoding: red = {A,C,G,T}; green = {quality score}; blue = {read strand}; alpha = {matches ref genome}.
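The channel scheme on this slide (red for base, green for quality, blue for strand, alpha for reference match) can be illustrated with a toy encoder. Every numeric value below is an assumption chosen for the example: the per-base intensities, the quality cap of 40, and the alpha levels are hypothetical and are not DeepVariant's actual encoding. Reads are assumed to be already aligned to the start of the reference window.

```python
# Hypothetical red-channel intensity for each base.
BASE_TO_RED = {"A": 250, "C": 180, "G": 100, "T": 30}


def encode_pileup(reference, reads, max_quality=40):
    """Encode aligned reads as rows of (R, G, B, A) pixel tuples.

    `reads` is a list of (bases, qualities, is_forward_strand) tuples,
    one pixel row per read and one pixel per reference position.
    """
    image = []
    for bases, quals, is_forward in reads:
        row = []
        for ref_base, base, qual in zip(reference, bases, quals):
            red = BASE_TO_RED.get(base, 0)                # which base
            green = round(255 * min(qual, max_quality) / max_quality)  # quality
            blue = 255 if is_forward else 0               # read strand
            alpha = 255 if base == ref_base else 64       # matches reference?
            row.append((red, green, blue, alpha))
        image.append(row)
    return image


# Made-up four-base window with two reads; the second read carries a
# G->T mismatch at the third position.
reference = "ACGT"
reads = [
    ("ACGT", [40, 40, 20, 10], True),
    ("ACTT", [30, 30, 30, 30], False),
]
image = encode_pileup(reference, reads)
```

The result is a rows-by-columns grid of RGBA tuples, which is the shape of input an image classifier expects once it is converted to an array.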
  29. PrecisionFDA: unique opportunity with blinded truth sample NA12878
  30. DeepVariant won an award at the PrecisionFDA competition. Values shown: 99.85, 99.70, 98.91.
      ● Overall F-measure combines SNP and indel performance
      ● Blinded sample shows no overfitting to NA12878 with Verily’s pipelines
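For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts, and combines SNPs and indels by pooling their counts. Pooling is an assumption: the slide does not say how the overall figure was combined, and the counts used are made up rather than taken from the competition.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pooled_f_measure(counts_by_class):
    """Pool (tp, fp, fn) counts across variant classes, then score.

    Hypothetical way of combining SNP and indel performance into one
    overall number.
    """
    tp = sum(c[0] for c in counts_by_class.values())
    fp = sum(c[1] for c in counts_by_class.values())
    fn = sum(c[2] for c in counts_by_class.values())
    return f_measure(tp, fp, fn)


# Made-up counts for illustration only: (tp, fp, fn) per variant class.
example_counts = {
    "snp": (990, 5, 10),
    "indel": (90, 10, 10),
}
snp_f = f_measure(*example_counts["snp"])
indel_f = f_measure(*example_counts["indel"])
overall_f = pooled_f_measure(example_counts)
```

Because SNPs usually far outnumber indels, a pooled score like this is dominated by SNP performance, which is one reason to report the per-class values alongside it.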
  31. DeepVariant has the best site discovery accuracy
      ● Verily’s internal assessment of precisionFDA submissions, focusing on variant discovery accuracy in the blinded truth sample