20170406 Genomics@Google - KeyGene - Wageningen

Genomics use cases for agriculture using deep learning (DeepVariant) and the Genomics API from Google Cloud

Published in: Science

  1. GOOGLE CONFIDENTIAL. Google Cloud Platform lets you run your apps on the same system as Google
  2. So you can focus on what matters to your science
  3. Google confidential │ Do not distribute. Google is good at handling massive volumes of data. Uploads per minute: 400hrs; users: 500M+; search index: 100PB+; query response time: 0.25s
  4. Google can handle large amounts of genomic data. Uploads per minute: 400hrs (~8 WGS); users: 500M+ (>100x US PhDs); search index: 100PB+ (~1M WGS); query response time: 0.25s (0.25s)
  5. Google’s vision to tackle complex health data: BioQuery Analysis Engine. Diagram labels: Medical Records, Genomics, Devices, Imaging, Patient Reports; Baseline Study Data, Private Data, Pharma, Health Providers, …, Public Data
  6. Google Genomics is more than infrastructure. Diagram labels: General-purpose cloud infrastructure; Genomics-specific features; Genomics API; Virtual Machines & Storage; Data Services & Tools
  7. Information: principal coordinates analysis (1000 genomes)
  8. Knowledge: populations cluster together
  9. Bioinformatics scientist: BigQuery enables fast tertiary analysis
  10. Compute Transition / Transversion Ratio
  11. Exploring 1000 Genomes Variants: Count Homozygous and Heterozygous SNVs
  12. Source: Greg McInnes, Stanford Center for Genomics and Personalized Medicine
  13. Verily. Intro to machine learning and deep learning. Observation: programming a computer to be clever is harder than programming a computer to learn to be clever.
  14. Machine learning is powerful; features are hard. Diagram labels: Data, Feature engineering, Features, Learning algorithm, Predictions. Coming up with features is difficult, time-consuming, and requires expert knowledge. When working with applications of learning, we spend a lot of time tuning the features.
  15. The deep learning revolution ● Modern reincarnation of neural networks ● Collection of simple trainable mathematical units, organized in layers, that collaborate to compute a complicated function ● Learns features from raw, heterogeneous data ● Loosely inspired by what (little) we know about the brain
  16. TensorFlow-powered Cucumber Sorter
  17. Google’s Carbon-Neutral, Self-Optimizing Data Centers (The Dalles, Oregon, USA): ⬇40% Data Center cooling energy; ⬆15% Power Usage Effectiveness (PUE)
  18. Agronometric Integration ● Satellite & UAV Images ● Geological Data ● Meteorological & Sensor Data ● Cultivar Data ● Other GIS Data ● Yield Data (source: anezconsulting.com/precision-agronomy/)
  19. TensorFlow: https://cloudplatform.googleblog.com/2015/11/startup-spotlight-Descartes-Labs-monitors-planet-Earths-resources-with-Google-Compute-Engine.html
  20. Public Datasets Project: https://cloud.google.com/bigquery/public-data/ A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free).
  21. GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto: https://www.youtube.com/watch?v=6KEvLURBenM
  22. Verily | Confidential & Proprietary. Motivation ● Variant calling in next-generation sequencing: ○ Well-understood, hard inference problem in genomics. ○ Significant statistical modeling component. ○ Lots of opportunity for improvements. ● DeepVariant: ○ Teach deep learning to call variants using aligned NGS reads.
  23. Calling genetic variation may seem easy...
  24. ... but lots of places in the genome are difficult
  25. Creating a universal SNP and small indel variant caller with deep neural networks. Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark DePristo. Verily Life Sciences, October 2016
  26. DNN (Inception V3) Predicts True Genotype from Pileup Images. Input: Millions of labeled pileup images from gold standard samples (raw pixels). Output: Probability of diploid genotype states { HOM_REF, HET, HOM_VAR }. Example outputs: { 0.001, 0.994, 0.005 } { 0.001, 0.990, 0.009 } { 0.000, 0.001, 0.999 } { 0.600, 0.399, 0.001 }
  27. Using deep learning for ultra-accurate mutation detection. Input: Millions of labeled pileup image stacks from gold standard sample (raw pixels). Output: Probability distribution over the three diploid genotype states { HOM_REF, HET, HOM_VAR }. Example outputs: { 0.001, 0.994, 0.005 } { 0.001, 0.990, 0.009 } { 0.000, 0.001, 0.999 } { 0.600, 0.399, 0.001 }
  28. Example DNA read pileup “images”: true SNPs, true indels, false variants. red = {A,C,G,T}. green = {quality score}. blue = {read strand}. alpha = {matches ref genome}.
  29. PrecisionFDA: unique opportunity with blinded truth sample. NA12878
  30. DeepVariant won an award at PrecisionFDA competition. Chart values: 99.85, 99.70, 98.91 ● Overall F-measure combines SNP and indel performance ● Blinded sample shows no overfitting to NA12878 with Verily’s pipelines
  31. DeepVariant has the best site discovery accuracy ● Verily’s internal assessment of precisionFDA submissions focusing on variant discovery accuracy in blinded truth sample
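
Slide 10 names the transition/transversion (Ti/Tv) ratio without showing how it is computed. The deck runs this kind of analysis in BigQuery; the sketch below is only a plain-Python illustration of the definition, not the query from the slides. The function names and the example SNV list are hypothetical.

```python
# Minimal sketch of a Ti/Tv computation (illustrative, not the slide's query).
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs):
    """Return transitions / transversions for (ref, alt) single-base pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a single-nucleotide substitution.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Hypothetical input: 3 transitions (A>G, C>T, G>A) and 2 transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
example_ratio = titv_ratio(example_snvs)  # 3 / 2
```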
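
Slide 11 mentions counting homozygous and heterozygous SNVs. As a rough illustration of what such a count involves, the sketch below classifies diploid genotype strings. It assumes VCF-style `GT` values such as `0/1` or `1|1`; that input format, the function names, and the example list are assumptions, not taken from the slides.

```python
import re
from collections import Counter


def classify_genotype(gt):
    """Classify a diploid VCF-style GT string (assumed format, e.g. '0/1')."""
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or "." in alleles:
        return "missing"
    first, second = alleles
    if first != second:
        return "het"
    # Both alleles identical: reference if '0', otherwise alternate.
    return "hom_ref" if first == "0" else "hom_alt"


def count_genotypes(genotypes):
    """Tally genotype classes over an iterable of GT strings."""
    return Counter(classify_genotype(gt) for gt in genotypes)


# Hypothetical genotypes for one sample across six sites.
example_gts = ["0/0", "0/1", "1|1", "1/0", "./.", "1/2"]
example_counts = count_genotypes(example_gts)
```

Note that `1/2` counts as heterozygous here because its two alleles differ, even though neither is the reference.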
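
Slides 26 and 27 show the network emitting a probability for each of the three diploid genotype states. One simple way to turn such a vector into a call is to pick the most probable state. The slides do not say how DeepVariant makes the final call, so the argmax rule and the function name below are assumptions; the probability vectors are the four shown on the slides.

```python
# Order of states as listed on the slides.
GENOTYPE_STATES = ("HOM_REF", "HET", "HOM_VAR")


def call_genotype(probs):
    """Return (state, probability) for the most probable genotype state.

    Argmax is an illustrative choice, not a confirmed DeepVariant rule.
    """
    if len(probs) != len(GENOTYPE_STATES):
        raise ValueError("expected one probability per genotype state")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENOTYPE_STATES[best], probs[best]


# The four example outputs shown on slides 26 and 27.
slide_outputs = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]
calls = [call_genotype(p) for p in slide_outputs]
```

The last vector is the interesting one: its top state has probability of only 0.600, which a downstream filter might treat as a low-confidence call.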
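
Slide 30 reports an overall F-measure that combines SNP and indel performance. The slides do not give the formula, so the sketch below assumes the usual F1 score (the harmonic mean of precision and recall) and assumes the two variant classes are combined by pooling their counts. All counts and names are hypothetical and do not reproduce the numbers on the slide.

```python
def f_measure(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives.

    Equivalent to 2 * precision * recall / (precision + recall).
    """
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        raise ValueError("no calls and no truth variants")
    return 2 * tp / denominator


def overall_f_measure(counts_by_class):
    """Pool (tp, fp, fn) across variant classes, then compute one F1.

    Pooling is an assumption about how the classes are combined.
    """
    tp = sum(c[0] for c in counts_by_class.values())
    fp = sum(c[1] for c in counts_by_class.values())
    fn = sum(c[2] for c in counts_by_class.values())
    return f_measure(tp, fp, fn)


# Hypothetical counts as (tp, fp, fn) per variant class.
example_counts_by_class = {"snp": (900, 5, 5), "indel": (90, 5, 5)}
example_overall = overall_f_measure(example_counts_by_class)
```

With pooled counts the more numerous class dominates the score, so per-class F-measures are still worth reporting next to the overall figure.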
