20170406 Genomics@Google - KeyGene - Wageningen

GOOGLE CONFIDENTIAL
Google Cloud Platform lets you run your apps on the
same system as Google

GOOGLE CONFIDENTIAL
So you can focus on what matters
to your science

Google confidential │ Do not distribute
Google is good at handling massive volumes of data
uploads per minute
users
search index
query response time
400hrs
500M+
100PB+
0.25s

Google can handle large amounts of genomic data
uploads per minute
users
search index
query response time
400hrs
500M+
100PB+
0.25s
~8WGS
>100x US PhDs
~1M WGS
0.25s

BioQuery Analysis Engine
Medical Records Genomics Devices Imaging Patient Reports
Baseline Study Data Private Data
Pharma Health Providers …
Google’s vision to tackle complex health data
Public Data

Google Genomics is more than infrastructure
General-purpose
cloud infrastructure
Genomics-specific
featuresGenomics API
Virtual Machines & Storage
Data Services & Tools

Information: principal coordinates analysis (1000 genomes)

Knowledge: populations cluster together

Bioinformatics scientist: BigQuery enables fast tertiary analysis

Compute Transition / Transversion Ratio

Exploring 1000 Genomes Variants
Count Homozygous and Heterozygous SNVs

Source: Greg McInnes, Stanford Center for Genomics and Personalized Medicine

Verily
Observation: programming a computer to be clever is harder than
programming a computer to learn to be clever.
Intro to machine learning and deep learning

Verily
Data Features Predictions
Learning
algorithm
Feature
engineering
Coming up with features is difficult, time-consuming, and requires expert knowledge.
When working with application of learning, we spend a lot of time tuning the features.
Machine learning is powerful; features are hard

Verily
● Modern reincarnation of neural networks
● Collection of simple trainable mathematical
units, organized in layers, that collaborate to
compute a complicated function
● Learns features from raw, heterogeneous data
● Loosely inspired by what (little) we know
about the brain
The deep learning revolution

TensorFlow powered Cucumber Sorter

⬇40% Data Center cooling energy
⬆15% Power Usage Effectiveness (PUE)
Google’s Carbon-Neutral, Self-Optimizing Data Centers
The Dalles, Oregon, USA

anezconsulting.com/precision-agronomy/
Agronometric Integration
● Satellite & UAV
Images
● Geological Data
● Meteorological
& Sensor Data
● Cultivar Data
● Other GIS Data
● Yield Data

TensorFlow
https://cloudplatform.googleblog.com/2015/11/startup-spotlight-Descartes-Labs-monitors-planet-Earths-resources-with-Google-Compute-Engine.html

Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a
special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications.
Google pays for the storage of these data sets and provides public access to the data via BigQuery. You pay only for the
queries that you perform on the data (the first 1TB per month is free)

GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
https://www.youtube.com/watch?v=6KEvLURBenM

Verily | Confidential & Proprietary
Motivation
● Variant calling in next-generation sequencing:
○ Well-understood, hard inference problem in genomics.
○ Significant statistical modeling component.
○ Lots of opportunity for improvements
● DeepVariant:
○ Teach deep learning to call variants using aligned NGS reads

Calling genetic variation may seem easy...

... but lots of places in the genome are difficult

Creating a universal SNP and small indel
variant caller with deep neural networks
Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy,
Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark
DePristo, Verily Life Sciences, October 2016

DNN (Inception V3) Predicts True Genotype from Pileup Images
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
Output:
Probability of diploid
genotype states
{ HOM_REF, HET, HOM_VAR }
Raw pixels
Input:
Millions of labeled pileup
images from gold standard
samples

Using deep learning for ultra-accurate mutation detection
Input:
Millions of labeled
pileup image
stacks from gold
standard sample
Raw pixels
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
Output:
Probability distribution
over the three diploid
genotype states
{ HOM_REF, HET, HOM_VAR }
28

Example DNA read pileup “images”
true snps true indels false variants
red = {A,C,G,T}. green = {quality score}. blue = {read strand}.
alpha = {matches ref genome}.

PrecisionFDA: unique opportunity with blinded truth sample
NA12878

DeepVariant won an award at PrecisionFDA competition
99.85
99.70
98.91
● Overall F-measure
combines SNP and
indel performance
● Blinded sample
shows no
overfitting to
NA12878 with
Verily’s pipelines
31

DeepVariant has the best site discovery accuracy
● Verily’s internal
assessment of
precisionFDA
submissions
focusing on
variant
discovery
accuracy in
blinded truth
sample

20170406 Genomics@Google - KeyGene - Wageningen

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 20170406 Genomics@Google - KeyGene - Wageningen

Similar to 20170406 Genomics@Google - KeyGene - Wageningen (20)

More from Allen Day, PhD

More from Allen Day, PhD (19)

Recently uploaded

Recently uploaded (20)

20170406 Genomics@Google - KeyGene - Wageningen