20170402 Crop Innovation and Business - Amsterdam

Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience

GOOGLE CONFIDENTIAL
Google Cloud
Run your apps on the same system as Google

Environments
Genotypes
Quantifying Phenotypes
a Googler’s perspective

Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes
Diagram labels (Breeding and Genotyping Lab):
● Grow
● Sample tissue
● Extract DNA
● Generate Marker Fingerprint
● Analyze & Model Data
● Select & Recombine
Cloud ML
TensorFlow
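
To make the claim that marker-assisted breeding raises the frequency of favorable genes concrete, here is a minimal worked sketch. It is not from the slides: it assumes a single biallelic marker, Hardy-Weinberg genotype proportions, random mating among the selected plants, and that every plant lacking the favorable allele is discarded each generation. The function names are illustrative only.

```python
from typing import List


def next_generation_frequency(p: float) -> float:
    """Favorable-allele frequency after one round of marker-based selection.

    Assumptions (illustrative, not from the slides): one biallelic marker,
    Hardy-Weinberg proportions, and all plants without the favorable
    allele are removed before mating.
    """
    if p <= 0.0:
        return 0.0  # nothing to select on
    q = 1.0 - p
    # Survivors are AA (p^2) and Aa (2pq). The favorable allele A makes up
    # all alleles in AA and half of the alleles in Aa.
    return (p * p + p * q) / (p * p + 2.0 * p * q)


def simulate(p0: float, generations: int) -> List[float]:
    """Return the allele frequency at generation 0, 1, ..., generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_generation_frequency(freqs[-1]))
    return freqs


if __name__ == "__main__":
    for gen, p in enumerate(simulate(0.1, 5)):
        print(f"generation {gen}: favorable allele frequency = {p:.3f}")
```

Under these simplifying assumptions the frequency climbs quickly in the first few generations; real breeding programs deal with many markers, linkage, and finite populations, which this sketch ignores.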

AI & ML
what you need to know
Machine Learning: Make Machines Learn
Artificial Intelligence: Make Intelligent Machines
programming a computer to be intelligent is hard
programming a computer to learn to be intelligent is easier and progress is measurable

Image understanding is (getting) better than human level
ImageNet Challenge: Given an image, predict one of 1000+ classes
Chart axis: % errors
* Human performance based on analysis done by Andrej Karpathy.

Deep Neural Networks: Algorithms that Learn
● Modernization of artificial neural networks
● Made of simple mathematical units, organized in layers, that together can compute some (arbitrary) function
● more layers = deeper = more general
● Learn from raw, heterogeneous data
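
The idea of simple units organized in layers can be sketched in a few lines of plain Python. This is an illustrative forward pass only (no training), with hand-picked weights that are assumptions for the example; real networks are built with a framework such as TensorFlow.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def dense(inputs: Sequence[float], weights: List[List[float]],
          biases: List[float]) -> List[float]:
    """One layer of units: each unit is a weighted sum of inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values: Sequence[float]) -> List[float]:
    """Simple non-linearity applied after each hidden layer."""
    return [max(0.0, v) for v in values]


def softmax(values: Sequence[float]) -> List[float]:
    """Turn the final layer's scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Sequence[float], layers: List[Layer]) -> List[float]:
    """Stack layers: more entries in `layers` means a deeper network."""
    *hidden, last = layers
    for weights, biases in hidden:
        x = relu(dense(x, weights, biases))
    weights, biases = last
    return softmax(dense(x, weights, biases))


if __name__ == "__main__":
    # Hypothetical, hand-picked weights: 2 inputs -> 2 hidden units -> 2 classes.
    layers = [
        ([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
        ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ]
    print(forward([1.0, 2.0], layers))
```

Each layer is only a weighted sum followed by a simple non-linearity; the expressive power comes from stacking many of them and learning the weights from data.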

ImageNet Challenge
“Given an image, predict one of 1000+ classes”
Image credit: 360phot0.blogspot.com

TensorFlow
Released in Nov. 2015
#1 repository for “machine learning” category on GitHub

Transfer Learning
Quickly able to Learn New Concepts
“t-rex” “quidditch”
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

TensorFlow powered Cucumber Sorter

Genomics & Genetics Problems:
How to Start Applying DNNs?
Must-haves for deep learning:
● Lots of data: >50k examples, >1M examples ideal
● High-quality input and labels for training
● Label ~ F(data): F is unknown, but such a function certainly exists
● High-quality previous efforts, so we know that DNNs are key
○ i.e. hard to solve with classical statistical approaches
SNP and indel calling from NGS data

Environments
Phenotypes
Quantifying Genotypes

Creating a universal SNP and small indel
variant caller with deep neural networks
Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy,
Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark
DePristo, Verily Life Sciences, October 2016

DNN (Inception V3) Predicts True Genotype from Pileup Images
Input: Raw pixels. Millions of labeled pileup images from gold standard samples
Output: Probability of diploid genotype states { HOM_REF, HET, HOM_VAR }
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
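
The four probability triples on this slide can be turned into genotype calls by picking the most probable state. The snippet below is a minimal sketch of that last step only, assuming the triples are ordered { HOM_REF, HET, HOM_VAR } as the slide states; the function name is hypothetical and this is not DeepVariant's actual code.

```python
from typing import Sequence, Tuple

# State order as given on the slide.
GENOTYPE_STATES = ("HOM_REF", "HET", "HOM_VAR")

# The four example outputs shown on the slide.
EXAMPLES = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]


def call_genotype(probs: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable genotype state and its probability."""
    if len(probs) != len(GENOTYPE_STATES):
        raise ValueError("expected one probability per genotype state")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENOTYPE_STATES[best], probs[best]


if __name__ == "__main__":
    for probs in EXAMPLES:
        state, confidence = call_genotype(probs)
        print(f"{probs} -> {state} (p = {confidence:.3f})")
```

The last example (0.600 vs 0.399) shows why the probabilities are worth keeping alongside the call: it is a much less confident call than the other three.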

DeepVariant #1 in PrecisionFDA Truth Challenge
● v2 => v3 truth set for unblinded sample
● Unblinded => blinded sample with v3 truth set
Chart values: 99.85, 99.70, 98.91

Quantifying Genotypes & Phenotypes
Optimizing Environments

⬇40% Data Center cooling energy
⬆15% Power Usage Effectiveness (PUE)
Google’s Carbon-Neutral, Self-Optimizing Data Centers
The Dalles, Oregon, USA

Agronometric Integration
● Satellite & UAV Images
● Geological Data
● Meteorological & Sensor Data
● Cultivar Data
● Other GIS Data
● Yield Data
anezconsulting.com/precision-agronomy/

TensorFlow
https://cloudplatform.googleblog.com/2015/11/startup-spotlight-Descartes-Labs-monitors-planet-Earths-resources-with-Google-Compute-Engine.html

Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free).
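
The billing rule described above (storage is covered, queries are billed, and the first 1TB per month is free) amounts to simple arithmetic. A minimal sketch, assuming you already know how many terabytes each query scanned; the function name is hypothetical and no price per terabyte is assumed here.

```python
from typing import Sequence

# Free query allowance per month, as stated on the slide.
FREE_TB_PER_MONTH = 1.0


def billable_tb(scanned_tb_per_query: Sequence[float],
                free_tb: float = FREE_TB_PER_MONTH) -> float:
    """Terabytes scanned in a month beyond the free allowance."""
    total = sum(scanned_tb_per_query)
    return max(0.0, total - free_tb)


if __name__ == "__main__":
    # Two small exploratory queries stay inside the free tier.
    print(billable_tb([0.2, 0.5]))
    # Two larger queries exceed it.
    print(billable_tb([0.75, 0.75]))
```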

Environments
Genotypes
Optimizing Phenotypes

Marker-assisted selection for quantitative traits
Occurrence of “Marker Assisted Selection” & “Quantitative Trait Locus” in Literature is Increasing

GraphConnect SF 2015 / Graphs Are Feeding The World, Tim Williamson, Data Scientist, Monsanto
https://www.youtube.com/watch?v=6KEvLURBenM

Percolate Streaming Sequencer Reads for Real-time Model Updates
Pipeline diagram labels:
● Sequencer Reads
● PubSub Queue
● Genomics APIs, Docker
● BigQuery
● Cloud ML
● Revise Models
● Models
● MAB
● Enhance
Google confidential │ Do not distribute
Google is good at handling massive volumes of data
● uploads per minute: 300 hrs
● users: 500M+
● search index: 100PB+
● query response time: 0.25s

Google can Handle Massive Amounts of Genomic Data
● uploads per minute: 300 hrs (~6 Maize WGS)
● users: 500M+ (>100x US PhDs)
● search index: 100PB+ (~1M WGS)
● query response time: 0.25s (0.25s)

Percolate Streaming Sequencer Reads for Real-time Model Updates
Who Else Needs This?
Pipeline diagram labels:
● Sequencer Reads
● PubSub Queue
● Genomics APIs, Docker
● BigQuery
● Cloud ML
● Revise Models
● Models
● MAB
● Enhance

New Public Dataset: 1K Cannabis
cloud.google.com/bigquery/public-data/1000-cannabis
Blog Post @ Medium:
DNA Sequencing of 1K Cannabis Strains publicly available in Google BigQuery
Open Source:
https://github.com/allenday/bfx-seq
Diagram labels: DNA Reads, Revise Models

Build What’s Next
Thank You!
Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience
