This document discusses applying machine learning and artificial intelligence techniques like deep neural networks to problems in genomics and agriculture. It provides examples of using Google Cloud platforms and services for storing and analyzing large genomic datasets, as well as developing models for tasks like variant calling from sequencing data and marker-assisted breeding. The document advocates that Google is well-positioned to handle massive volumes of genomic and agricultural data and help advance the application of AI in these domains.
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
Cloud Accelerated Genomics by Allen Day of Google (Data Con LA)
Abstract:
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
Bio:
Allen Day is a Science Advocate with Google Cloud. He's a professional software developer and storyteller with expertise in computational biology, statistics, and distributed computing. Prior to joining Google in Seattle, Allen was based in Singapore as Chief Scientist at MapR, a Silicon Valley BigData platform company.
4. Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes
[Diagram labels: Breeding; Genotyping Lab; Grow; Sample tissue; Extract DNA; Generate Marker Fingerprint; Analyze & Model Data; Select & Recombine; Cloud ML; TensorFlow]
5. AI & ML: what you need to know
Artificial Intelligence: Make Intelligent Machines
Machine Learning: Make Machines Learn
Programming a computer to be intelligent is hard.
Programming a computer to learn to be intelligent is easier, and progress is measurable.
6. Image understanding is (getting) better than human level
ImageNet Challenge: Given an image, predict one of 1000+ classes
[Chart: % errors]
* Human performance based on analysis done by Andrej Karpathy.
7. Deep Neural Networks: Algorithms that Learn
● Modernization of artificial neural networks
● Made of simple mathematical units, organized in layers, that together can compute some (arbitrary) function
● More layers = deeper = more general
● Learn from raw, heterogeneous data
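The idea of simple units organized in layers can be sketched in a few lines of plain Python. This is an illustration only, not code from the talk: the function names and the hand-picked weights are assumptions, and the weights are set by hand rather than learned. Each unit computes a weighted sum plus a bias, and two stacked layers compute XOR, a function that a single such unit cannot represent.

```python
def dense(inputs, weights, biases):
    """One layer: every unit computes a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Simple nonlinearity applied between layers."""
    return [max(0.0, v) for v in values]


def forward(x, layers):
    """Apply the layers in order, with ReLU after every layer except the last."""
    for i, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical, hand-picked (not learned) weights that make the network compute XOR.
XOR_LAYERS = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer: 2 units
    ([[1.0, -2.0]], [0.0]),                   # output layer: 1 unit
]

if __name__ == "__main__":
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            print(a, b, forward([a, b], XOR_LAYERS))
```

In a real deep network the weights are found by training on labeled data rather than written by hand, and there are many more units and layers.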
8. ImageNet Challenge
“Given an image, predict one of 1000+ classes”
Image credit: 360phot0.blogspot.com
9. TensorFlow
Released in Nov. 2015
#1 repository for the “machine learning” category on GitHub
11. Transfer Learning
Quickly able to Learn New Concepts
“t-rex”, “quidditch”
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
13. Marker-Assisted Breeding Rapidly Increases Frequency of Favorable Genes
[Diagram labels: Breeding; Genotyping Lab; Grow; Sample tissue; Extract DNA; Generate Marker Fingerprint; Analyze & Model Data; Select & Recombine; Cloud ML; TensorFlow]
14. Genomics & Genetics Problems:
How to Start Applying DNNs?
Must-haves for deep learning:
● Lots of data: >50k examples, >1M examples ideal
● High-quality input and labels for training
● Label ~ F(data): the function F is unknown, but one certainly exists
● High-quality previous efforts, so we know that DNNs are key
○ i.e. hard to solve with classical statistical approaches
SNP and indel calling from NGS data
17. Creating a universal SNP and small indel variant caller with deep neural networks
Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark DePristo
Verily Life Sciences, October 2016
18. DNN (Inception V3) Predicts True Genotype from Pileup Images
Input: raw pixels; millions of labeled pileup images from gold standard samples
Output: probability of diploid genotype states { HOM_REF, HET, HOM_VAR }
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
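A minimal sketch of how probability triples like these could be turned into genotype calls. It is not DeepVariant's actual post-processing: the function name is hypothetical, and the Phred-scaled confidence score is an assumption added for illustration. The call is simply the state with the highest probability, and the score grows as the probability of the other states shrinks.

```python
import math

# Order of states as listed on the slide.
GENOTYPE_STATES = ("HOM_REF", "HET", "HOM_VAR")


def call_genotype(probs):
    """Return (most likely genotype, Phred-scaled confidence) for one site.

    The Phred-style score is an illustrative assumption, not something
    stated on the slide: quality = -10 * log10(1 - p_best).
    """
    if len(probs) != len(GENOTYPE_STATES):
        raise ValueError("expected one probability per genotype state")
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Floor the error probability so that p_best == 1.0 does not hit log10(0).
    p_error = max(1.0 - probs[best], 1e-10)
    quality = -10.0 * math.log10(p_error)
    return GENOTYPE_STATES[best], quality


# The four example outputs from the slide.
EXAMPLES = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]

if __name__ == "__main__":
    for probs in EXAMPLES:
        genotype, quality = call_genotype(probs)
        print(probs, genotype, round(quality, 1))
```

The last example shows why a confidence score is useful: the most likely state is HOM_REF, but at probability 0.600 it is a much weaker call than the other three.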
19. DeepVariant #1 in PrecisionFDA Truth Challenge
[Chart values: 99.85, 99.70, 98.91. Chart annotations: "v2 => v3 truth set for unblinded sample"; "Unblinded => blinded sample with v3 truth set"]
24. Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications. Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the queries that you perform on the data (the first 1TB per month is free).
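The billing rule above (storage is paid by Google, queries are paid by the user, and the first 1TB per month is free) can be worked through as simple arithmetic. The sketch below rests on assumptions: the function names are hypothetical, 1 TB is taken as 10^12 bytes, and the per-TB price is a parameter because the slide does not state one.

```python
# Assumption: the slide's "1TB" is treated as 10**12 bytes.
BYTES_PER_TB = 10**12
FREE_BYTES_PER_MONTH = 1 * BYTES_PER_TB


def billable_bytes(query_sizes, free_bytes=FREE_BYTES_PER_MONTH):
    """Bytes scanned in a month beyond the free allowance (never negative)."""
    return max(0, sum(query_sizes) - free_bytes)


def monthly_query_cost(query_sizes, price_per_tb):
    """Cost of a month of queries; price_per_tb is a hypothetical rate."""
    return billable_bytes(query_sizes) / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    light_month = [4 * 10**11, 5 * 10**11]  # 0.9 TB scanned in total
    heavy_month = [8 * 10**11, 7 * 10**11]  # 1.5 TB scanned in total
    print(billable_bytes(light_month))      # stays inside the free tier
    print(monthly_query_cost(heavy_month, price_per_tb=10.0))
```

Because storage of the public datasets is not billed to the user, the bytes scanned by queries are the only input to the calculation.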
30. Google is good at handling massive volumes of data
Google confidential │ Do not distribute
uploads per minute: 300hrs
users: 500M+
search index: 100PB+
query response time: 0.25s
31. Google can Handle Massive Amounts of Genomic Data
uploads per minute: 300hrs (~6 Maize WGS)
users: 500M+ (>100x US PhDs)
search index: 100PB+ (~1M WGS)
query response time: 0.25s (0.25s)
33. New Public Dataset: 1K Cannabis
cloud.google.com/bigquery/public-data/1000-cannabis
Blog Post @ Medium: DNA Sequencing of 1K Cannabis Strains publicly available in Google BigQuery
Open Source: https://github.com/allenday/bfx-seq
[Diagram labels: DNA; Reads; Models; Revise]
34. Build What’s Next
Thank You!
Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience