
Deep learning in medicine: An introduction and applications to next-generation sequencing and disease diagnostics



Deep learning has enabled dramatic advances in image recognition performance. In this talk I will discuss using a deep convolutional neural network to detect genetic variation in aligned human next-generation sequencing read data. Our method, called DeepVariant, both outperforms existing genotyping tools and generalizes across genome builds and even to other species. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.

Published in: Science


  1. Confidential + Proprietary. Deep learning in medicine: An introduction and applications to next-generation sequencing and disease diagnostics. Allen Day, PhD, Twitter @allenday
  2. Google/Alphabet teams involved in healthcare: Brain, DeepMind, Cloud Healthcare, Verily, Calico
  3. The basics of ML
  4. Intro to machine learning and deep learning. Observation: programming a computer to be clever is harder than programming a computer to learn to be clever.
  5. Traditional machine learning vs. the new way. The old way: write a computer program with explicit rules to follow ("if email contains V!agrå then mark is-spam; if email contains …; if email contains …"). The new way: write a computer program to learn from examples ("try to classify some emails; change self to reduce errors; repeat").
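The "learn from examples" loop can be sketched in a few lines. This is a hypothetical toy (made-up features and data, a single logistic unit, plain gradient descent), not the actual spam system:

```python
# Toy "learn from examples" classifier: no hand-written rules, just weights
# adjusted to reduce errors. Features and data are hypothetical, e.g.
# [contains a spammy token, contains an ordinary word].
import math

examples = [([1.0, 0.0], 1), ([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]

w, b, lr = [0.0, 0.0], 0.0, 0.5

def predict(x):
    # Probability the email is spam, from a weighted sum of its features.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(200):                  # "change self to reduce errors; repeat"
    for x, y in examples:
        err = predict(x) - y          # signed error on this example
        for i in range(len(w)):
            w[i] -= lr * err * x[i]   # nudge each weight against the error
        b -= lr * err

print(round(predict([1.0, 0.0])))  # spam-like example -> 1
print(round(predict([0.0, 1.0])))  # ham-like example  -> 0
```

The point of the contrast: no rule about any particular token is ever written down; the weights move toward whatever separates the labeled examples.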
  6. Deep Neural Networks, Step 1: training
  7. Deep Neural Networks, Step 2: inference
  8. [Example image-classifier output: Tiger-Dog 0.9890, Tiger 0.9791, Dog 0.9311, Pet 0.8139, Fence 0.7998, …, Godzilla 0.0120]
  10. The Deep Learning Revolution: a modern reincarnation of artificial neural networks. A collection of simple trainable mathematical units, organized in layers, that work together to solve complicated tasks (e.g. mapping an image to the label "cat"). Key benefit: learns features from raw, heterogeneous data; no explicit feature engineering required. What's new: layered network architectures, new training math, and *scale*.
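The "simple trainable mathematical units, organized in layers" idea can be sketched directly. The weights below are illustrative values, not a trained model:

```python
# A minimal layered network: each layer computes weighted sums of its
# inputs plus a bias, passed through a simple nonlinearity (here, tanh).
import math

def layer(x, weights, biases):
    # One output value per row of weights; every unit sees every input.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.0]                                      # raw input features
h = layer(x, [[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1])  # hidden layer (2 units)
y = layer(h, [[0.7, -0.2]], [0.0])                   # output layer (1 unit)
print(len(h), len(y))  # 2 hidden units, 1 output unit
```

Training (as on the previous slides) means adjusting those weight values to reduce errors on labeled examples; inference means running this forward pass on new inputs.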
  11. [Chart: accuracy vs. scale (data size, model size) in the 1980s and 1990s: neural networks trail other approaches.]
  12. [Same chart, with more compute: the neural-network curve keeps rising with scale.]
  13. [Same chart, now: with today's compute and scale, neural networks surpass other approaches.]
  14. GoogLeNet (aka "Inception") architecture (Szegedy et al., 2014): "Inception" modules, auxiliary classifiers, and a main classifier outputting Pr(dog).
  15. ImageNet Challenge: given an image, predict one of 1000+ classes. Image understanding is getting better than human level. [Chart: % errors by year, with a human-performance line based on analysis by Andrej Karpathy. More details here.]
  16. Machine learning has transformed Google's products: Search (search ranking, speech recognition), Gmail (Smart Reply, spam classification), Photos (photos search), Translate (text, graphic, and speech translations), Android (keyboard & speech input), Drive (intelligence in apps), YouTube (video recommendations, better thumbnails), Cardboard (smart stitching), Play (app recommendations, game developer experience), Ads (richer text ads, automated bidding), Chrome (search by image), Maps (Street View image parsing, local search).
  17. Google in Health
  18. Medical applications of deep learning technology ● Deep learning has remarkable efficacy ○ Amazing with images: photos, search, Street View, Android cameras, … ○ And with speech, language, data centers, … ● How and where can we apply this in medicine and biotechnology? ○ Medical imaging: ophthalmology, pathology, ... ○ Genomics ○ ...
  19. Diabetes causes blindness ● 5-10% of the population is diabetic and should be screened annually for diabetic retinopathy, the fastest-growing cause of blindness. ● # Diabetics >> qualified graders: 387M diabetics vs. 200k ophthalmologists, and grading is highly technical. ● Poor adherence to care plan: no symptoms (care is preventive, not curative); 30-50% screened in the US, 10% in high-risk populations; many lost to follow-up.
  20. How DR is diagnosed: retinal fundus images, graded No DR / Mild DR / Moderate DR / Severe DR / Proliferative DR. [Images: healthy vs. diseased retina with hemorrhages.]
  21. Even when available, ophthalmologists are not consistent: intragrader consistency ~65%, intergrader ~60%. [Diagram: patient images graded by multiple ophthalmologist graders.]
  22. Adapt a deep neural network to read fundus images: a 26-layer convolutional network classifies No DR / Mild DR / Moderate DR / Severe DR / Proliferative DR. Training data: a labeling tool, 54 ophthalmologists, 130k images, 880k diagnoses.
  23. F-score: algorithm 0.95 vs. ophthalmologist (median) 0.91. "The study by Gulshan and colleagues truly represents the brave new world in medicine." (Dr. Andrew Beam, Dr. Isaac Kohane, Harvard Medical School) "Google just published this paper in JAMA (impact factor 37) [...] It actually lives up to the hype." (Dr. Luke Oakden-Rayner, University of Adelaide)
  24. Digital pathology. Example: breast cancer biopsies. 1 in 12 breast cancer biopsies is misdiagnosed (population-adjusted); similar for other cancer types (prostate 1 in 7, etc). [Chart from JAMA. 2015;313(11):1122-1132: correct-diagnosis rates of 87%, 48%, 84%, 96%, and 75% across diagnostic categories, with both overdiagnosis and underdiagnosis.]
  25. Detecting breast cancer metastases in lymph nodes (detail ←→ context: a multi-scale model resembles microscope magnifications). ● Goal: train a deep learning model to identify cancerous cells in pathology slide images. ● Output: a map over the whole image, indicating the probability that each region harbors cancer cells. ● Trained on ~23M image patches extracted from gigapixel slide images of normal (n=127) and cancerous (n=88) tissues from the Camelyon16 dataset.
  26. Metastatic cell detection results are encouraging. Tumor localization score (FROC) of 0.89 vs. 0.73 for a pathologist with unlimited time (92% sensitivity at 8 false positives per slide vs. 73% sensitivity at 0 false positives per slide). Slide-level classification AUC of 0.96 (on par with a pathologist). [Images: original slide, ground-truth mask, predicted regions showing cancer cells.] Read more at
  27. Deep learning in genomics: a new application area. Example papers: Alipanahi et al. (2015); Park & Kellis (2015); Xiong et al. (2015); Zhou & Troyanskaya (2015); Angermueller et al. (2016). Variant calling is a key challenge in genomics due to the complex errors of NGS technologies; current error rates vary from <1% for germline SNPs to >25% for somatic indels. Deep learning to call variants, goals: (1) replace the statistical machinery with a single deep learning model; (2) state-of-the-art or better performance; (3) generalize to new technologies. Start with the human germline: use the germline case to figure out deep learning data representations and models, then extend the approach to somatic mutations, non-human species, etc.
  28. Where should we get started applying deep learning to genetics and genomics problems? Must-haves for deep learning: ● Lots of data: >50k examples, >1M ideal. ● High-quality input data and labels for training. ● A mapping from data => label that is unknown but certainly exists. ● High-quality previous efforts, so we know deep learning is key (i.e., the problem is hard to solve with classical statistical/ML approaches). A good fit: SNP and indel calling from NGS data.
  29. Figuring out the true genome sequence from NGS data is a computational and statistical challenge.
True genome sequence: 3 billion bases in 23 contiguous chunks (chromosomes), e.g.:
.......... cttgggttga tattgtcttg gaacatggag gttgtgtcac cgtaatggca caggacaaac cgactgtcga catagagctg gttacaacaa cagtcagcaa catggcggag gtaagatcct actgctatga ggcatcaata tcagacatgg cttcggacag ..........
Actual sequencer output: ~1 billion ~100-basepair DNA reads (30x coverage):
Read1: cttgggttgatattgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaacc
Read2: gatattgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaaccgactgtcg
Read3: tggaacatggaggttgtgtcaccgtaatggcacaggacaaaccgactgtcgacatagagct
Read4: ggttgtgtcaccgtaatggcacaggacaaaccgactgtcgacatagagctggttactgtcg
....
Read 1,000,000,000: ....caactgtcgacatagagctggttactgtcgacatagagctggtt
Step 1: align reads to a reference genome:
Reference: ...ttgtcttggaacatggaggttgtgtcaccgtaatggcacaggacaaacc...
Read1:     ...ttgtcttggaacatggaggttgtgtgaccgtaatggcacaggacaaacc
Read2:     ...ttgtcttggaacatggaggttgtgtgaccgtaatggcacaggacaaacc...
Read3:               tggaacatggaggttgtgtgaccgtaatggcacaggacaaacc...
Step 2: infer the true genomic sequence(s). Here the reads match the reference everywhere except one shared position, suggesting a true variant.
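The two steps can be sketched with a toy consensus caller. Assumptions: the reads are already aligned to the reference (real aligners handle step 1), there are no indels, and step 2 is a simple per-column majority vote instead of a statistical model:

```python
# Toy pileup + majority-vote consensus over already-aligned reads.
# Reference and reads are hypothetical short sequences, not real data.
from collections import Counter

reference = "ACGTACGT"
# (0-based alignment offset, read sequence); two reads support a C->G change.
reads = [(0, "ACGTAGGT"), (2, "GTAGGT"), (4, "ACGT")]

def pileup(ref, aligned_reads):
    # One base-count column per reference position.
    columns = [Counter() for _ in ref]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            columns[offset + i][base] += 1
    return columns

def consensus(ref, aligned_reads):
    # Majority base where covered; fall back to the reference elsewhere.
    return "".join(col.most_common(1)[0][0] if col else ref[i]
                   for i, col in enumerate(pileup(ref, aligned_reads)))

print(consensus(reference, reads))  # -> ACGTAGGT (variant at position 5)
```

Real callers must weigh base qualities, correlated errors, and diploid genotypes rather than taking a raw majority; that is exactly the machinery the following slides replace with a learned model.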
  30. A complex error process makes it difficult to call variants accurately in NGS data. Errors come from many uncontrollable sources: the quality of the sample DNA; the protocol used to prepare the sample for the sequencer; physical properties of the instrument itself; and data-processing artifacts. Errors are correlated among the reads. Existing statistical techniques work okay: the most accurate variant callers, such as the GATK, use multiple techniques (logistic regression, hidden Markov models, Bayesian inference, Gaussian mixture models), all of which make approximations known to be invalid. But they have well-known drawbacks: they rely on hand-crafted features and hand-optimized parameters, require years of work by domain experts, are specialized to a specific prep, sequencer, and tool chain, and are hard to generalize to new technologies.
  31. DeepVariant: recasting variant calling for deep learning.
Step 1: find candidate variant sites in the aligned reads.
Step 2: at each candidate site, create a pileup image from the reference, read bases, base qualities, and other features, e.g.:
Ref:   ACGTGCCCCAAACGTGATGATC
Reads: ACGTGCCCCAACC---------
       --GTGCCCCAAACGT-------
       ----GCCCCAAACGTGA-----
       -------CCAACCGTGATG---
       --------CAAACGTGATGATC
       ----------ACCGTGATGATC
Step 3: a CNN evaluates the image and emits genotype likelihoods, e.g. hom-ref 0.01, het 0.95, hom-alt 0.04 => heterozygous variant call.
  32. Recasting variant calling for deep learning: encode reads and the reference genome as images. The encoding is roughly: red = {A, C, G, T}; green = {quality score}; blue = {read strand}; alpha = {matches reference genome}. [Example images: true SNPs, true indels, false variants.]
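A sketch of this kind of per-base pixel encoding, with hypothetical channel values (the slide only says the encoding is "roughly" this; DeepVariant's exact mapping is not reproduced here):

```python
# Encode one aligned base as an (R, G, B, A) pixel, following the rough
# channel scheme above. All numeric mappings are illustrative assumptions.

BASE_RED = {"A": 250, "C": 30, "G": 180, "T": 100}  # hypothetical per-base values

def encode_base(base, quality, is_reverse_strand, matches_ref):
    red = BASE_RED[base]                      # red: which base
    green = min(quality, 40) * 255 // 40      # green: capped Phred quality
    blue = 240 if is_reverse_strand else 70   # blue: read strand
    alpha = 255 if matches_ref else 100       # alpha: matches reference?
    return (red, green, blue, alpha)

# One pixel row for a short read "ACGT" with a low-quality mismatch at the G.
row = [encode_base(b, q, False, m)
       for b, q, m in zip("ACGT", [30, 38, 12, 40], [True, True, False, True])]
print(row[2])  # the mismatching low-quality G -> (180, 76, 70, 100)
```

Stacking one such row per read at a candidate site yields the pileup image the CNN classifies.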
  33. Recasting variant calling for deep learning: use Inception v3 to call the variant genotype (Szegedy et al. 2015).
  34. Genome in a Bottle provides ground-truth human variation (Zook et al. 2014) ● Extensive sequencing by orthogonal methods of a single human (NA12878) ● Stringent criteria identify "callable genomic regions" and true variants ○ ~3.7M regions (covering 80% of the genome) identified as callable ○ ~2.8M single-nucleotide polymorphisms ○ ~350k small insertions/deletions ● Train and test on biological replicates of NA12878 ○ Each germline WGS dataset provides ~3.7M labeled training variants ○ 2.1M true heterozygous variants ○ 1.3M true homozygous variants ○ 215k false-positive variants
  35. DeepVariant works well in our in-house evaluations. Methodology: train the model on training chromosomes, call variants, and evaluate on held-out chromosomes. Result: outperforms GATK on human data.
  36. DeepVariant learns an accurate model of the likelihood function P(genotype | reads). [Calibration chart: observed P(error) vs. estimated P(error), Phred-scaled as -10 log10(P(error)), for DeepVariant and GATK against the perfect-calibration line.] This is the calibration for heterozygous SNPs, but other variant types and genotype states are similar.
  37. DeepVariant learns an accurate model of the likelihood function P(genotype | reads) ● Variants should be correct at the assigned confidence rate to be well calibrated. ● Genotype likelihoods are the critical input to genomic analyses such as imputation, de novo mutation detection, and association. ● Most callers are overconfident in their likelihoods.
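The calibration check behind these slides can be sketched numerically. The Phred scale is Q = -10 log10 P(error), and a caller is well calibrated when calls assigned P(error) = p are wrong about a fraction p of the time. The call set below is a toy with hypothetical numbers, not real caller output:

```python
# Compare a caller's claimed error probabilities with observed error rates.
import math

def phred(p_error):
    # Phred-scaled quality: Q20 means a claimed 1% chance of error.
    return -10 * math.log10(p_error)

# Toy calls: (assigned P(error), was the call actually wrong?).
calls = ([(0.01, False)] * 99 + [(0.01, True)] +   # Q20 bucket: 1/100 wrong
         [(0.1, False)] * 7 + [(0.1, True)] * 3)   # Q10 bucket: 3/10 wrong

def observed_error(calls, p):
    bucket = [wrong for p_err, wrong in calls if p_err == p]
    return sum(bucket) / len(bucket)

print(round(phred(0.01)))           # -> 20
print(observed_error(calls, 0.01))  # -> 0.01 (well calibrated: claimed 1%)
print(observed_error(calls, 0.1))   # -> 0.3  (overconfident: claimed 10%)
```

Plotting observed vs. claimed P(error) per bucket gives exactly the calibration curves on the slide; the overconfident bucket falls off the perfect-calibration line.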
  38. After lots of internal testing, we entered the public FDA-sponsored PrecisionFDA competition in April 2016. [Unblinded training sample; blinded evaluation sample.]
  39. DeepVariant won an award at the 2016 PrecisionFDA competition. [Chart: F-measures of 99.85 and 98.91; v2 => v3 truth set for the unblinded sample; unblinded => blinded sample with the v3 truth set.] F-measure is the harmonic mean of precision and recall.
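For reference, the F-measure quoted above is the harmonic mean F1 = 2PR/(P + R). A sketch with hypothetical precision/recall values (not the competition's actual components):

```python
# F-measure (F1): harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance: high precision cannot hide poor recall.
print(round(f1(0.998, 0.998), 3))  # balanced        -> 0.998
print(round(f1(0.998, 0.900), 3))  # recall lags     -> 0.946
```

This is why a single F-measure is a reasonable leaderboard score for variant calling: it forces callers to be both precise and sensitive at once.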
  40. A trained DeepVariant model encodes everything needed to call variants, enabling us to apply it in novel contexts. (F1 is the harmonic mean of precision and recall.)
You can train on one genome build and call variants on another:
Training data    Evaluation data    F1
b37 chr1-19      b38 chr20-22       99.45%
b38 chr1-19      b38 chr20-22       99.47%
Calling variants on b38 with a model trained on either b37 or b38 gives effectively identical quality, meaning we can call on a genome build without needing all of the metadata mapped to that build.
You can train on human data and call mouse data:
Training data    Evaluation data    F1
Human chr1-19    Mouse chr18-19     98.29%
Mouse chr1-17    Mouse chr18-19     97.84%
This is robust to protocol differences (human: 50x 2x148bp HiSeq 2500; mouse: 27x 2x100bp GAII) and leverages the larger and better truth data on humans (~5M variants vs. ~700K in mouse) to call variants in other organisms.
  41. DeepVariant can learn to call variants in many sequencing technologies (F1 metric):
Dataset                    DeepVariant    Comparator    Comparator caller
10X Chromium 75x WGS       99.3%          98.2%         Long Ranger
Ion AmpliSeq exome         96.9%          97.3% [1]     TVC
PacBio raw reads 40x WGS   92.7%          56.1% [2]     samtools
SOLID SE 85x WGS           86.4%          78.8% [3]     GATK
Illumina TruSeq exome      96.1%          95.4%         ensemble
[1] Uses four lanes of data vs. one for DeepVariant; [2] No standard caller exists for this technology for human samples; [3] Old technology without any maintained variant callers.
  42. DeepVariant can learn to call variants at a range of input sequencing depths. [Charts: sensitivity and precision vs. sequencing depth for GATK and for DeepVariant trained on 35-45x, 15-25x, and 4-45x data.]
  43. DeepVariant outperforms GATK on low-coverage samples (training on chromosomes 1-19, evaluation on chromosomes 20-22).
  44. DeepVariant conclusions ● Deep learning is a remarkably powerful and flexible technology. ● DeepVariant is an example of how to apply deep learning to a genomics problem. ● Equivalent or better performance than current variant-calling tools. ● Works for many (any?) sequencing technologies. ● Run now at ● Open-sourced version coming soon! ● Read more in our bioRxiv paper
  45. Google confidential │ Do not distribute. Google's Data Research... [Timeline 2002-2016: GFS, MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, Millwheel, PubSub, F1, TensorFlow]
  46. ...are the technologies used in DeepVariant... [same timeline]
  47. ...which are available to you today on GCP. [Timeline 2002-2016 of corresponding GCP products: Cloud Storage, BigQuery, BigTable, DataStore, DataProc, DataFlow, PubSub, ML]
  48. Sharing our tools with researchers and developers around the world: TensorFlow, released in Nov. 2015, is the #1 repository in the "machine learning" category on GitHub.
  49. Build What's Next. Thank you! Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience. Teams: Brain, DeepMind, Cloud Healthcare, Verily, Calico