
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - StampedeCon AI Summit 2017



This talk dives into the technical details of machine learning model development and implementation, and the value it brings to the Monsanto breeding pipeline. We genotype over 100 million seeds a year in order to save field resources and shorten the product development cycle. Automation and high-throughput production in the lab are therefore key to R&D success. The predictive model, developed in house, uses a random forest ensemble approach with additional features derived from a Gaussian mixture model. The results show over 95% accuracy with less than 1% false positives/negatives. The model is highly generalizable, having been trained and tested on over 10 million data points. It also offers a probabilistic approach that presents genotypes in a more meaningful way and helps enhance downstream genomics analyses. The talk is aimed at people working in breeding, genetics, and molecular biology, and at data scientists interested in practical applications.
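As a rough illustration of what "features derived from a Gaussian mixture model" could look like, the sketch below fits a three-component 1-D mixture with EM and returns per-cluster posterior probabilities for a sample. It is not the model from the talk: the use of the FAM/VIC signal angle as input, the function names (`signal_angle`, `posteriors`, `fit_gmm_1d`), the starting values, and the toy intensities are all assumptions made for this example. In a full pipeline these posteriors would presumably be appended to the other features passed to the random forest, which is omitted here.

```python
import math

def signal_angle(fam, vic):
    """Angle of a (FAM, VIC) intensity pair; an assumed 1-D summary feature."""
    return math.atan2(vic, fam)

def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given value x.

    Computed in log space so that very small densities do not underflow.
    """
    logs = [
        math.log(max(w, 1e-300))
        - 0.5 * math.log(2.0 * math.pi * v)
        - (x - m) ** 2 / (2.0 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]

def fit_gmm_1d(xs, init_means, n_iter=50, min_var=1e-4):
    """Fit a 1-D Gaussian mixture with plain EM (illustrative, not tuned)."""
    k = len(init_means)
    weights = [1.0 / k] * k
    means = list(init_means)
    variances = [0.05] * k  # assumed starting variance
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = [posteriors(x, weights, means, variances) for x in xs]
        # M-step: re-estimate weight, mean, and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances

# Toy (FAM, VIC) intensities: made-up values, not data from the talk.
intensities = [
    (1.00, 0.05), (0.90, 0.10), (1.10, 0.08), (0.95, 0.02),  # FAM-dominant
    (0.70, 0.70), (0.60, 0.65), (0.80, 0.75), (0.75, 0.70),  # mixed signal
    (0.05, 1.00), (0.10, 0.90), (0.08, 1.10), (0.03, 0.95),  # VIC-dominant
]
angles = [signal_angle(fam, vic) for fam, vic in intensities]

# Start the three components at 0, 45, and 90 degrees (an assumption).
weights, means, variances = fit_gmm_1d(angles, [0.0, math.pi / 4, math.pi / 2])

# Posterior cluster memberships for one sample, usable as extra features.
gmm_features = posteriors(signal_angle(1.0, 0.05), weights, means, variances)
print([round(p, 3) for p in gmm_features])
```

The posteriors sum to 1 for every sample, so a point lying between two clusters gets a split membership rather than a hard label, which is one way to carry uncertainty into a downstream classifier.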

Published in: Data & Analytics


  1. Nan Newton, Data Scientist, Global IT Analytics. StampedeCon 2017, St. Louis, MO. Novel Semi-Supervised Probabilistic ML Approach to SNP Variant Calling
  2. Today's Digital Plant Breeding Is Powered by Our Knowledge of Genetics & Advanced Analytics. Diagram labels (LAB / FIELD): DNA Analysis Through Advanced ML; Automated Seed Chipping; Performance Evaluation; Superior Seeds Selected
  3. Single Nucleotide Polymorphism (SNP) variants detect seeds with desired traits. Alleles shown: C, T
  4. SNPs, or molecular markers, serve as signposts. Monsanto Company Confidential
  5. Genotype-phenotype association helps breeders reduce spending on field resources by selecting only seeds with desired phenotypes. Parent generation: P1 (CC), P2 (TT). F1 generation: CT, CT, CT, CT. F2 generation: CC, CT, CT, TT. Genotypes: Homozygous C, Heterozygous, Homozygous T. Goal: predict genotypes for any seed
  6. Genotype detection through molecular biology knowledge in high-throughput genotyping labs. Steps: seeds are sent to the lab; part of each seed is chipped; a DNA molecule is obtained for each seed; the double-helix DNA is uncoiled; fluorophores (FAM, VIC) are added; many DNA copies are made to generate a stronger fluorophore signal. Example sequences: allele1 A A T C A T G T, allele2 A A T C A T G T; allele1 A A C T A C G A, allele2 A A C T A C G A
  7. Genotype calls through fluorescence signals and control information. Plot legend: Controls, HOM_FAM, HET, HOM_VIC, MISSING
  8. Plate-to-plate variations in cluster behavior, control performance, and intensity distribution
  9. Impute MISSING labels using the k-Nearest Neighbor algorithm. Plot annotations: MISSING, Less Confident
  10. Predict FAIL samples using a model trained at another lab. Plot annotation: MISSING
  11. Semi-supervised machine learning with Random Forest. Input: normalized fluorescence intensities. Feature creation: fluorescence-based features, controls-based features, position-based features, unsupervised-clustering-based features. Outputs: predict probabilistic genotypes; predict FAIL samples
  12. Probabilistic genotype prediction. Plot annotation: LOWER CONFIDENCE
  13. Model scalability and extensibility: AWS cloud integration with the enterprise digital architecture. Input data from any database (Breeding, Biotech, Supply Chain); customized training models; predictive model execution
  15. Better data… better decisions. Probabilistic impact on downstream genetic analytics: Linkage Disequilibrium, Genotype-Phenotype Association, Haplotype Mapping, Genetic Mapping. Illustration: genotypes aa, Aa, AA; phenotypes Short, Tall
  16. Further Improvement. Acknowledgement: Jeff Pobst, Bryan Dannowitz, Chris Schlosberg, Shane Ryerson
  17. Further Improvement
  18. Acknowledgement: Molecular Breeding Technology Global Breeding Cloud Analytics Global IT Analytics Products & Engineering Lab Platform Product 360 Data Asset Data Science Center of Excellence
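Slide 9 mentions imputing MISSING labels with the k-Nearest Neighbor algorithm and marks some calls as less confident. A minimal sketch of that idea follows, assuming each sample is a (FAM, VIC) intensity pair and that confidence is simply the share of the k nearest labeled neighbors that agree. The function name `knn_impute`, the choice of k, the Euclidean distance, and the toy data are hypothetical and not taken from the talk.

```python
import math
from collections import Counter

def knn_impute(labeled, point, k=3):
    """Return (label, confidence) for an unlabeled (FAM, VIC) point.

    labeled: list of ((fam, vic), label) pairs with known genotype calls.
    confidence: fraction of the k nearest neighbors that share the winning label.
    """
    def distance(item):
        (fam, vic), _ = item
        return math.hypot(fam - point[0], vic - point[1])

    nearest = sorted(labeled, key=distance)[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, most_common keeps first-seen order, so the label of
    # the closer neighbor wins.
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)

# Toy labeled samples: made-up intensities, not data from the talk.
labeled_samples = [
    ((1.00, 0.05), "HOM_FAM"), ((0.90, 0.10), "HOM_FAM"),
    ((1.10, 0.08), "HOM_FAM"), ((0.95, 0.02), "HOM_FAM"),
    ((0.70, 0.70), "HET"), ((0.60, 0.65), "HET"),
    ((0.80, 0.75), "HET"), ((0.75, 0.70), "HET"),
    ((0.05, 1.00), "HOM_VIC"), ((0.10, 0.90), "HOM_VIC"),
    ((0.08, 1.10), "HOM_VIC"), ((0.03, 0.95), "HOM_VIC"),
]

# A point deep inside a cluster: all three neighbors agree.
clear_call = knn_impute(labeled_samples, (0.92, 0.06))

# A point between two clusters: the neighbors disagree, so confidence drops.
borderline_call = knn_impute(labeled_samples, (0.85, 0.40))

print(clear_call, borderline_call)
```

The borderline point sits between the HOM_FAM and HET clusters, so its neighbors split their votes. That is the kind of imputed call a plot might flag as less confident.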