VariantSpark - a Spark library for genomics

VariantSpark is a custom Apache Spark library for genomic data. It provides a custom wide random forest machine learning algorithm, designed for workloads with millions of features.

Published in: Science
  1. VariantSpark: a library for Genomics. Transformational Bioinformatics | Denis C. Bauer | @allPowerde. Lynn Langit
  2. “Genomical” Big Data
  3. Transformational Bioinformatics Team and collaborators (in slide order): Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson, Adrian White, Mia Champion, Gaetan Burgio, David Levy, Dan Andrews, Kaitao Lai, Kaylene Simpson, Iva Nikolic, Ian Blair, Kelly Williams. Slide labels: Collaborators, News, Software.
  4. BMC Genomics 2015, 16:1052. PMID: 26651996 (IF=4). Cited 4. VariantSpark | Denis C. Bauer @allPowerde
  5. Unsupervised ML: K-Means (www.cloudaccess.eu). Matrix of 1000 x 40 million variants*; k-means; predict super population (14 ethnic groups and 4 super populations). * VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants.
  6. Comparing K-Means Implementations. [Bar chart: time in seconds (axis 0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing). Bar labels in minutes, as listed: 103, 75, 29, 28, 18, 4.]
  7. Supervised ML: Wide Random Forests
  8. Genomic Research Workflow (https://www.projectmine.com/about/). Focus
  9. Performance: Faster and More Accurate. VariantSpark is the only method to scale to 100% of the genome.
  10. Scaling to 50 M variables and 10 K samples. 100K trees: 5 – 50h (AWS: ~$215.50); 100K trees: 200 – 2000h (AWS: ~$8620.00). Setup: YARN cluster (12 workers); 16 x Intel Xeon E5-2660 @ 2.20GHz CPU; 128 GB of RAM; Spark 1.6.1 on YARN; 128 executors; 6GB / executor (0.75TB); synthetic dataset (mtry = 0.25). Chart labels: Whole Genome Range, GWAS Range.
  11. Databricks & VariantSpark via a Jupyter notebook
  12. Solving Important Questions… Cancer genomics?
  13. DEMO: Who is a Hipster?
  14. VariantSpark & Databricks Notebooks. Quickly access a managed Spark cluster (AWS EC2 / spot instances). Link to your data and perform whole genome analysis in real-time. Jupyter Notebook
  15. Joint-loci association test. Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2])). Label = 1 if Hipster-Index > 10. Figure labels: Genomic profile, Label, Samples (n=2500).
  16. Try it out: VariantSpark Notebook https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
  17. VariantSpark: a library for Genomics. Lynn Langit
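
The labelling rule on slide 15 can be sketched in a few lines of Python. This is only an illustration, not the presenters' code: it assumes each genotype GT[...] is encoded as the number of alternate alleles (0, 1 or 2), which the slides do not state, and the function names and sample profile are hypothetical.

```python
# Sketch of the slide-15 labelling rule. Assumption: each genotype is the
# number of alternate alleles (0, 1 or 2); the slides do not state the encoding.


def hipster_index(gt):
    """Hipster-Index from slide 15; `gt` maps a locus name to its genotype."""
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def hipster_label(gt, threshold=10):
    """Label = 1 if Hipster-Index > threshold, otherwise 0."""
    return 1 if hipster_index(gt) > threshold else 0


# Hypothetical sample profile, for illustration only.
sample = {"B6": 2, "R1": 1, "C2": 0, "B2": 1}
print(hipster_index(sample), hipster_label(sample))  # prints: 11.0 1
```

Under this encoding the index ranges from 3.5 (all genotypes 0) to 21.5 (all genotypes 2). Because the loci enter as products, the label depends on combinations of loci rather than on any single one, which is the kind of signal a joint-loci association test is meant to recover.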
