VariantSpark on AWS

Slides from the Broad Institute SoftEng retreat poster session: moving the CSIRO bioinformatics team's VariantSpark algorithm to AWS using EMR or EKS

Published in: Technology

  1. Lynn Langit for CSIRO Bioinformatics: VariantSpark on AWS
  2. Transformational Bioinformatics Team (www.csiro.au): Denis Bauer, PhD; Oscar Luo, PhD; Rob Dunne, PhD; Piotr Szul; Aidan O’Brien; Laurence Wilson, PhD; Adrian White; Andy Hindmarch; David Levy; Dan Andrews; Kaitao Lai, PhD; Arash Bayat, PhD; John Hildebrandt; Mia Chapman; Ian Blair; Kelly Williams; Jules Damji; Gaetan Burgio; Lynn Langit; Jim Counts; Matthew Jones; Natalie Twine, PhD; Prabha Pillay. Denis C. Bauer | @allPowerde
  3. BMC Genomics 2015, 16:1052; PMID: 26651996 (IF=4); cited 4 times
  4. VariantSpark works with VCF Data
  5. Supervised ML: Wide Random Forests
  6. Custom Splits: Horizontal, Vertical. Gini Scoring: Local, Global
  7. Performance – Faster and More Accurate. VariantSpark can scale to 100% of the genome. [Chart axes: Accuracy (low to high), Speed (low to high)]
  8. Scaling to 50M variables & 10K samples
     • 100K trees: 5–50 h, AWS: ~$215.50
     • 100K trees: 200–2000 h, AWS: ~$8620.00
     • Chart labels: Whole Genome Range, GWAS Range
     • YARN cluster, 12 workers, 16 x Intel CPUs (Xeon E5-2660 @ 2.20 GHz), 128 GB RAM
     • Spark 1.6.1, 128 executors, 6 GB / executor (0.75 TB total)
     • Synthetic dataset
  9. Building a Cloud Data Pipeline. Spark • IaaS, PaaS, SaaS Vendors • Alibaba, AWS, GCP…
  10. True Cloud Costs
  11. Cloud Compute Services Choices
  12. Moving to the Cloud – SaaS / Databricks
  13. Synthetic Phenotype: Who is a Bondi Hipster?
  14. Example Notebook: Databricks
  15. Hello VariantSpark via Hipster-Index BUILDS Community
  16. Try it out (SaaS): VariantSpark Notebook
  17. Transformational Bioinformatics | @allPowerde
  18. Pipeline Pattern: Spark Server Cluster, Jupyter Notebook, Data Lake
  19. Configuration Challenges
  20. Try it out: VariantSpark on AWS EMR
  21. Try it out: VariantSpark on AWS EKS
  22. Apache Spark 2.3+ with Kubernetes
  23. Try it out: VariantSpark on AWS EKS
  24. Try it out: VariantSpark on AWS EKS
  25. CSIRO Team Trains Other Researchers. Team creates reproducible cloud environments:
     • AWS CloudFormation Templates for EMR
     • Setup screencasts for Databricks and AWS
     • Scripts and recommended parameters
  26. Next Steps
     • VariantSpark on GCP
     • Use GCP Dataproc – compare to AWS EMR
     • Use GCP GKE – compare to AWS EKS (K8s)
     • VariantSpark on Terra.bio
     • First optimize container for GCP raw compute
     • Write WDL for VariantSpark tool/workflow
     • Publish on Dockstore as Tool/Workflow
     • Publish example Jupyter notebooks
     • Publish example Terra.bio VariantSpark workflow
  27. Bioinformatics Tools on AWS. Lynn Langit for CSIRO Bioinformatics
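
The AWS figures on slide 8 are consistent with one flat hourly cluster rate applied to the upper end of each runtime range: $215.50 / 50 h and $8620.00 / 2000 h both come to $4.31 per hour. Likewise, 128 executors at 6 GB each add up to the 0.75 TB quoted. The slides do not state the hourly rate, so the sketch below treats it as an inferred assumption; the function names are illustrative only.

```python
# Hourly rate inferred from slide 8 ($215.50 for 50 h); the slides do not
# state it explicitly, so treat this as an assumption.
ASSUMED_HOURLY_RATE_USD = 215.50 / 50


def estimate_cost_usd(hours, rate=ASSUMED_HOURLY_RATE_USD):
    """Return the estimated cluster cost in USD for a run of `hours`."""
    return round(hours * rate, 2)


def total_executor_memory_tb(executors=128, gb_per_executor=6):
    """Return the combined executor memory in TB (1 TB = 1024 GB)."""
    return executors * gb_per_executor / 1024


if __name__ == "__main__":
    # Lower and upper bounds of the two runtime ranges quoted on slide 8.
    for hours in (5, 50, 200, 2000):
        print(f"{hours:>5} h -> ~${estimate_cost_usd(hours):.2f}")
    print(f"Executor memory: {total_executor_memory_tb()} TB")
```

Under this assumed rate, the quoted AWS costs correspond to the worst case of each range; the lower ends (5 h and 200 h) would cost about a tenth as much.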

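Slide 6 names Gini scoring as the criterion for choosing splits. As a rough illustration, here is a minimal sketch of the textbook Gini impurity for one candidate split. It is not VariantSpark's implementation, and the local/global scoring distinction from the slide is not modelled. It assumes genotypes are encoded as alternate-allele counts (0, 1, 2) and phenotypes as class labels; the function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(genotypes, labels, threshold):
    """Weighted Gini impurity after splitting samples on genotype <= threshold."""
    n = len(labels)
    if n == 0:
        return 0.0
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


if __name__ == "__main__":
    # Toy data: one variant across four samples, with a binary phenotype.
    genotypes = [0, 0, 1, 2]
    phenotype = [0, 0, 1, 1]
    print("Impurity before split:", gini_impurity(phenotype))
    print("Impurity after split at 0:", split_gini(genotypes, phenotype, 0))
```

A random forest evaluates many such candidate splits per tree node and keeps the one with the lowest weighted impurity.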