Genomic Scale Big Data Pipelines

Deck from a talk at YOW Data in Sydney. It covers VariantSpark, a custom Apache Spark machine learning library, and GT-Scan2, which uses an AWS Lambda architecture for bioinformatics.

Published in: Science

  1. Dr. Denis Bauer & Lynn Langit: Genomic-scale Data Pipelines
  2. Transformational Bioinformatics | Denis C. Bauer | @allPowerde Transformational Bioinformatics Team Denis Bauer, PhD Oscar Luo, PhD Rob Dunne, PhD Piotr Szul Team Aidan O’Brien Laurence Wilson, PhD Adrian White Andy Hindmarch Collaborators David Levy News Software Dan Andrews Kaitao Lai, PhD Natalie Twine, PhD Arash Bayat John Hildebrandt Mia Chapman Ian Blair Kelly Williams Jules Damji Gaetan Burgio Lynn Langit
  3. Big Data in 2025… Petabytes? (Bar chart; categories: Astronomy, Twitter, YouTube; data labels: 1000, 17, 2000; axis 0 to 2500)
  4. Genome holds the blueprint for every cell
  5. It affects looks, disease risk, and behavior
  6. Genomic Big Data in 2025 - Exabytes (Bar chart; categories: Astronomy, Twitter, YouTube, Genomic; data labels: 1, 0.17, 2, 20; axis 0 to 25)
  7. VCF Data
  8. Genomic Research Workflow (https://www.projectmine.com/about/); figure label: Focus
  9. Finding the disease gene(s): Spot the variant that is… • common amongst all affected • absent in all unaffected* (* oversimplified). Figure labels: cases, controls, Gene1, Gene2
  10. Cloud Data Pipeline Pattern: Problem • Define business problem; Data • Quality • Quantity • Location; Candidate Technologies • Ingest • Clean • Analyze • Predict • Visualize; Build MVPs • Iterate • Learn • Assemble; Assemble Pipeline • Validate sections • Test at scale
  11. Cloud Data Pipeline Pattern: Candidate Technologies • Ingest • Clean • Analyze • Predict • Visualize; Build MVPs • Iterate • Learn • Assemble; Assemble Pipeline • Validate sections • Test at scale
  12. Machine Learning Pipeline Pattern
  13. What is CSIRO’s solution? For scale at reasonable cost: use Apache Hadoop. For scale at speed: use Apache Spark. For usability in bioinformatics: create a domain-specific ML API (library). For global use: leverage Cloud Pipeline Patterns.
  14. GWAS Analysis with Variant-Spark: On-premise Cluster with Apache Hadoop & Spark. Diagram labels: Genomics Analysts, CSIRO corporate data center
  15. Why Apache Spark?
  16. BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4) Cited 4
  17. Supervised ML: Wide Random Forests
  18. Solving Important Questions… Cancer genomics?
  19. DEMO: Who is a Hipster?
  20. VariantSpark & Databricks Notebook
  21. Performance – Faster and More Accurate: VariantSpark is the only method to scale to 100% of the genome. Chart axes: Accuracy (low to high), Speed (low to high)
  22. Scaling to 50 M variables and 10 K samples. 100K trees: 5 – 50 h, AWS: ~$215.50; 100K trees: 200 – 2000 h, AWS: ~$8620.00. Setup: • YARN cluster • 12 workers • 16 x Intel Xeon E5-2660 @ 2.20GHz CPU • 128 GB of RAM • Spark 1.6.1 on YARN • 128 executors • 6 GB / executor (0.75 TB) • Synthetic dataset. Chart labels: Whole Genome Range, GWAS Range
  23. Try it out: VariantSpark Notebook https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
  24. Future Directions for VariantSpark RF: Additional feature types (Unordered Categorical; For Scores - Continuous). Different feature ranges (Small and Big Inputs; For Gene Expression analysis)
  25. Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy. Editing does not work every time, e.g. only 7 in 10 embryos were mutation free. Aim: Develop a computational guidance framework to enable edits the first time, every time. Ma et al. Nature 2017* (* Controversy around the paper – stay tuned)
  26. Make process parallel and scalable • SPEED: Each search can be broken down into parallel tasks so that it takes only seconds • SCALE: Researchers might want to search the target for one gene or 100,000. Scalability + Agility =
  27. One of the first Serverless Applications in Research. Featured in This is My Architecture
  28. GT-Scan2
  29. Considering Services for GT-Scan2 • Use AWS Step Functions • Simplify workflow • Simplify task timeouts • Simplify task failures • Must evaluate costs • SNS vs. Step Functions
  30. Cloud Data Pipeline Pattern (Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline). 1. Analyze/GWAS: vcf -> S3/Hadoop; Ingest, ETL, Analyze, Viz: S3 -> Databricks DBFS, Apache Spark, Variant-Spark ML, Notebook (SQL, R or Python); Spark. 2. Search/GTScan2: S3/fastq -> DynamoDB; S3/fastq, bed; Ingest, ETL, Analyze, Viz: S3, Lambda, Lambda, Lambda/API Gateway; Serverless
  31. Spark Pipeline Pattern: Jupyter Notebook
  32. Serverless Architecture Pattern. Diagram components: Lambda function 1, Lambda function 2, Lambda function 3, buckets with objects, DynamoDB, API Gateway, Users, Step Functions
  33. Cloud Genomic Data Pipelines • Problem #1 – Analyze • Find the mutated genes • Solution: Spark-based machine learning • Problem #2 – Scan • Find the nucleotide (DNA letters) • Solution: Serverless
  34. Genomics Big Data Pipelines. Dr. Denis Bauer & Lynn Langit
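
The rule of thumb on slide 9 (spot the variant common amongst all affected and absent in all unaffected, which the slide itself marks as oversimplified) can be sketched in a few lines of Python. The genotype dictionary, sample names, and function name are hypothetical illustrations. This is a hard filter for intuition only, not VariantSpark's API, which the deck describes as a random forest approach.

```python
from typing import Dict, List, Set


def candidate_variants(
    genotypes: Dict[str, Set[str]],
    cases: List[str],
    controls: List[str],
) -> List[str]:
    """Return variants carried by every case and by no control.

    genotypes maps a variant ID to the set of samples carrying it
    (a hypothetical in-memory stand-in for parsed VCF data).
    """
    hits = []
    for variant, carriers in genotypes.items():
        in_all_cases = all(sample in carriers for sample in cases)
        in_no_controls = not any(sample in carriers for sample in controls)
        if in_all_cases and in_no_controls:
            hits.append(variant)
    return sorted(hits)


# Toy data (made up): one variant per gene, two cases, two controls.
toy = {
    "gene1_var": {"case1", "case2"},
    "gene2_var": {"case1", "control1"},
}
print(candidate_variants(toy, ["case1", "case2"], ["control1", "control2"]))
# prints ['gene1_var']
```

Real cohorts rarely separate this cleanly, which is one reason the deck moves on to supervised machine learning rather than an exact filter.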

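The SPEED and SCALE points on slide 26 (each search broken down into parallel tasks, for one gene or 100,000) can be illustrated with a small fan-out sketch. Everything here is an assumption made for illustration: a local thread pool stands in for the Lambda functions, and the target rule (a 20-nucleotide site followed by an NGG motif, forward strand only) is a simplified stand-in rather than GT-Scan2's actual search logic.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Assumption: a 20-nt target followed by an NGG motif, forward strand only.
# The lookahead lets overlapping candidate sites be reported.
TARGET = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")


def find_targets(sequence: str) -> List[Tuple[int, str]]:
    """Return (start, 20-mer) pairs for one sequence; one independent task."""
    return [(m.start(), m.group(1)) for m in TARGET.finditer(sequence.upper())]


def search_genes(
    genes: Dict[str, str], workers: int = 4
) -> Dict[str, List[Tuple[int, str]]]:
    """Fan the per-gene searches out to a pool and gather results by gene name."""
    names = list(genes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(find_targets, (genes[name] for name in names))
        return dict(zip(names, results))


# Toy sequences (made up): geneA has one site, geneB has none.
toy_genes = {"geneA": "A" * 20 + "TGG", "geneB": "C" * 25}
print(search_genes(toy_genes))
```

Because each gene is searched independently, the same function could in principle be invoked once per gene by a serverless runtime, which is the property the slide relies on.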