Bioinformatics Data Pipelines built by CSIRO on AWS

Bioinformatics (genomics) data pipelines built by CSIRO Australia on the AWS Cloud

Published in: Science

  1. Cancer Genomics Data Pipelines. Lynn & Samantha Langit. CSIRO Bioinformatics / Australia. June 2017, Oslo.
  2. 3 billion data points per patient DNA sample. Up to 25% of the population could be sequenced by 2025.
  3. Two Perspectives: Bioinformatics Research (insight, reproducibility); Cloud Architecture (speed, low cost, simplicity).
  4. Cloud Data Pipeline Pattern: Problem (define business problem); Data (quality, quantity); Candidate Technologies (ingest, ETL, biz analytics, ML, visualization); Build MVPs (iterate, learn); Assemble Pipeline (validate each section, test at scale).
  5. Genomic Sequencing Results: CRISPR-Cas9 for molecular engineering technology enables the accurate editing of genomes for researchers. It… • Pattern-matching unique sequences of DNA • Huge demand for large-scale computation • Time-critical dimension to compute • NIH-approved for human health • Could revolutionize cancer treatments
  6. Serverless Lambda Architecture Pattern: Lambda function 1; Lambda function 2; Lambda function 3; buckets with objects; DynamoDB; API Gateway; Users.
  7. CSIRO: Commonwealth Scientific and Industrial Research Organisation
  8. GT-Scan2 Demo
  9. Scale Genomic Analysis: GWAS = genome-wide association studies • Analysis on large-cohort sequencing data or imputed SNP array data • Clustering on genomic profiles to stratify large-cohort genomic data • Viewing datasets with millions of features
  10. Cloud Data Pipeline Pattern: Problem (define business problem); Data (quality, quantity); Candidate Technologies (ingest, ETL, biz analytics, ML, visualization); Build MVPs (iterate, learn); Assemble Pipeline (validate each section, test at scale).
  11. Genomics (ML) Pipeline Pattern
  12. What is CSIRO’s solution? For scale at reasonable cost: use Apache Hadoop. For scale at speed: use Apache Spark for Hadoop. For usability in bioinformatics: create a domain-specific API (OSS library). For global use: leverage cloud pipeline patterns.
  13. GWAS Analysis with Variant-Spark: on-premise Hadoop cluster with Apache Spark; genomics analysts; corporate data center.
  14. What is Apache Spark?
  15. What is variant-spark? Demo
  16. 80% faster than ADAM; 90% faster than R; 90% faster than Python.
  17. VariantSpark: uses Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently • Analyzes 3,000 samples with 80 million features in < 30 minutes • Enables real-time diagnosis by finding similar patients • Contributes to motor neuron disease (ALS) research in Australia
  18. Data Prep; Statistics; Probabilistic Algorithms; Data Viz; Machine Learning…
  19. Spark ML Classification Algorithms: Wide Random Forest; Ensemble of Decision Trees; Logistic Regression; variant-spark; other libraries.
  20. OSS Library: variant-spark for all
  21. Is it… • usable? performant? • extendable? (clean code) • using the best language (Scala)? • using the ‘best version’ of Spark? • using a version of wide random forests that is understandable?
  22. How best to deploy cloud Hadoop? IaaS: EC2 instances with Apache Hadoop, Apache Spark, more… PaaS: Amazon EMR (Elastic MapReduce) Hadoop cluster. SaaS: vendor-managed, e.g. Databricks with Jupyter Notebooks.
  23. What is Databricks?
  24. DEMO: Jupyter Notebooks
  25. Variant-Spark and Databricks Demo
  26. Solving Important Questions… Cancer Genomics?
  27. DEMO: Who is a Hipster?
  28. AWS EC2 Spot Instances
  29. GWAS Analysis with Variant-Spark: EC2 Hadoop cluster with Apache Spark; genomics analysts; Availability Zone; 1000 Genomes GWAS input; Spot EC2 Hadoop worker instances; EC2 Hadoop instances.
  30. Cloud Data Pipeline Pattern (Problem; Data; Candidate Technologies; Build MVPs; Assemble Pipeline). Analyze GWAS -> S3/Hadoop. Ingest: S3 -> Databricks DBFS; ETL: Apache Spark; Analyze: Variant-Spark ML; Viz: Notebook (SQL, R or Python). Pipeline type: SaaS.
  31. Cloud Data Pipeline Pattern (Problem; Data; Candidate Technologies; Build MVPs; Assemble Pipeline). 1. Scan vcf -> S3/DynamoDB. Ingest: S3; ETL: Lambda; Analyze: Lambda; Viz: Lambda/API Gateway. Pipeline type: Serverless. 2. Analyze GWAS -> S3/Hadoop. Ingest: S3 -> Databricks DBFS; ETL: Apache Spark; Analyze: Variant-Spark ML; Viz: Notebook (SQL, R or Python). Pipeline type: SaaS.
  32. Modern Big Data Pipelines. Problem #1: Scan. Solution: Serverless Cloud Pipeline. Problem #2: Analyze. Solution: SaaS Cloud ML Pipeline.
  33. Cancer Genomics Data Pipelines. Lynn & Samantha Langit. CSIRO Bioinformatics & variant-spark. June 2017, Oslo.
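
The serverless scan pipeline on slides 6 and 31 (objects in a bucket, a chain of Lambda functions, results in DynamoDB served through API Gateway) can be sketched as three chained handlers. This is a minimal illustration only, not GT-Scan2's code: the handler names, the event fields, and the in-memory dictionaries standing in for the S3 bucket and the DynamoDB table are all assumptions. The scan step also assumes the common SpCas9 convention of a 20-nucleotide target followed by an NGG PAM, which the slides do not spell out.

```python
import re

# Hypothetical in-memory stand-ins for the S3 bucket and DynamoDB table.
bucket = {}
table = {}

# Assumed site pattern: 20-nt target followed by an NGG PAM.
# The lookahead lets overlapping candidate sites all be reported.
SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def ingest_handler(event):
    """Lambda 1 (hypothetical): store the submitted sequence as an object."""
    key = event["job_id"]
    bucket[key] = event["sequence"].upper()
    return {"job_id": key}


def scan_handler(event):
    """Lambda 2 (hypothetical): pattern-match candidate sites and record them."""
    seq = bucket[event["job_id"]]
    sites = [
        {"start": m.start(), "target": m.group(1), "pam": m.group(2)}
        for m in SITE.finditer(seq)
    ]
    table[event["job_id"]] = sites
    return {"job_id": event["job_id"], "count": len(sites)}


def query_handler(event):
    """Lambda 3 (hypothetical, behind API Gateway): return stored results."""
    return table.get(event["job_id"], [])


if __name__ == "__main__":
    job = ingest_handler({"job_id": "demo", "sequence": "a" * 20 + "tggcccc"})
    print(scan_handler(job))
    print(query_handler(job))
```

In a real deployment each handler would be triggered by an event (an object upload, a stream record, an HTTP request) rather than called directly; the direct calls here only show how data moves between the three stages.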
