Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Denis Bauer, PhD
Oscar Luo, PhD
Rob Dunne, PhD
Piotr Szul
Team
Aidan O’Brien
Laurence Wilson, PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai, PhD
Natalie Twine, PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio
Lynn Langit
Big Data in 2025…Petabytes?
[Bar chart, axis 0 to 2500. Categories: Astronomy, Twitter, YouTube. Plotted values: 1000, 17, 2000]
Genome holds the blueprint for every cell
It affects looks, disease risk, and behavior
GENOMIC Big Data in 2025 - Exabytes
[Bar chart, axis 0 to 25. Categories: Astronomy, Twitter, YouTube, Genomic. Plotted values: 1, 0.17, 2, 20]
VCF Data
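A minimal sketch of how VCF text can be read, assuming the standard tab-separated layout (eight fixed columns, then FORMAT, then one column per sample). The `Variant` type, the `parse_vcf` helper, the sample names and the example record are illustrative, not taken from the talk.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    genotypes: dict  # sample name -> genotype string such as "0/1"


def parse_vcf(lines):
    """Yield Variant records from the lines of a VCF file."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        genotypes = {
            name: call.split(":")[gt_index]
            for name, call in zip(samples, fields[9:])
        }
        yield Variant(chrom, int(pos), ref, alt, genotypes)


# Illustrative two-sample input (hypothetical sample names).
EXAMPLE_VCF = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tcase1\tcontrol1",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\t.\tGT:GQ\t0/1:48\t0/0:43",
]
variants = list(parse_vcf(EXAMPLE_VCF))
```

Each record carries one genotype per sample, which is why the data grows with both the number of variants and the number of individuals.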
Genomic Research Workflow
https://www.projectmine.com/about/
Focus
Finding the disease gene(s)
Spot the variant that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
[Figure: variants in Gene1 and Gene2, compared between cases and controls]
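The oversimplified rule above (present in all affected, absent in all unaffected) can be sketched as a filter over a genotype table. The data layout, the `candidate_variants` helper and the variant and sample names are assumptions made for illustration.

```python
def candidate_variants(genotypes, cases, controls):
    """Return IDs of variants carried by every case and by no control.

    genotypes maps variant ID -> {sample name: alternate-allele count (0, 1 or 2)}.
    """
    hits = []
    for variant_id, calls in genotypes.items():
        in_all_cases = all(calls[sample] > 0 for sample in cases)
        in_no_controls = all(calls[sample] == 0 for sample in controls)
        if in_all_cases and in_no_controls:
            hits.append(variant_id)
    return hits


# Hypothetical toy data: one variant per gene, two cases, two controls.
genotypes = {
    "Gene1:var": {"case1": 1, "case2": 2, "control1": 0, "control2": 0},
    "Gene2:var": {"case1": 1, "case2": 0, "control1": 1, "control2": 0},
}
result = candidate_variants(
    genotypes, cases=["case1", "case2"], controls=["control1", "control2"]
)
```

Real studies rarely show such a clean split, which is one reason statistical and machine learning methods are used instead of a hard filter.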
Cloud Data Pipeline Pattern
Problem
• Define business problem
Data
• Quality
• Quantity
• Location
Candidate Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble Pipeline
• Validate sections
• Test at scale
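One way to read "validate sections" is to check each stage's output before it reaches the next stage. The sketch below assumes a pipeline expressed as a list of (name, function, validator) tuples; `run_pipeline` and the toy stages are hypothetical and stand in for real ingest, clean and analyze steps.

```python
def run_pipeline(data, stages):
    """Run (name, function, validator) stages in order, validating each output."""
    for name, stage, is_valid in stages:
        data = stage(data)
        if not is_valid(data):
            raise ValueError(f"stage {name!r} produced invalid output")
    return data


# Toy stages: strip whitespace, drop empty rows, count what is left.
stages = [
    ("ingest", lambda rows: [r.strip() for r in rows], lambda out: len(out) > 0),
    ("clean", lambda rows: [r for r in rows if r], lambda out: all(out)),
    ("analyze", lambda rows: {"count": len(rows)}, lambda out: out["count"] >= 0),
]
summary = run_pipeline([" a ", "", "b"], stages)
```

Keeping stages separate in this way lets each one be swapped for a different candidate technology while the validators stay the same.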
Machine Learning Pipeline Pattern
What is CSIRO’s solution?
• For scale at reasonable cost: use Apache Hadoop
• For scale at speed: use Apache Spark
• For usability in bioinformatics: create a domain-specific ML API (library)
• For global use: leverage Cloud Pipeline Patterns
GWAS Analysis with Variant-Spark
On-premise cluster with Apache Hadoop & Spark
Genomics Analysts
CSIRO corporate data center
Why Apache Spark?
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
Supervised ML: Wide Random Forests
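As background, a random forest grows each tree by repeatedly choosing, from a random subset of features, the one whose split most reduces label impurity. "Wide" data means far more features (variants) than samples. The sketch below shows that single split-selection step on binary features with Gini impurity. It is a generic illustration under those assumptions, not VariantSpark's implementation, and `gini`, `best_split` and the toy data are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(features, labels, mtry, rng):
    """Among mtry randomly sampled features, pick the largest impurity decrease.

    features maps feature name -> list of 0/1 values, one per sample.
    """
    n = len(labels)
    parent = gini(labels)
    best_name, best_gain = None, 0.0
    for name in rng.sample(sorted(features), mtry):
        left = [y for x, y in zip(features[name], labels) if x == 0]
        right = [y for x, y in zip(features[name], labels) if x == 1]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Toy data: 4 samples (2 cases, 2 controls), 3 binary variants.
labels = [1, 1, 0, 0]
features = {
    "v1": [1, 1, 0, 0],  # separates cases from controls
    "v2": [1, 0, 1, 0],  # uninformative
    "v3": [0, 0, 0, 1],  # partially informative
}
name, gain = best_split(features, labels, mtry=3, rng=random.Random(0))
```

With millions of variants, the cost of this step is dominated by the number of features scanned, which is what a distributed implementation has to spread across a cluster.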
Solving Important Questions…
Cancer genomics?
DEMO: Who is a Hipster?
VariantSpark & Databricks Notebook
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
[Chart axes: accuracy (low to high) vs. speed (low to high)]
Scaling to 50M variables and 10K samples
100K trees: 5–50 h (AWS: ~$215.50)
100K trees: 200–2000 h (AWS: ~$8620.00)
• YARN cluster
• 12 workers
• 16 x Intel Xeon E5-2660@2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset
[Chart regions: Whole Genome Range, GWAS Range]
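The 0.75TB figure in the cluster description follows directly from the executor settings, assuming binary units (1 TB = 1024 GB). The variable names are illustrative.

```python
# Executor settings as listed on the slide.
executors = 128
gb_per_executor = 6

# Total memory across all executors.
total_gb = executors * gb_per_executor
total_tb = total_gb / 1024  # assumes 1 TB = 1024 GB

print(f"{total_gb} GB = {total_tb} TB")
```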
Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
Future Directions for VariantSpark RF
• Additional feature types: unordered categorical; continuous (for scores)
• Different feature ranges: small and big inputs (for gene expression analysis)
Genome editing can correct genetic diseases, e.g. hypertrophic cardiomyopathy.
Editing does not work every time, e.g. only 7 in 10 embryos were mutation free.
Aim: develop a computational guidance framework to enable edits the first time, every time.
Ma et al. Nature 2017*
* Controversy around the paper: stay tuned
Make process parallel and scalable
• SPEED: Each search can be broken down into parallel tasks, so that it takes only seconds
• SCALE: Researchers might want to search the target for one gene or 100,000
Scalability + Agility =
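The idea of splitting a search into independent tasks can be sketched locally with a thread pool, one task per gene. The target definition used here (20 nucleotides followed by an NGG PAM, as is typical for SpCas9) is an assumption for illustration, and `find_targets` and `search_all` are hypothetical helpers. In a serverless deployment each task would instead be a separate function invocation.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Assumed target definition: 20 nucleotides followed by an NGG PAM.
# The lookahead lets overlapping candidate sites all be reported.
TARGET = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")


def find_targets(gene):
    """Return (gene name, list of (offset, 20-nt site)) for one gene."""
    name, sequence = gene
    return name, [(m.start(), m.group(1)) for m in TARGET.finditer(sequence)]


def search_all(genes, max_workers=8):
    """Search every gene as an independent task and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(find_targets, genes.items()))


# Toy sequences: geneA contains one candidate site, geneB none.
genes = {
    "geneA": "A" * 20 + "TGG",
    "geneB": "ACGT" * 5,
}
results = search_all(genes)
```

Because the tasks share no state, the same code shape works for one gene or for 100,000; only the number of workers changes.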
One of the first Serverless Applications in Research
Featured in This is My Architecture
GT-Scan2
Considering Services for GT-Scan2
• Use AWS Step Functions
• Simplify workflow
• Simplify task timeouts
• Simplify task failures
• Must evaluate costs
• SNS vs. Step Functions
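To illustrate how Step Functions can take over timeout and failure handling, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary. The state names, the retry settings and the Lambda ARN placeholder are assumptions, not the GT-Scan2 configuration.

```python
import json


def build_state_machine(search_lambda_arn, timeout_seconds=60, max_attempts=3):
    """Return an Amazon States Language definition for a single retried task."""
    return {
        "Comment": "Hypothetical single-task workflow for a target search",
        "StartAt": "SearchTargets",
        "States": {
            "SearchTargets": {
                "Type": "Task",
                "Resource": search_lambda_arn,
                # Timeout handled by the workflow, not by custom code.
                "TimeoutSeconds": timeout_seconds,
                # Failed or timed-out tasks are retried with backoff.
                "Retry": [
                    {
                        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                        "IntervalSeconds": 2,
                        "MaxAttempts": max_attempts,
                        "BackoffRate": 2.0,
                    }
                ],
                # Anything still failing after the retries goes to a Fail state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "SearchFailed"}],
                "End": True,
            },
            "SearchFailed": {
                "Type": "Fail",
                "Error": "SearchFailed",
                "Cause": "Target search did not complete",
            },
        },
    }


# Placeholder ARN: REGION and ACCOUNT_ID are not real values.
definition = build_state_machine(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:search-targets"
)
definition_json = json.dumps(definition, indent=2)
```

The cost question on the slide still applies: Step Functions charges per state transition, so retries and fine-grained states add to the bill.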
Cloud Data Pipeline Pattern
Problem, Data, Candidate Technologies, Build MVPs, Assemble Pipeline

1. Analyze/GWAS (assembled pipeline: Spark)
• Data: vcf -> S3/Hadoop
• Ingest: S3 -> Databricks DBFS
• ETL: Apache Spark
• Analyze: Variant-Spark ML
• Viz: Notebook SQL, R or Python

2. Search/GTScan2 (assembled pipeline: Serverless)
• Data: S3/fastq -> DynamoDB; S3/fastq, bed
• Ingest: S3
• ETL: Lambda
• Analyze: Lambda
• Viz: Lambda/API Gateway
Spark Pipeline Pattern
Jupyter Notebook
Serverless Architecture Pattern
[Architecture diagram components: Users, API Gateway, Lambda functions 1 to 3, buckets with objects, DynamoDB, Step Functions]
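A minimal sketch of one Lambda function behind API Gateway, using the `handler(event, context)` shape of a Python Lambda with a proxy-style event. The `jobId` parameter and the in-memory `JOBS` dictionary, which stands in for a DynamoDB table so the example runs locally, are assumptions.

```python
import json

# Stand-in for a DynamoDB table so the sketch runs without AWS access.
JOBS = {"42": {"status": "done", "targets": 3}}


def handler(event, context):
    """Look up a job by ID and return an API Gateway proxy-style response."""
    params = event.get("queryStringParameters") or {}
    job_id = params.get("jobId")
    if job_id is None:
        return {"statusCode": 400, "body": json.dumps({"error": "jobId is required"})}
    job = JOBS.get(job_id)
    if job is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown job"})}
    return {"statusCode": 200, "body": json.dumps(job)}


# Local invocation with a hand-built event; Lambda would supply both arguments.
response = handler({"queryStringParameters": {"jobId": "42"}}, None)
```

Each function stays small and stateless, with state kept in the buckets and the table, which is what lets the platform run many copies in parallel.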
Cloud Genomic Data Pipelines
• Problem #1: Analyze
  • Find the mutated genes
  • Solution: Spark-based machine learning
• Problem #2: Scan
  • Find the nucleotide (DNA letters)
  • Solution: Serverless
Genomics Big Data Pipelines
Dr. Denis Bauer & Lynn Langit

Editor's Notes

  • #4 http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  • #5 https://www.genome.gov/18016863/a-brief-guide-to-genomics/ https://www.thinglink.com/scene/617714375666434050
  • #6 http://images.wisegeek.com/woman-in-greek-tank-top-looking-at-thumb.jpg http://nborganics.com.au/index.php/product/herbs-coriander/
  • #7 http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
  • #8 http://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/
  • #20 https://academics.cloud.databricks.com/#notebook/170398/command/170419 and http://www.drjasonfox.com/
  • #21 Quickly access a managed Spark cluster (AWS EC2 / spot instances). Link to your data and perform whole genome analysis in real time.
  • #24 https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
  • #26 http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html Bauer et al. Trends Mol Med. 2014 PMID: 24801560.
  • #29 https://www.gt-scan.net/ and AMA with Dr. Bauer: https://www.reddit.com/r/science/comments/5fiicm/science_ama_series_im_denis_bauer_a_team_leader/
  • #30 Recent team presentation: https://www.slideshare.net/AustralianNationalDataService/gtscan2-bringing-bioinformatics-to-the-cloud-may-tech-talk
  • #32 Quickly access a managed Spark cluster (AWS EC2 / spot instances). Link to your data and perform whole genome analysis in real time.