Genomic Scale Big Data Pipelines

Dr. Denis Bauer & Lynn Langit
Genomic-scale Data Pipelines

Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Transformational Bioinformatics Team
Denis Bauer,
PhD
Oscar Luo,
PhD
Rob Dunne,
PhD
Piotr Szul
Team
Aidan O’BrienLaurence Wilson,
PhD
Adrian White
Andy Hindmarch
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai,
PhD
Natalie Twine,
PhD
Arash Bayat
John Hildebrandt
Mia Chapman
Ian Blair
Kelly Williams
Jules Damji
Gaetan Burgio Lynn Langit

1000
17
2000
0 500 1000 1500 2000 2500
Astronomy
Twitter
YouTube
Big Data in 2025…Petabytes?
1000
17
2000
0 500 1000 1500 2000 2500
Astronomy
Twitter
YouTube
Big Data in 2025…Petabytes?

Genome holds the blueprint for every cell

It affects looks, disease risk, and behavior

1
0.17
2
20
0 5 10 15 20 25
Astronomy
Twitter
YouTube
Genomic
GENOMIC Big Data in 2025 - Exabytes

VCF Data

Genomic Research Workflow
https://www.projectmine.com/about/
Focus

Finding the disease gene(s)
Spot the variant that is…
• common amongst all affected
• absent in all unaffected*
* oversimplified
cases
controls
Gene1 Gene2

CloudDataPipelinePattern
Problem
• Define biz
problem
Data
• Quality
• Quantity
• Location
Candidate
Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble
Pipeline
• Validate sections
• Test at scale

Candidate
Technologies
• Ingest
• Clean
• Analyze
• Predict
• Visualize
Build MVPs
• Iterate
• Learn
• Assemble
Assemble
Pipeline
• Validate sections
• Test at scale

Machine Learning Pipeline Pattern

What is CSIRO’s solution?
For Scale at
reasonable cost Use Apache Hadoop
For Scale at
speed Use Apache Spark
For Usability in
bioinformatics Create a domain-specific ML API (library)
For global use
Leverage Cloud Pipeline Patterns
Transformational Bioinformatics| Denis C. Bauer @allPowerde

GWAS Analysis with Variant-Spark
On-premise Cluster
with Apache Hadoop & Spark
Genomics Analysts
CSIRO corporate data center

Why
Apache
Spark?

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited
4

Supervised ML: Wide Random Forests

Solving Important Questions…
Cancer genomics?

DEMO: Who is a Hipster?

VariantSpark & Databricks Notebook
databricks Notebook

Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
low Accuracy high
lowSpeedhigh

Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~ $ 8620.00
• Yarn Cluster
• 12 workers
• 16 x Intel Xeon E5-2660@2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset
Whole Genome
Range
GWAS Range

Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-
curse-of-dimensionality-in-genomics-using-wide-
random-forests.html

Future Directions for VariantSpark RF
Additional feature types
Unordered
Categorical
For Scores -
Continuous
Different feature ranges
Small and Big
Inputs
For Gene
Expression analysis

Genome Editing can correct genetic
diseases, ex. hypertrophic cardiomyopathy
Editing does not work every time, e.g. only
7 in 10 embryos were mutation free
Aim: Develop computational
guidance framework to enable edits
the first time; every time
Ma et al. Nature 2017 *
* Controversy around the paper – stay tuned

Make process parallel and scalable
• SPEED: Each search can be broken down into parallel tasks to then only take
seconds
• SCALE: Researchers might want to search the target for one gene or 100,000
Scalability + Agility =

One of the first Serverless Applications in Research
Featured in
This is My Architecture

Considering Services
for GT-Scan2
• Use AWS Step Functions
• Simplify workflow
• Simplify task timeouts
• Simplify task failures
• Must evaluate costs
• SNS vs. Step Functions

Problem Data
Candidate
Technologies
Build MVPs
Assemble
Pipeline
1. Analyze/GWAS vcf -> S3/Hadoop Ingest
ETL
Analyze
Viz
S3 -> Databricks DBFS
Apache Spark
Variant-Spark ML
Notebook SQL, R or Python
Spark
2. Search/GTScan2 S3/fastq-> DynamoDB
S3/fastq, bed
Ingest
ETL
Analyze
Viz
S3
Lambda
Lambda
Lambda/API Gateway
Serverless

Spark Pipeline Pattern
Jupyter Notebook

Serverless Architecture Pattern
Lambda
function
1
Lambda
function
2
Lambda
function
3
buckets with
objects DynamoDB
API Gateway Users
Step Functions

Cloud Genomic Data Pipelines
• Problem # 1 – Analyze
• Find the mutated genes
• Solution: Spark-based machine learning
• Problem #2 – Scan
• Find the nucleotide (DNA letters)
• Solution: Serverless

Genomics Big Data Pipelines
Dr. Denis Bauer & Lynn Langit

Genomic Scale Big Data Pipelines

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Genomic Scale Big Data Pipelines

Similar to Genomic Scale Big Data Pipelines (20)

More from Lynn Langit

More from Lynn Langit (20)

Recently uploaded

Recently uploaded (20)

Genomic Scale Big Data Pipelines

Editor's Notes