VariantSpark – Apache Spark for Bioinformatics
CSIRO DATA61
Piotr Szul | Senior Engineer
Spark Summit Europe 2017
Variant-Apache Spark for Bioinformatics with Piotr Szul

This talk showcases work by the bioinformatics team at CSIRO in Sydney, Australia to make Spark more useful and usable for the bioinformatics community. They have created a custom library, variant-spark, which provides a DSL and a custom Spark implementation of random forests for genomic pipeline processing. We've created a demo, using their 'Hipster-genome' and a Databricks notebook, to better explain their library to the worldwide bioinformatics community. The notebook also compares results with another popular genomics library (HAIL.io).

  1. VariantSpark – Apache Spark for Bioinformatics. CSIRO DATA61. Piotr Szul | Senior Engineer. Spark Summit Europe 2017
  2. Overview. CSIRO: How to facilitate better collaborations? Big Data in Genomics: Genomics data challenge. VariantSpark: How to find disease genes in population-size cohorts?
  3. Team CSIRO: 5319 talented staff; $1 billion+ budget; working with 2800+ industry partners; 55 sites across Australia; top 1% of global research agencies; each year 6 CSIRO technologies contribute $5 billion to the economy
  4. Big ideas start here: EXTENDED WEAR CONTACTS, POLYMER BANKNOTES, RELENZA FLU TREATMENT, Fast WLAN (Wireless Local Area Network), AEROGARD, TOTAL WELLBEING DIET, RAFT POLYMERISATION, BARLEYmax™, SELF TWISTING YARN, SOFTLY WASHING LIQUID, HENDRA VACCINE, NOVACQ™ PRAWN FEED
  5. Bioinformatics | Denis C. Bauer | @allPowerde. Convenient cardiac rehabilitation. Enhancing relationship between patient and mentor. Digital data collection. Equitable access. World's first clinically validated smartphone-based cardiac rehab: uptake +30% and completion +70%
  6. By 2025 it is estimated that 50% of the world population will have been sequenced. (Frost & Sullivan)
  7. Genomics will outpace other BigData disciplines (Stephens et al., PLOS Biology 2015). [Chart comparing Astronomy, Twitter, YouTube, and Genomics]
  8. BMC Genomics 2015, 16:1052, PMID: 26651996 (IF=4). VariantSpark learns from 3000 individuals and 80 million mutations in under 30 minutes. Cited 7. In the top 5% of all research outputs scored by Altmetric (31). [Diagram: Spark Core, Spark ML, MLlib, Variant Spark] [Chart: time in seconds by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary-conversion, clustering, pre-processing)]
  9. Genomic Research Workflow. https://www.projectmine.com/about/ [Figure label: Focus]
  10. Finding the disease gene(s). Spot the variant that is common amongst all affected but absent in all unaffected* (* oversimplified). [Figure: cases vs. controls, Gene1, Gene2]
  11. Complex diseases are driven by joint-loci. • However, individual strong contributors are rare… Need a more sophisticated ML approach, such as Random Forest on 1.7 Trillion data points. [Figure: cases vs. controls]
  12. Machine learning on 1.7 Trillion data points. 80 Million features. Individuals, Genomic profile, Disease status. 22,500 samples. Disease association identified by GWAS. Spark Summit 2017 by Cotton Seed (MIT)
  13. Look at the data. Typical GWAS: 1M variants x 5K samples. Full genome: 80M variants x 2.5K samples. [Figure: genotype matrix with entries 0, 1, 2, samples (10^3) x variants (10^6), transposed to variants x samples; the variants are the predictors and are associated with a per-sample response. Chart: samples and variants across studies, including 1000 Genomes]
  14. Why we needed to re-implement RF. • Spark ML's RF was designed for 'Big' low dimensional data. • The full genome-wide profile does not fit into an executor's memory, rendering the approach infeasible. "Cursed" BigData, e.g. Genomics: moderate number of samples with many features; feature set too large to be handled by a single executor
  15. How do other people try to solve this issue? Firas Abuzaid (Spark Summit 2016), YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK. Flip the matrix: partition by column
  16. "Cursed Forest". Flip and chop: partition by variables. [Diagram labels: Driver; Executors holding variables v1, v2, v3, … vn; broadcast; aggregate; initial sample; split subsets; local best split (var, point), e.g. (var1, point1), (var21, point21), (var22, point22); global best split] • Columns are "small" – easy partition • An executor can find (an exact) best split for many variables • Finding global best split is efficient
  17. Supervised: Cursed Forest
  18. Variant Spark – ML for Genomics Variants. https://github.com/aehrc/VariantSpark
  19. Improving Research Collaboration. • Quickly access a managed Spark cluster - AWS EC2 / spot instances • Link to your data and perform whole genome analysis in real-time. Jupyter Notebook. Phenotype = ((2 + B6) * (1.5 + R1)) + ((0.5 + C2) * (1 + B2)). Demonstration
  20. Try it on your data. HipsterScore = ((2 + B6) * (1.5 + R1)) + ((0.5 + C2) * (1 + B2)); HipsterScore > 10 = 1. https://aehrc.github.io/VariantSpark/notebook-examples/VariantSpark_HipsterIndex_Spark2.html
  21. Comparing VariantSpark with Hail
  22. Big data performance. Typical GWAS range, 100K trees: 5 – 50h, AWS: ~$215.50. Whole genome range, 100K trees: 200 – 2000h, AWS: ~$8620.00 (128 CPU cores). 50M variables x 10k samples!
  23. Transformational Bioinformatics Denis Bauer, PhD Oscar Luo, PhD Rob Dunne, PhD Piotr Szul Team Aidan O'Brien Laurence Wilson, PhD Adrian White Andy Hindmarch Collaborators David Levy News Software Dan Andrews Kaitao Lai, PhD Arash Bayat John Hildebrandt Mia Chapman Ian Blair Kelly Williams Jules Damji Gaetan Burgio Lynn Langit Natalie Twine, PhD
  24. Thank you. Github: https://github.com/aehrc/VariantSpark Databricks Blog Post: https://tinyurl.com/y7l9rzkp Email: Piotr.Szul@csiro.au CSIRO DATA61
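
The "Cursed Forest" slide (slide 16) describes partitioning the data by variable, letting each executor find the best split among the variables it holds, and having the driver pick the global best split from those local results. The following is a minimal single-process sketch of that idea, not VariantSpark's actual implementation: it uses plain Python lists instead of Spark, assumes a binary response and Gini impurity as the split criterion (the slides do not name one), and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)


def best_split_for_variable(values, labels):
    """Best threshold for one variable: returns (impurity, threshold) or None."""
    best = None
    n = len(labels)
    # Try every distinct value except the largest as a 'value <= threshold' split.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, threshold)
    return best


def local_best_split(partition, labels):
    """What one 'executor' does: best (impurity, variable, threshold) in its columns."""
    best = None
    for var, values in partition.items():
        split = best_split_for_variable(values, labels)
        if split is not None and (best is None or split[0] < best[0]):
            best = (split[0], var, split[1])
    return best


def global_best_split(partitions, labels):
    """What the 'driver' does: aggregate local results into the global best split."""
    candidates = [local_best_split(p, labels) for p in partitions]
    candidates = [c for c in candidates if c is not None]
    return min(candidates, key=lambda c: c[0]) if candidates else None


# Toy data: 6 samples, 4 variants coded 0/1/2, split across two partitions.
labels = [1, 1, 1, 0, 0, 0]
partition_a = {"v1": [0, 1, 0, 1, 0, 1], "v2": [2, 2, 1, 0, 0, 0]}
partition_b = {"v3": [0, 0, 0, 0, 0, 0], "v4": [1, 2, 1, 1, 0, 2]}

print(global_best_split([partition_a, partition_b], labels))
```

Because each partition holds whole columns, every local search sees all samples for its variables, so the local results are exact and the driver only has to compare one small tuple per partition.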

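The synthetic phenotype on slides 19 and 20 can be written out directly. This is a small sketch under two assumptions that the slides do not state: that B6, R1, C2 and B2 are genotype values coded 0, 1 or 2 (as in the matrix on slide 13), and that any score of 10 or below is labelled 0. The function names are hypothetical.

```python
def hipster_score(b6, r1, c2, b2):
    """HipsterScore formula as given on the slide."""
    return ((2 + b6) * (1.5 + r1)) + ((0.5 + c2) * (1 + b2))


def hipster_label(b6, r1, c2, b2):
    """Binary label: 1 if HipsterScore > 10, else 0 (the 'else 0' is assumed)."""
    return 1 if hipster_score(b6, r1, c2, b2) > 10 else 0


# Lowest and highest scores under the assumed 0/1/2 coding.
print(hipster_score(0, 0, 0, 0), hipster_label(0, 0, 0, 0))
print(hipster_score(2, 2, 2, 2), hipster_label(2, 2, 2, 2))
```

The formula multiplies pairs of variants, so no single variant determines the label on its own; that interaction is the kind of joint effect slide 11 argues a random forest is needed to detect.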