3. Natalie Twine
Transformational Bioinformatics Team
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Denis Bauer Oscar Luo Rob Dunne Piotr SzulAidan O’BrienLaurence Wilson
Adrian White
Mia Champion
Gaetan Burgio
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai
Kaylene Simpson
Iva Nikolic
Ian Blair
Kelly Williams
5. Unsupervised ML : K-Means
www.cloudaccess.eu
1000 x 40 Million variants
Matrix *
k-means
Predict super
population
4
14 ethnic groups and
s u p e r
populations
VariantSpark | Denis C. Bauer @allPowerde
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
6. Comparing k-means to other implementations
0
1000
2000
Python
R
H
adoop
Adam
AD
M
IXTU
R
E
VariantSpark
method
timeinseconds
task
binary−conversion
clustering
pre−processing
103 75 29 28 18 4 min
VariantSpark | Denis C. Bauer @allPowerde
7. Supervised ML: Wide Random Forests
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
9. Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
10. Scaling to 50 M variable and 10 K samples
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
100K trees: 5 – 50h
AWS: ~$215.50
100K trees: 200 – 2000h
AWS: ~ $ 8620.00
• Yarn Cluster (12 workers)
• 16 x Intel Xeon E5-2660@2.20GHz CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
Whole Genome
Range
GWAS Range