Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes project, with genomes of 2504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables, and either fail or are inefficient in the GWAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for high-dimensional data sets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and related challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul
1. FINDING NEEDLES IN GENOMIC HAYSTACKS WITH “WIDE” RANDOM FOREST
Piotr Szul
CSIRO Data61
2. Which needle is the right one?
All humans carry between 200 and 800 mutations that disrupt the function of a gene.
The human genome is 3 billion letters long.
Finding genetic underpinnings for diseases and phenotypic traits
What are the biological mechanisms?
Who is at risk for a disease?
How to prevent and treat?
8. GWAS
[Chart: GWAS studies and associations by year, 2008 to 2015 (y-axis 0 to 12000)]
2713 studies
31183 associations
Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
The NHGRI-EBI Catalog of published genome-wide association studies
9. Missing Heritability
Manolio et al. (2009) Finding the missing heritability of complex diseases
… human height heritability is ~80% yet more than 40 associated loci explain only about 5% of phenotypic variance …
“Dark matter” of
genomics
11. Random Forest to the rescue
Lunetta et al. (2004) Screening large-scale association study data:
exploiting interactions using random forests
Breiman (2001) Random Forests.
Machine Learning
12. Random Forest in GWAS
• Non-parametric and arbitrarily expressive
• Insensitive to outliers and non-informative predictors
• Stable performance – no overfitting
• Easy to tune
• Built-in error estimate (OOB error)
• Variable importance measures
• Ability to deal with heterogeneous data
• Easy to parallelize and scale on HPC
Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
RF is an appropriate candidate to capture the genetic heterogeneity
underlying the trait because RF itself is an ensemble of many heterogeneous
trees built from uncorrelated subsamples of the original data
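The out-of-bag (OOB) error estimate listed above is possible because each tree is trained on a bootstrap sample: n samples drawn with replacement from n. A given sample is then absent from one tree's bag with probability (1 - 1/n)^n, which approaches e^-1 ≈ 0.368, so roughly a third of the data is available to test each tree for free. The sketch below only illustrates that arithmetic; the function names are hypothetical and the sample count of 2504 is borrowed from the abstract purely as an example.

```python
import math
import random


def expected_oob_fraction(n_samples: int) -> float:
    """Probability that one sample is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n_samples) ** n_samples


def simulated_oob_fraction(n_samples: int, seed: int = 0) -> float:
    """Draw one bootstrap sample and measure the fraction of samples left out of bag."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(in_bag) / n_samples


n = 2504  # illustrative sample count only
print(f"expected OOB fraction:  {expected_oob_fraction(n):.3f}")
print(f"simulated OOB fraction: {simulated_oob_fraction(n):.3f}")
print(f"limit e^-1:             {math.exp(-1):.3f}")
```

Averaging each tree's error over its own out-of-bag samples gives the forest-level OOB error without a separate validation set.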
14. Random Forest SparkML
Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
• Failing for millions of variables
• Relatively slow
15. “Cursed Forest”
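[Diagram: the Driver broadcasts the initial sample and the split subsets to the Executors. Each executor holds some of the variables (v1, v2, v3, …, vn) and reports its local best split (var, point). The Driver aggregates these into the global best split for each tree node: (var1, point1) for node 1, then (var21, point21) and (var22, point22) for nodes 2,1 and 2,2, …]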
Partition data by variables (columns)
• Columns are “small” – easy to partition
• An executor can find (an exact) best split for many variables
• Finding the global best split is efficient
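The column-partitioned scheme described above can be sketched in plain Python, with each "executor" modelled as a dictionary of the columns it holds. This is a minimal illustration for a single tree node, not the VariantSpark implementation: all names are hypothetical, Gini impurity is an assumed split criterion, and the per-variable search is a naive loop rather than the optimized one the talk refers to.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (weighted Gini impurity, variable name, split point); lower impurity is better
Split = Tuple[float, str, int]


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_variable(name: str, values: List[int], labels: List[int]) -> Optional[Split]:
    """Exact best split of the form 'value <= point' for one ordered variable."""
    best: Optional[Split] = None
    n = len(labels)
    for point in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for v, y in zip(values, labels) if v <= point]
        right = [y for v, y in zip(values, labels) if v > point]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        candidate = (impurity, name, point)
        if best is None or candidate < best:
            best = candidate
    return best


def local_best_split(partition: Dict[str, List[int]], labels: List[int]) -> Optional[Split]:
    """What one executor computes: the best split among the columns it holds."""
    found = [best_split_for_variable(n, v, labels) for n, v in partition.items()]
    found = [s for s in found if s is not None]
    return min(found) if found else None


def global_best_split(partitions: List[Dict[str, List[int]]], labels: List[int]) -> Optional[Split]:
    """What the driver does: aggregate one small tuple per executor."""
    local = [local_best_split(p, labels) for p in partitions]
    local = [s for s in local if s is not None]
    return min(local) if local else None


# Toy data: labels are broadcast to every executor, columns stay where they are.
labels = [0, 0, 1, 1, 0, 1]
partitions = [
    {"v1": [0, 1, 2, 0, 1, 2], "v2": [1, 1, 1, 0, 0, 0]},
    {"v3": [0, 0, 2, 2, 0, 1]},  # v3 separates the two classes at point 0
]
print(global_best_split(partitions, labels))
```

Only the labels (and, deeper in the tree, the sample subsets) travel to the executors, and only one candidate split per executor travels back, which is why the aggregation step stays cheap even with millions of columns.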
16. Some implementation tricks
• “Native” data shape
– VCF files are organized by “variables”
• Building by levels and tree batching
– Minimize communication overhead and the number of stages
• Optimized split finding for ordered factors
– The most frequent operation
– Java implementation faster than Scala
• Choice of data representation
– byte representation for variant data
– with sparsity 0.75 a sparse vector is 3x bigger than a byte array
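Two of the points above, the variable-oriented layout of VCF and the byte representation, can be illustrated with a short sketch. It rests on assumptions not stated in the slides: genotypes are encoded as the count of non-reference alleles (0, 1 or 2), missing calls count as 0, "sparsity 0.75" means 75% of entries are zero, and a sparse vector stores a 4-byte integer index plus an 8-byte double per non-zero entry. All function names are hypothetical.

```python
def genotype_to_byte(gt: str) -> int:
    """Encode a GT field such as '0/1' or '1|1' as the number of non-reference alleles.

    Assumption: missing alleles ('.') are counted as reference.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for a in alleles if a not in ("0", "."))


def vcf_row_to_bytes(line: str) -> bytes:
    """Turn one VCF data line (one variant, all samples) into a byte array."""
    fields = line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # column 9 is FORMAT
    samples = fields[9:]                         # sample columns follow FORMAT
    return bytes(genotype_to_byte(s.split(":")[gt_index]) for s in samples)


def byte_array_size(n_samples: int) -> int:
    """Bytes needed for one variable at one byte per sample."""
    return n_samples


def sparse_vector_size(n_samples: int, sparsity: float,
                       index_bytes: int = 4, value_bytes: int = 8) -> int:
    """Bytes needed for the non-zero entries of a sparse vector (assumed layout)."""
    non_zero = round(n_samples * (1.0 - sparsity))
    return non_zero * (index_bytes + value_bytes)


row = "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1|1\t./."
print(list(vcf_row_to_bytes(row)))

n = 10_000
print(sparse_vector_size(n, sparsity=0.75) / byte_array_size(n))
```

Under these assumptions a quarter of the entries cost 12 bytes each, which reproduces the 3x ratio quoted on the slide; because each VCF line already holds one variable across all samples, it maps directly onto one such byte array.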
17. How fast is it?
16 CPU cores, 32 GB RAM, local mode
18. Big data performance
• Yarn Cluster (12 workers)
– 16 x Intel Xeon E5-2660 @ 2.20GHz CPU
– 128 GB of RAM
• Spark 1.6.1 on YARN
– 128 executors
– 6GB / executor (0.75TB)
• Synthetic dataset (mtry = 0.25)
Typical GWAS range: 100K trees: 5 – 50h, AWS: ~$215.50
Whole genome range: 100K trees: 200 – 2000h, AWS: ~$8620.00
50M variables x 10k samples!
19. Other features
• Various input formats
– VCF, CSV, parquet
• A variety of RF (fine) tuning parameters
– Sampling
– Depths
– Splitting
• Insight into RF model
– cumulative OOB error
– per tree variable importance
– per tree OOB predictions
20. Simulated Data Study
• Synthetic dataset of 2.5M variables and 5000 samples
• 5 informative variables with dichotomous response
• Compare RF importance ranking with the model
• Rank-biased overlap (RBO) – measure of ranking overlap (with emphasis on highly ranked elements)
[Chart: RBO (axis from 0 to 1.5) for w_1, w_2, w_3, w_4, w_5]
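Rank-biased overlap, as used above to compare the importance ranking with the known model, weights agreement at depth d by p^(d-1), so the top of the ranking dominates the score. The sketch below implements the truncated form, (1 - p) · Σ p^(d-1) · |S[:d] ∩ T[:d]| / d, evaluated down to the shorter list. It is an illustration under assumptions: the function name, the default p = 0.9 and the toy rankings are hypothetical, and the slides do not say which RBO variant or p was used.

```python
from typing import Hashable, Sequence


def rbo_truncated(s: Sequence[Hashable], t: Sequence[Hashable], p: float = 0.9) -> float:
    """Truncated rank-biased overlap of two rankings (best item first, no duplicates)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    depth = min(len(s), len(t))
    score = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: shared items among the top d of each ranking.
        agreement = len(set(s[:d]) & set(t[:d])) / d
        score += p ** (d - 1) * agreement
    return (1.0 - p) * score


truth = ["v1", "v2", "v3", "v4", "v5"]      # hypothetical informative variables
ranking = ["v2", "v1", "v3", "v9", "v4"]    # hypothetical importance ranking
print(rbo_truncated(truth, ranking))
```

Note that the truncated score for two identical lists of length k is 1 - p^k rather than 1, so values are only comparable between rankings evaluated to the same depth with the same p.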
21. Bone Mineral Density Study
• Osteoporotic fracture is a leading cause of morbidity and mortality, particularly amongst the elderly.
• In 2004 ten million Americans were estimated to have osteoporosis, resulting in 1.5 million fractures per annum.
• Hip fracture is associated with a one-year mortality rate of 36% in men and 21% in women.
Burden of disease of osteoporotic fractures overall is
similar to that of colorectal cancer and greater than that of
hypertension and breast cancer
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
22. Bone Mineral Density Study
• 2036 samples & 288,768 SNPs
• Replicates 21 of 26 known associated genes
• Identifies 2 novel loci (known association with BMD)
• Provides strong evidence for a further 4 loci
Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
23. BMD - VariantSpark Results
Known BMD locations have significantly higher ranking (Mann-Whitney U, p = 1.3e-7)
A few novel highly ranked locations with plausible association with BMD: COLEC10, PRODH
Not replicated DCDC5 ranked 9,667 out of 10,000
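The Mann-Whitney U comparison mentioned above asks whether importance ranks of known locations tend to be better (numerically smaller) than those of the remaining locations. The sketch below shows the mechanics on made-up ranks; it is not the study's data or analysis. The function name is hypothetical, and the p-value uses a plain normal approximation without tie or continuity correction, which is rough for samples this small.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, p) where U counts pairs with x_i < y_j (ties count 0.5).

    p is the one-sided p-value for 'x tends to be smaller than y',
    from a normal approximation without tie or continuity correction.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    n, m = len(x), len(y)
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal
    return u, p


# Made-up importance ranks (1 = most important), for illustration only.
known_ranks = [3, 8, 15, 40, 120]
other_ranks = [60, 200, 450, 900, 1500, 3000, 7000]

u, p = mann_whitney_u(known_ranks, other_ranks)
print(f"U = {u}, one-sided p = {p:.4f}")
```

A U close to the number of pairs (here 5 x 7 = 35) means almost every known location outranks almost every other location.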
24. Future work and directions
Technical
• Compare and ’merge’ with Yggdrasil
• Deployment on cloud platforms
• Further performance improvements
Functional
• Implementation of cutting edge research
• Integration within genomics platforms (GATK4)
• More ML algorithms
Research
• Applications
• Data science research
• Gradient Boosted Trees
25. References
1. Aleesha Bates (2016) Practical aspects of GWAS Association studies under statistical genetics and GenABEL hands-on tutorial
2. Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population
3. The NHGRI-EBI Catalog of published genome-wide association studies
4. Manolio et al. (2009) Finding the missing heritability of complex diseases
5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests
6. Breiman (2001) Random Forests. Machine Learning
7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests
8. Danecek et al. The Variant Call Format and VCFtools
9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information
10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK
11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
26. Conclusions
Apache Spark is a feasible platform for machine learning in population-scale genomics.
VariantSpark with CursedForest is a promising alternative to traditional GWAS approaches.
Data shape, type, etc. matter – different optimizations are needed.