
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul

Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.

As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes project, with genomes of 2504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
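
To get a feel for these volumes, the following is a rough back-of-the-envelope sketch, not a measurement from the talk. It assumes one byte per genotype call for a dense layout (the byte representation mentioned on the implementation-tricks slide) and, for a sparse layout, a 4-byte index plus an 8-byte value per non-zero entry. The 0.75 sparsity is the figure quoted on that slide; applying it to the 1000 Genomes dimensions is an assumption made here for illustration. All function names are hypothetical.

```python
# Rough memory estimate for a genotype matrix (illustrative assumptions only).

def dense_bytes(n_variants: int, n_samples: int, bytes_per_call: int = 1) -> int:
    """Size of a dense matrix storing every genotype call in `bytes_per_call` bytes."""
    return n_variants * n_samples * bytes_per_call


def sparse_bytes(n_variants: int, n_samples: int, sparsity: float,
                 index_bytes: int = 4, value_bytes: int = 8) -> int:
    """Size of a sparse layout storing (index, value) for each non-zero call.

    `sparsity` is taken to mean the fraction of calls equal to zero (assumption).
    """
    non_zero = round(n_variants * n_samples * (1.0 - sparsity))
    return non_zero * (index_bytes + value_bytes)


n_variants = 85_000_000   # "nearly 85M genomic variants"
n_samples = 2_504         # individuals in the 1000 Genomes dataset

dense = dense_bytes(n_variants, n_samples)
sparse = sparse_bytes(n_variants, n_samples, sparsity=0.75)

print(f"dense byte matrix:        {dense / 1e9:.1f} GB")
print(f"sparse (index + value):   {sparse / 1e9:.1f} GB")
print(f"sparse / dense ratio:     {sparse / dense:.1f}")
```

Under these assumptions the sparse layout costs 12 bytes for every fourth entry, i.e. 3 bytes per entry on average against 1 byte for the dense array, which is consistent with the "3x bigger" remark in the slides.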

Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables, and either fail or are inefficient in the GWAS context, especially since costly data preprocessing is usually required.

To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for high-dimensional data sets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.

In the presentation we will introduce the domain of WGAS and related challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.

Published in: Data & Analytics

Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit East talk by Piotr Szul

  1. FINDING NEEDLES IN GENOMIC HAYSTACKS WITH “WIDE” RANDOM FOREST. Piotr Szul, CSIRO Data61
  2. Which needle is the right one? All humans carry between 200 and 800 mutations that disrupt the function of a gene. The human genome is 3 billion letters long. Finding genetic underpinnings for diseases and phenotypic traits: What are the biological mechanisms? Who is at risk for a disease? How to prevent and treat?
  3. 5319 talented staff • $1 billion+ budget • Working with over 2800 industry partners • 55 sites across Australia • Top 1% of global research agencies • Each year 6 CSIRO technologies contribute $5 billion to the economy
  4. Agenda • Intro to Genome Wide Association Studies • Variant Spark and “Cursed Forest” • GWAS use-cases
  5. Genome Wide Association Studies • 1000+ samples • Relatively common > 1% • ~ 500,000 SNPs (image courtesy of Pasieka, Science Photo Library)
  6. Look at the data • Typical GWAS: 1M variants x 5K samples • Full genome: 80M variants x 2.5K samples [Figure: genotype matrix of 0/1/2 values, variants (10^6) x samples (10^3), transposed so that samples are rows and variants are predictors, then associated with a per-sample response; chart placing studies and 1000 Genomes by number of samples and number of variants]
  7. GWAS • 2713 studies, 31183 associations [Chart: GWAS studies and associations by year, 2008 to 2015] • Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population • The NHGRI-EBI Catalog of published genome-wide association studies
  8. Missing Heritability • Manolio et al. (2009) Finding the missing heritability of complex diseases • … human height heritability is ~80% yet more than 40 associated loci explain only about 5% of phenotypic variance … • “Dark matter” of genomics
  9. Epistasis • Traditional approach for interaction modeling ’squares’ the problem size • 500,000 SNPs → ~100,000,000,000 pairs
  10. Random Forest to the rescue • Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests • Breiman (2001) Random Forests. Machine Learning
  11. Random Forest in GWAS • Non-parametric and arbitrarily expressive • Insensitive to outliers and non-informative predictors • Stable performance – no overfitting • Easy to tune • Built in error estimate (OOB error) • Variable importance measures • Ability to deal with heterogeneous data • Easy to parallelize and scale on HPC • Sun (2010) Multigenic Modeling of Complex Disease by Random Forests: RF is an appropriate candidate to capture the genetic heterogeneity underlying the trait because RF itself is an ensemble of many heterogeneous trees built from uncorrelated subsamples of the original data
  12. VariantSpark [Chart: time in seconds (0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary-conversion, clustering, pre-processing)] • It can cluster 3000 individuals and 80 million variants • O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information • Transformational Bioinformatics Team: Natalie Twine, Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson • Software: open source (MIT) @ https://github.com/csirobigdata/variant-spark
  13. Random Forest SparkML • Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK • Failing for millions of variables • Relatively slow
  14. “Cursed Forest” • Partition data by variables (columns) • Columns are “small” – easy partition • An executor can find (an exact) best split for many variables • Finding global best split is efficient [Diagram: the driver broadcasts the initial sample and the split subsets to the executors, which hold variables v1 … vn; each executor returns its local best split (var, point); the driver aggregates these into the global best split]
  15. Some implementation tricks • ”Native” data shape – VCF files are organized by ”variables” • Building by levels and tree batching – Minimize communication overhead and the number of stages • Optimized split finding for ordered factors – The most frequent operation – Java implementation faster than Scala • Choice of data representation – byte representation for variant data – with sparsity 0.75 a sparse vector is 3x bigger than a byte array
  16. How fast is it? • 16 CPU cores, 32GB RAM, local mode
  17. Big data performance • Yarn Cluster (12 workers) – 16 x Intel Xeon E5-2660 @ 2.20GHz CPU – 128 GB of RAM • Spark 1.6.1 on YARN – 128 executors – 6GB / executor (0.75TB) • Synthetic dataset (mtry = 0.25) • Typical GWAS range, 100K trees: 5 – 50h, AWS: ~$215.50 • Whole genome range, 100K trees: 200 – 2000h, AWS: ~$8620.00 • 50M variables x 10k samples!
  18. Other features • Various input formats – VCF, CSV, parquet • A variety of RF (fine) tuning parameters – Sampling – Depths – Splitting • Insight into RF model – cumulative OOB error – per tree variable importance – per tree OOB predictions
  19. Simulated Data Study • Synthetic dataset of 2.5M variables and 5000 samples • 5 informative variables with dichotomous response • Compare RF importance ranking with the model • Rank-biased overlap (RBO) – measure of ranking overlap (with emphasis on highly ranked elements) [Chart: RBO, axis 0 to 1.5, for w_1 to w_5]
  20. Bone Mineral Density Study • Osteoporotic fracture is a leading cause of morbidity and mortality, particularly amongst the elderly. • In 2004 ten million Americans were estimated to have osteoporosis, resulting in 1.5 million fractures per annum. • Hip fracture is associated with a one year mortality rate of 36% in men and 21% in women • Burden of disease of osteoporotic fractures overall is similar to that of colorectal cancer and greater than that of hypertension and breast cancer • Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
  21. Bone Mineral Density Study • 2036 samples & 288,768 SNPs • Replicates 21 of 26 known associated genes • Identifies 2 novel loci (known association with BMD) • Provides strong evidence for further 4 loci • Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
  22. BMD - VariantSpark Results • Known BMD locations have significantly higher ranking (Mann-Whitney U, p = 1.3e-7) • A few novel highly ranked locations with plausible association with BMD: COLEC10, PRODH • Not replicated: DCDC5 ranked 9,667 out of 10,000
  23. Future work and directions Technical Compare and ’merge’ with yggdrasil Deployment on cloud platforms Further performance improvements Functional Implementation of cutting edge research Integration within genomics platforms (GATK4) More ML algorithms Research Applications Data science research Gradient Boosted Trees
  24. References 1. Aleesha Bates (2016) Practical aspects of GWAS Association studies under statistical genetics and GenABEL hands-on tutorial 2. Hirota et al. (2012) Genome-wide association study identifies eight new susceptibility loci for atopic dermatitis in the Japanese population 3. The NHGRI-EBI Catalog of published genome-wide association studies 4. Manolio et al. (2009) Finding the missing heritability of complex diseases 5. Lunetta et al. (2004) Screening large-scale association study data: exploiting interactions using random forests 6. Breiman (2001) Random Forests. Machine Learning 7. Sun (2010) Multigenic Modeling of Complex Disease by Random Forests 8. Danecek et al. The Variant Call Format and VCFtools 9. O’Brien et al. (2015) VariantSpark: population scale clustering of genotype information 10. Firas Abuzaid (Spark Summit 2016) YGGDRASIL: Faster Decision Trees Column Partitioning in SPARK 11. Duncan et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk.
  25. Conclusions • Apache Spark is a feasible platform for machine learning in population scale genomics. • VariantSpark with CursedForest is a promising alternative to traditional GWAS approaches. • Data shape, type, etc. matter – different optimizations are needed.
  26. Thank You • Email: piotr.szul@data61.csiro.au • Github: https://github.com/csirobigdata/variant-spark
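
The column-partitioned split search outlined on the “Cursed Forest” slide can be illustrated with a small sketch. This is not VariantSpark's actual code: executors are simulated with plain Python dictionaries, Gini impurity is assumed as the split criterion, and every function name below is hypothetical. It only shows the shape of the idea: each partition holds whole columns, finds an exact local best split on its own variables, and the driver reduces the local winners to a global best split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split_for_variable(values, labels):
    """Exact best threshold for one variable; returns (weighted impurity, point)."""
    n = len(labels)
    best = (float("inf"), None)
    # Candidate rule: go left when value <= point. The largest value is skipped
    # because it would send every sample to the left.
    for point in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= point]
        right = [y for v, y in zip(values, labels) if v > point]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, point)
    return best


def local_best_split(partition, labels):
    """What one (simulated) executor does: scan only the variables it holds."""
    candidates = []
    for var, values in partition.items():
        score, point = best_split_for_variable(values, labels)
        if point is not None:  # constant variables offer no split
            candidates.append((score, var, point))
    return min(candidates) if candidates else None


def global_best_split(partitions, labels):
    """What the (simulated) driver does: pick the best of the local winners."""
    winners = [local_best_split(p, labels) for p in partitions]
    winners = [w for w in winners if w is not None]
    return min(winners) if winners else None


# Toy data: 6 samples, 4 variables encoded as 0/1/2, split over two partitions.
labels = [0, 0, 1, 1, 0, 1]
partitions = [
    {"v1": [0, 1, 0, 1, 0, 1], "v2": [0, 0, 0, 0, 1, 1]},
    {"v3": [0, 0, 2, 1, 0, 2], "v4": [1, 1, 1, 1, 1, 1]},
]

score, var, point = global_best_split(partitions, labels)
print(f"global best split: {var} <= {point} (weighted Gini {score:.3f})")
```

In this layout only the labels and the current sample subsets have to reach every partition, and only one small (score, variable, point) tuple per partition travels back, which is one plausible reading of why the slide calls finding the global best split efficient.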
