Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

VariantSpark: applying Spark-based machine learning methods to genomic information


Published on

Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, providing the means of parallelisation for population-scale bioinformatics tasks. VariantSpark is the interface to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 Million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than the Spark-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits of speed, resource consumption and scalability enables VariantSpark to open up the usage of advanced, efficient machine learning algorithms to genomic data.
The package is written in Scala and available at

Published in: Science
  • Be the first to comment

VariantSpark: applying Spark-based machine learning methods to genomic information

  1. 1. Denis C. Bauer | Bioinformatics | @allPowerde 20 Nov 2015 CSIRO HEALTH & BIOSECURITY VariantSpark: applying Spark-based machine learning methods to genomic information ByTimCooper
  2. 2. Talk Overview VariantSpark| Denis C. Bauer @allPowerde2 | • Background: Why is genomics so important for medicine • VariantSpark: Overview • Whole Genome Analysis: Clustering samples by ethnicity
  3. 3. Genome sequencing improves diagnostics Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed and is on average 96% cheaper Presentation title | Presenter name3 | Oncology Tandem duplications Tandem duplications Identifying tumours by their genome- wide mutation profiles Rare genetic disorders Identifying causative mutations by interrogating all abnormal variants Bauer et al. Trends Mol Med. 2014 PMID: 24801560
  4. 4. Generating data from 1 Million Americans Presentation title | Presenter name4 | THEPRECISIONMEDICINEINITIATIVE WHATISIT? Precision medicine is an emerging approach for disease prevention and treatment that takes into account people’s individual variations in genes, environment, and lifestyle. The Precision Medicine Initiative will generate the scientific evidence needed to move the concept of precision medicine into clinical practice. WHYNOW? The time is right because of: Sequencing of the human genome Improved technologies for biomedical analysis New tools for using large datasets NEAR TERM GOALS Intensify efforts to apply precision medicine to cancer. Innovative clinical trials of targeted drugs for adult, pediatric cancers Use of combination therapies Knowledge to overcome drug resistance LONGER TERM GOALS Create a research cohort of > 1 million American volunteers who will share genetic data, biological samples, and diet/lifestyle information, all linked to their electronic health records if they choose. Pioneer a new model for doing science that emphasizes engaged participants, responsible data sharing, and privacy pr otection. Research based upon the cohort data will: • Advance pharmacogenomics, the right drug for the right patient at the right dose • Identify new targets for treatment and prevention Australia: ~ 100 Million dedicated to clinical genomics • $25 Million Australian Genomics Health Alliances (NHMRC Grant AI Denis); • VIC and QLD Alliances ($25 Million each); NSW and ACT (undisclosed $$ to Garvan and JCSMR)
  5. 5. 100,000 Genomes project 70,000 individuals by 2017 The cancer genome atlas 11,000 samples 2015 Genomics projects are getting bigger VariantSpark| Denis C. Bauer @allPowerde | Page 5 The HapMap Project 270 samples 2002 Human genome ~1 sample 1000 Genome Project 1097 samples 2012 Project MinE 15,000 people with ALS ASPREE 4000 healthy 70+ year olds Single samples are around 200GB in size
  6. 6. VariantSpark| Denis C. Bauer @allPowerde | Page 6
  7. 7. Data Analysis categories for genomics Map to genome and generate raw genomic features (e.g. SNPs) Analyze the data; Uncover the biological meaning Produce raw sequence readsBasic Production Informatics Advanced Production Inform. Bioinformatics Research VariantSpark| Denis C. Bauer @allPowerde | Page 7
  8. 8. VariantSpark Mllib* VCF VariantSpark is the interface enabling Spark’s MLlib machine learning algorithms to be applied to genomics data e.g. grouping samples by genomic profile Input Genomics Application Result Largescale compute VariantSpark| Denis C. Bauer @allPowerde | Page 8 * VariantSpark also uses Spark.ML
  9. 9. VariantSpark VariantSpark| Denis C. Bauer @allPowerde | Page 9 Accepted BMC Genomics (IF=4)
  10. 10. Cluster individuals into ethnic groups based on their genomic profiles 1000 x 40 Million variants Matrix * Kmeans Predict super population 4 14 ethnic groups and s u p e r populations VariantSpark| Denis C. Bauer @allPowerde | Page 10 * VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
  11. 11. Clustering result • (adjusted Rand index) ARI = 0.84, with -1 (independent labeling) and 1 (perfect match) • Majority of American (AMR) individuals being placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds. • ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25 Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9), 1655–1664 (2009) VariantSpark| Denis C. Bauer @allPowerde | Page 11
  12. 12. Comparison to other implementations • Preprocessing: converting location-centric VCF genotypes into sample- centric numerical vectors • Clustering: Kmeans • ADAM (BigData Genomics): Spark implementation with dense matrix • Hadoop: MapReduce without in- memory caching 0 1000 2000 Python R H adoop Adam AD M IXTU R E VariantSpark method timeinseconds task binary−conversion clustering pre−processing Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu. 103 75 29 28 18 4 min VariantSpark| Denis C. Bauer @allPowerde | Page 12
  13. 13. Scaling VariantSpark to the whole genome • Pre-processing: scales seamlessly as processes are independent • Clustering: memory consumption increases linear with number of variants (24GB) due to additional distance measurements between variants and k-means centroids • As total memory was the limiting factor on our infrastructure the number of simultaneously used nodes had to be reduced; increasing runtime. pre−processing clustering 40 45 50 55 60 65 5 10 15 20 25 0 20000 40000 executorsmemorytime 20 40 60 80 100 20 40 60 80 100 number of variants (%) value variable executors memory time CSIRO Spark Cluster: Whole genome; Hadoop 2.5.0, managed by cloudera’s CDH 5. We use Spark 1.3.1. This 13 node cluster has a total of 416 cores and 1.22TB memory. VariantSpark| Denis C. Bauer @allPowerde | Page 13
  14. 14. Three things to remember • VariantSpark is an interface bringing bigLearning tasks to genomics applications • VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB) – a task not being possible in R/python/ADMIXTURE due to memory limits. • VariantSpark outperforms ADAM (Big Data Genomics) and equivalent Hadoop-implementation by almost an order of magnitude. VariantSpark| Denis C. Bauer @allPowerde | Page 14
  15. 15. HEALTH AND BIOSECURITY Thank youHealth & Biosecurity Denis C. Bauer t +61 2 9123 4567 e w informatics/transformational-bioinformatics/ More talks online: Twitter: @allPowerde Aidan O’Brien Bill Wilson Transformational Bioinformatics Team, CSIRO Former members Firoz Anwar Neil Saunders Rodney Scott Newcastle University Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund Buske et al., Bioinformatics Jan 2014 O’Brien et al., BMC Genomics Dec 2015 Dunne et al., in preparation FullySIC Epistatic Gene Network modelling in preparation Anwar et al., in preparation Piotr Szul Gi Guo Robert Dunne Data61 CSIRO, Australia GOdistinct GO Enrichment or genesets with distinctive function
  16. 16. Presentation title | Presenter name16 |