Spark Meetup London: share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook

Genomics and health data is nowadays one of the hot topics, requiring lots of computation and especially machine learning. This helps a science with very relevant societal impact achieve even better outcomes. That is why Apache Spark and its ADAM library are a must-have.


This talk will be twofold.


First, we'll show how Apache Spark, MLlib and ADAM can be plugged together to extract information from even huge and wide genomics datasets. Everything will be packed into examples from the Spark Notebook, showing how bio-scientists can work interactively with such a system.

Second, we'll explain how these methodologies and even the datasets themselves can be shared at very large scale between remote entities like hospitals or laboratories using micro services leveraging Apache Spark, ADAM, Play Framework 2, Avro and Tachyon.

Published in: Technology

  1. 1. Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook, by Data Fellas, Spark London Meetup, July 1st ’15
  2. 2. PART I Adam: genomics on Spark 1K Genomes in Adam on S3 Explore: Compute Stats Learn: train a model Outline PART II GA4GH: Standard for Genomics med-at-scale project Explore: using Standards Create custom micro services
  3. 3. Andy Petrella @noootsab Maths Scala Apache Spark Spark Notebook Trainer Data Banana Xavier Tordoir @xtordoir Physics Bioinformatics Scala Spark
  4. 4. PART I Spark & Genomics Adam: genomics on Spark 1K Genomes in Adam on S3 Explore: Compute Stats Learn: train a model So that’s the thing that separates us?
  5. 5. Adam What is genomics data Okay, sounds good. Give me two of them! Genome is an important factor in health: Medical Diagnostics Drug response Diseases mechanisms …
  6. 6. Adam What is genomics data You mean devs are slacking off? On the data production: Fast biotech progress Not so fast IT progress?
  7. 7. Adam What is genomics data No! They’re just sticky bubbles... On the data production: Sequence {A, T, G, C} 3 billion bases
  8. 8. Adam What is genomics data Okay, a lot of bubbles. On the data production: Sequence {A, T, G, C} 3 billion bases … x 30 (x 60?)
  9. 9. Adam What is genomics data C’mon. a big mess of plenty of lil’ bubbles then. On the data production: massively parallel Sequence {A, T, G, C} 3 billion bases … x 30 (x 60?)
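
The figures on these slides lend themselves to a quick back-of-the-envelope count. A minimal Python sketch, assuming the "x 30 (x 60?)" factors mean sequencing coverage (how many times each base is read on average); the function name is illustrative only:

```python
# Figure from the slides: roughly 3 billion bases in one human genome.
GENOME_BASES = 3_000_000_000


def sequenced_bases(coverage: int) -> int:
    """Total bases produced when every position is read `coverage` times."""
    return GENOME_BASES * coverage


for coverage in (30, 60):
    total = sequenced_bases(coverage)
    print(f"{coverage}x coverage: {total:,} bases")
```

Under that assumption a single genome at 30x already means 90 billion bases to store and process, before any per-base quality information is counted.
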
  10. 10. Adam What is genomics data Ah that explains why the black bars are different
  11. 11. Adam What is genomics data Dude... Tens of millions
  12. 12. Adam What is genomics data Staaaaaaph Tens of millions 1000’s 1,000,000’s …
  13. 13. Adam What is genomics data ‘coz it makes sparkling bubbles, right? Ok, looks like Apache Spark makes a lot of sense here …
  14. 14. Adam An understandable model Well done, a spec as text in a PDF…
  15. 15. Adam An understandable model Take that
  16. 16. Adam An understandable model Dunno what is a Genotype but it contains a Variant. Apparently.
  17. 17. Adam An understandable model Yeaaah: generate client == more slack Adam provides an Avro schema
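
The "a Genotype contains a Variant" relationship can be pictured with a couple of record types. This is only a sketch in Python: the field names below are hypothetical simplifications chosen for illustration, not the actual Avro schema shipped with Adam.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    # Hypothetical, simplified fields: where the variant sits and what changes.
    contig: str
    position: int
    reference_allele: str
    alternate_allele: str


@dataclass(frozen=True)
class Genotype:
    # A genotype is one sample's call at one variant, so it embeds the Variant.
    variant: Variant
    sample_id: str
    alleles: Tuple[str, ...]


example = Genotype(
    variant=Variant("chr1", 1000, "A", "G"),
    sample_id="sample-1",
    alleles=("A", "G"),
)
print(example.variant.alternate_allele)
```

A schema language such as Avro plays the same role as these class definitions, with the added benefit that client classes can be generated from it.
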
  18. 18. Adam An efficient storage Machism in I. T., what a flaw! ● Distribute data ● Schema based ● Read/query efficient ● Compact
  19. 19. Adam An efficient storage That’s a quick step ● Distribute data ● Schema based ● Read/query efficient ● Compact PARQUET!
  20. 20. Adam An efficient storage Is Eve okay to use the parquet for that? ● Distribute data ● Schema based ● Read/query efficient ● Compact PARQUET! Adam provides parquet as storage format
  21. 21. Adam A clean API Object Wrappedy adam Context
  22. 22. Adam A clean API I could have done this as a one liner adam Context IO methods
  23. 23. Adam A clean API At least, it’s going to be simpler than the chemistry ● Scala classes generated from Avro ● Data loaded as RDDs ● functions on RDDs ○ write to HDFS ○ genomic objects manipulations ○ Primitives to query genomics datasets
  24. 24. Adam Part of a pipeline human | Seq | SNAP | Avocado | Adam | Ga4gh ADAM is a JVM library leveraging Spark, Avro and Parquet. It still needs to be combined with sources (SNAP). Adam data is part of processes (Avocado). It can also be the source for external processing and learning (like MLlib).
  25. 25. Thousands Genomes Open Data Set Games without Frontiers 1000 genomes: http://www.1000genomes.org/
  26. 26. Produces BAMs, VCFs, ... Thousands Genomes Why do you complain, they are compressed …
  27. 27. Thousands Genomes Where are the data DNA Russian roulette: which is fastest? ● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ ● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ ● S3: http://aws.amazon.com/1000genomes/ ● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp
  28. 28. Thousands Genomes Adam that shit on S3 Hmmm like in the good old days of HPC The bad part … ● get the vcf.gz file on local disk (& time for a coffee) ● uncompress (& go for lunch) ● put in HDFS (& take dessert)
  29. 29. Thousands Genomes Adam that shit on S3 what? No grappa? The good part … the Notebook (this one)
  30. 30. Thousands Genomes Adam that shit on S3 Okay, good enough to wait a bit… What did we gain? ● before: 152 GB (gzipped) in 23 files ● After: 71 GB in 9172 partitions (43,372,735,220 genotypes)
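
The gain quoted on this slide can be turned into a few ratios. The following is plain arithmetic on the slide's own figures, assuming decimal gigabytes (1 GB = 10^9 bytes); the variable names are illustrative:

```python
# Figures from the slide; GB is assumed to mean 10**9 bytes.
GZIPPED_BYTES = 152 * 10**9    # before: gzipped VCF files
ADAM_BYTES = 71 * 10**9        # after: Adam data on S3
PARTITIONS = 9172
GENOTYPES = 43_372_735_220

size_ratio = ADAM_BYTES / GZIPPED_BYTES
bytes_per_genotype = ADAM_BYTES / GENOTYPES
genotypes_per_partition = GENOTYPES / PARTITIONS
mb_per_partition = ADAM_BYTES / PARTITIONS / 10**6

print(f"size ratio:              {size_ratio:.2f}")
print(f"bytes per genotype:      {bytes_per_genotype:.2f}")
print(f"genotypes per partition: {genotypes_per_partition:,.0f}")
print(f"MB per partition:        {mb_per_partition:.2f}")
```

Under those assumptions the Adam copy is a bit under half the size of the gzipped input, at roughly 1.6 bytes per genotype and a little under 8 MB per partition.
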
  31. 31. Explore Genomics Access the data Just in case, you don’t believe us -_-’ Access data from this notebook
  32. 32. Explore Genomics Compute statistics We’re there to compute, right? Compute Freqs from this Spark Notebook
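
The notebook itself is not part of this transcript. As a rough idea of what "computing frequencies" means, here is a plain-Python sketch on made-up toy genotype calls; it is not the Spark/Adam code from the talk, and every name and value in it is hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical toy calls: (variant id, sample id, alleles carried by the sample).
genotypes = [
    ("chr1:1000", "s1", ("A", "A")),
    ("chr1:1000", "s2", ("A", "G")),
    ("chr1:1000", "s3", ("G", "G")),
    ("chr1:2000", "s1", ("C", "C")),
    ("chr1:2000", "s2", ("C", "T")),
    ("chr1:2000", "s3", ("C", "C")),
]


def allele_frequencies(calls):
    """Per variant, the fraction of observed alleles that each allele represents."""
    counts = defaultdict(Counter)
    for variant, _sample, alleles in calls:
        counts[variant].update(alleles)
    return {
        variant: {allele: n / sum(c.values()) for allele, n in c.items()}
        for variant, c in counts.items()
    }


def minor_allele_frequency(freqs):
    """Frequency of the rarer allele at a biallelic site (0.0 if only one allele is seen)."""
    return min(freqs.values()) if len(freqs) > 1 else 0.0


frequencies = allele_frequencies(genotypes)
for variant, freqs in frequencies.items():
    print(variant, freqs, minor_allele_frequency(freqs))
```

On a cluster the same grouping by variant followed by counting would be expressed as a distributed aggregation, but the logic per variant stays this small.
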
  33. 33. Learn Genomics The problem Insane, you’ll have hard time with me |:-[ How to deal with heterogeneous data? ● Population stratification ● Identify natural clusters ● Assign genomes to these clusters
  34. 34. Learn Genomics The dimensions Wiiiiiiiiiiiiiiiiide rows ● 1000 Samples (Rows) ● 30,000,000 variants (columns or variables) Hard to explore such a feature space…
  35. 35. Learn Genomics The dimensions *LDA for Latent Dirichlet Allocation… Dimensionality reduction? ● Ideal would be a “Genetic” Mixture measure (LDA* would do that…) ● Or a genetic distance (edit distance) KMeans & distances to centroids
  36. 36. Learn Genomics The model Reduce, train, validate, infer ● Split training/validation set ● Train KMeans with 25 clusters ● Compute distances to each centroid as new features ● Train Random Forest ● Validation
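
The "distances to centroids as new features" step can be illustrated without any cluster. Below is a minimal pure-Python sketch, assuming Euclidean distance, a deterministic start from the first k points, and toy data with 2 clusters instead of the 25 mentioned above; it stops before the Random Forest and validation steps and is not the MLlib code used in the talk:

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def centroid_features(point, centroids):
    """Replace a wide row by its distances to each centroid."""
    return [distance(point, c) for c in centroids]


# Hypothetical toy matrix: rows are samples, columns are variants
# (values stand for the number of alternate alleles: 0, 1 or 2).
samples = [
    [0, 0, 1, 2, 2],
    [2, 2, 1, 0, 0],
    [0, 1, 1, 2, 2],
    [2, 2, 0, 0, 1],
]

centroids = kmeans(samples, k=2)
features = [centroid_features(s, centroids) for s in samples]
for row in features:
    print([round(v, 3) for v in row])
```

Each sample goes from one column per variant to one column per cluster, which is the point of the reduction: a classifier trained afterwards sees k features instead of millions.
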
  37. 37. Learn Genomics The notebook Define and train the model in this Notebook The whole shebang?
  38. 38. Adam Our pipeline I am a Llama Convert VCFs to ADAM, store ADAM to S3, compute allele frequencies, store allele frequencies to S3, compute minor allele frequency distribution, train a model for stratification. Hmmm… quite some missing pieces, right?
  39. 39. PART II Standards & Micro Services Wake up! GA4GH: Standard for Genomics med-at-scale project Explore: using Standards Create custom micro services
  40. 40. GA4GH Let’s fix the baseline In I.T. it’s easy everything is standardized… Global Alliance for Genomics and Health http://genomicsandhealth.org/ http://ga4gh.org/ Framework for responsible data sharing ● Define schemas ● Define services Along with ethical, legal, security, clinical aspects
  41. 41. GA4GH models … everybody has his own standard
  42. 42. GA4GH Services But a shared schema is a bit better!
  43. 43. GA4GH Metadata The data of my data is also my data Work In Progress ● Individual ● Sample ● Experiment ● Dataset ● IndividualGroup ● Analysis But still very young and too much centered on data Beacon ⁽*⁾ Tells the world you have data. Clearly not enough
  44. 44. Med At Scale By Data Fellas Existing scalable implementation: Google Genomics Uses ● BigQuery ● Google Cloud computing ● Dremel ● … That’s what happens when you think you have…
  45. 45. Med At Scale By Data Fellas Google Genomics is pushing Hard …
  46. 46. Med At Scale Scalability first BIG There is another scalable implementation: Med At Scale, by Data Fellas Uses ● Apache Spark ● Adam ● S3 ● HDFS ● …
  47. 47. Med At Scale Scalability first Data Fellas is pushing TOO BIG
  48. 48. Med At Scale Composability very BIG GA4GH defines quite some methods, or services They don’t all have the same requirements in terms of exposure and data processing → micro services for the Win Allows granular deployment and composition/chaining of methods to answer a global question
  49. 49. Med At Scale Customization Data Fellas is a data science company Thus our goal is to expose data analyses A data analysis is ● elaborated in a notebook ● validated on a cluster ● deployed as a micro service itself Still defining a Schema and Service VERY VERY BIG
  50. 50. Med At Scale Ready for the load Balls! We saw that one row has 30,000,000 columns The queries are slicing and dicing those columns → views are huge Hence, Tachyon via RDD.persist/save will optimize the co-located queries in space and time. The hard part is (and will be) to size the Tachyon cluster
  51. 51. Med At Scale Ad Hoc Analytics Who left the rats out? Standards are very important However, they cannot define everything, mostly OLAP. Ad-Hoc analytics are thus allowed on the raw data using Apache Spark directly. Of course, interactivity is a key to performance… hence the Spark-Notebook is involved.
  52. 52. Med At Scale How it works Finally…
  53. 53. Med At Scale ADAM (and Spark) Finally…
  54. 54. Med At Scale MLlib (and Spark) Finally…
  55. 55. Med At Scale Efficient binary data Finally…
  56. 56. Med At Scale Micro Service Finally…
  57. 57. Med At Scale Cache and Collaboration Finally…
  58. 58. Explore Using GA4GH endpoints notebook TIME! Use Scala/Java Avro client from the browser. I give you Bananas You give me Ananas
  59. 59. Customize Create and Use micro service (WIP) Planning the next gear Remember the frequencies use case? There is a custom endpoint manually created We’re working on an Integrated Workflow In a notebook: ● create the process ● create Cassandra schema ● persist (using connector) ● Define service AVRO IDL ● Generate project for DCOS ● Log usage (see next)
  60. 60. Optimization Query mining (Roadmap) Always look at the bright side Back to the high dimensionality problem Caching beforehand is a good solution but is not optimal. Plan: analyse the Request/Response objects and the gathered runtime metrics to adapt the caching policies -- query mining processes
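
The query mining idea is only a roadmap item here, so no implementation is described. One very simple way to picture it, assuming each logged request can be reduced to the genomic region it asked for (the log format and region labels below are hypothetical), is to count requests and keep the most popular regions cached:

```python
from collections import Counter

# Hypothetical request log: one region label per served query.
request_log = [
    "chr1:0-1M",
    "chr2:0-1M",
    "chr1:0-1M",
    "chr7:5M-6M",
    "chr1:0-1M",
    "chr2:0-1M",
]


def regions_to_cache(log, capacity):
    """Return the `capacity` most requested regions, most frequent first."""
    return [region for region, _count in Counter(log).most_common(capacity)]


hot_regions = regions_to_cache(request_log, capacity=2)
print(hot_regions)
```

A real policy would also weigh the runtime metrics mentioned on the slide (for instance how expensive a view is to recompute), which this frequency count ignores.
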
  61. 61. References Adam: https://github.com/bigdatagenomics/adam Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats GA4GH website: http://genomicsandhealth.org/ GA4GH data working group: http://ga4gh.org/ Spark-Notebook: https://github.com/andypetrella/spark-notebook/ Med-At-Scale: https://github.com/med-at-scale/high-health Data Fellas: http://data-fellas.guru/
  62. 62. Q/A⁽*⁾ THANKS! ⁽*⁾ or head to the pub (at least beers…)
