Spark Summit Europe: Share and analyse genomic data at scale
1. Share and analyse genomic data
at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct 29th, 2015
2. Outline
● Sharp intro to Genomics data
● What are the Challenges
● Distributed Machine Learning to the rescue
● Projects: Distributed teams
● Research: Long process
● Towards Maximum Share for efficiency
3. Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
Trainer Spark
Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
4. Analyse Genomic Data at Scale
Spark, Adam, Spark Notebook
➔ Sharp intro to Genomics data
➔ What are the Challenges
➔ Distributed Machine Learning to the rescue
5. What is genomics data?
DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response,
disease mechanisms
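To make "genomics data" concrete, here is a minimal, illustrative data model for a variant call: a position in the genome plus the alleles a sample carries there. The field names are hypothetical and simplified; ADAM defines its own Avro-backed schemas for this.

```scala
// A minimal, illustrative data model for a genomic variant call.
// Field names are hypothetical; ADAM defines richer Avro-backed schemas.
case class Variant(contig: String, position: Long, ref: String, alt: String)

// One sample's called alleles at a site: 0 = reference allele, 1 = alternate.
case class Genotype(variant: Variant, sampleId: String, alleles: Seq[Int])

object GenotypeExample {
  def main(args: Array[String]): Unit = {
    val v = Variant("chr1", 752721L, "A", "G")
    val g = Genotype(v, "HG00096", Seq(0, 1)) // heterozygous: one ref, one alt
    println(s"${g.sampleId}: ${g.alleles.count(_ == 1)} alt allele(s) at ${v.contig}:${v.position}")
  }
}
```

A whole-genome dataset is millions of such records per sample, across thousands of samples, which is why a distributed representation matters.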
18. So what do we do with this?
Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised
learning)
… and share and replay!
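One of the descriptive statistics above, allele frequency per population, can be sketched in plain Scala. The population labels and the 0/1 allele encoding are assumptions for illustration; in practice this would run over an ADAM genotype RDD rather than an in-memory Seq.

```scala
// Sketch: alternate-allele frequency per population for one biallelic site.
// 0 = reference allele, 1 = alternate; population labels are illustrative.
object AlleleFreq {
  def altFrequency(genotypes: Seq[(String, Seq[Int])]): Map[String, Double] =
    genotypes
      .groupBy(_._1) // group genotype calls by population label
      .map { case (pop, calls) =>
        val alleles = calls.flatMap(_._2)
        pop -> alleles.sum.toDouble / alleles.size
      }
}
```

For example, `AlleleFreq.altFrequency(Seq(("EUR", Seq(0, 1)), ("EUR", Seq(1, 1)), ("AFR", Seq(0, 0))))` yields an alt frequency of 0.75 for EUR and 0.0 for AFR. Comparing such frequencies across populations is the starting point for studying population stratification.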
19. The Spark Notebook
… comes to the rescue.
+ Self-described and consistent
+ Easily shared (code)
+ Scala (types, production quality)
+ Reactive & pluggable charts API (Scala = no JS)
+ Easy install, no dependencies
+ Multiple SparkContexts
http://www.spark-notebook.io
23. So what do we do with this?
… and share and replay!
Code can be shared easily but we want more...
How do we share data produced by the notebook?
How do we publish the notebook as a service?
24. Share Genomic Data at Scale
Spark, Tachyon, Mesos, Shar3
➔ Projects: Distributed teams
➔ Research: Long process
➔ Towards Maximum Share for efficiency
25. Projects
Intrinsically involve many teams
geographically distributed across different
countries or laboratories
with different skills in
Biology, Genetics, IT, Medicine (, legal...)
29. Projects
Need proper data management between entities, while
coping with:
amount of data
heterogeneity of people
distance between actors
constraints related to data location
32. Research
Most reasons are quite obvious and must not be overlooked:
Lots of measurements and validation
Lots of controls (including by governments)
Lots of actors
33. Research
As a matter of fact, research needs
to be conducted on data and
to produce results
And both are heavily reused
So what if we lose either of them?
34. Research
However, we can get into trouble instantly
without even losing them!
What if we don’t track the processes?
In any scientific process, confrontation, replay and
enhancement are key to moving forward
35. It is misleading to think that sharing the code is enough.
Remember: we are after data and results, not code.
The process includes the code, the context, the sources
and so on, and all of it should be part of the data
discovery/validation task
Research
36. Assess the risk factor associated with a disease given
mutations of a certain gene.
More than 50 years of data collection and modelling.
Hundreds of researchers, each generation has new ideas.
Replaying old processes on new data,
new processes on old data
Research
37. Share, share, share
All these facts relate to our capacity to share our work and
to collaborate.
We need to share efficiently and accurately the
★ data
★ processes
★ results
39. Share, share, share
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
(Diagram: each pipeline step annotated with the roles involved: ops, data, sci, web)
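The pipeline stages listed above (sources, a transformation step, sinks) can be sketched as a typed composition in plain Scala. The `Source`, `Pipeline`, and `Sink` names are illustrative abstractions, not a Shar3 or Spark API; in a real deployment the stages would run over distributed datasets.

```scala
// Illustrative typed pipeline: read from a source, transform, write to a sink.
// These traits are a sketch, not a Shar3 or Spark API.
object PipelineSketch {
  trait Source[A] { def read(): Seq[A] }
  trait Sink[B]   { def write(rows: Seq[B]): Unit }

  // A pipeline is a function from source records to result records;
  // andThen composes a further transformation onto it.
  final case class Pipeline[A, B](run: Seq[A] => Seq[B]) {
    def andThen[C](next: Seq[B] => Seq[C]): Pipeline[A, C] = Pipeline(run andThen next)
  }

  def execute[A, B](src: Source[A], pipeline: Pipeline[A, B], sink: Sink[B]): Unit =
    sink.write(pipeline.run(src.read()))
}
```

Typing each stage is what lets the structure and schema of sources and sinks be checked and shared, which is the point of the "connect to sources (structure, schema/types, ...)" step above.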
43. That’s all folks
Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook
@Xtordoir & @Noootsab
Building Distributed Pipelines for Data Science using
Kafka, Spark, and Cassandra (from @DataFellas)
Check also @TypeSafe: http://t.co/o1Bt6dQtgH