Share and analyse genomic data
at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct 29th, 2015
Outline
• Sharp intro to Genomics data
• What are the Challenges
• Distributed Machine Learning to the rescue
• Projects: Distributed teams
• Research: Long process
• Towards Maximum Share for efficiency
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
Analyse Genomic Data at Scale
Spark, Adam, Spark Notebook
• Sharp intro to Genomics data
• What are the Challenges
• Distributed Machine Learning to the rescue
What is genomics data?
DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response,
disease mechanisms
On the production side
Fast biotech progress…
… can IT keep up?
On the production side
Sequence {A, T, G, C}
3 billion characters (bases)
… x 30 (x 60)
Massively parallel
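To get a feel for the volume, a back-of-the-envelope calculation; the genome size and coverage come from the slide, while the bytes-per-base figure is an assumption for illustration:

```scala
// ~3 billion bases, each position read ~30 times (60x in some studies).
val bases = 3000000000L
val coverage = 30
val rawBases = bases * coverage          // 90 billion base calls per genome
// Assuming roughly 1 byte per base call plus 1 byte per quality score:
val rawBytes = rawBases * 2L
val gigabytes = rawBytes / 1000000000L   // ~180 GB of raw calls at 30x
```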
Lots of data?
Lots of data!
10s of millions
1,000s
1,000,000s
...
ADAM: Spark genomics library
http://www.bdgenomics.org
Avro schema
Parquet storage
Genomics API
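As a rough sketch of what schema-first storage buys you. This is a simplified, hypothetical record; ADAM's real Avro schemas, such as its alignment record, carry many more fields:

```scala
// Hypothetical, much-simplified analogue of an ADAM alignment record.
case class AlignmentRecord(
  contigName: String, // reference contig, e.g. "chr1"
  start: Long,        // 0-based start position on the contig
  sequence: String,   // read bases over {A, T, G, C}
  quality: String     // per-base quality string
)

val read = AlignmentRecord("chr1", 10000L, "ATGC", "IIII")
// Because the schema fixes names and types, downstream code can rely on them:
val end = read.start + read.sequence.length
```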
So what do we do with this?
Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised
learning)
… and share and replay!
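A toy example of the kind of descriptive statistic involved: alternate-allele frequency per population, over invented genotype data (each genotype counts copies of the alternate allele: 0, 1 or 2):

```scala
// (population label, alt-allele count) for one variant; data is invented.
val genotypes = Seq(
  ("EUR", 0), ("EUR", 1), ("EUR", 2), ("EUR", 0),
  ("AFR", 2), ("AFR", 1), ("AFR", 2), ("AFR", 1)
)
// Alternate-allele frequency = alt copies / total alleles (2 per sample).
val altFreq: Map[String, Double] =
  genotypes.groupBy(_._1).map { case (pop, gts) =>
    pop -> gts.map(_._2).sum.toDouble / (2 * gts.size)
  }
// altFreq("EUR") == 0.375, altFreq("AFR") == 0.75
```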
The Spark Notebook
… comes to the rescue.
Spark: easy APIs
Self-described and consistent
Easily shared (code)
http://www.spark-notebook.io
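What "easy APIs" means in practice: Spark's RDD combinators mirror Scala's collection API, so a pipeline prototyped on a local collection reads the same when distributed. A plain-collections sketch:

```scala
// Mean coverage depth over a handful of positions, on a local collection.
val depths = Seq(28, 31, 30, 29, 32, 30)
val meanDepth = depths.map(_.toDouble).sum / depths.size
// Inside a notebook, the same chain distributes with almost no change:
//   sc.parallelize(depths).map(_.toDouble).mean()
```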
The Spark Notebook
So what do we do with this?
… and share and replay!
Code can be shared easily but we want better...
How do we share data produced by the notebook?
How do we publish the notebook as a service?
Share Genomic Data at Scale
Spark, Tachyon, Mesos, Shar3
• Projects: Distributed teams
• Research: Long process
• Towards Maximum Share for efficiency
Projects
Intrinsically involving many teams
geographically distributed across different
countries or laboratories
with different skills in
Biology, Genetics, I.T., Medicine (, legal...)
Projects
Require many types of data ranging from
bio samples
imagery
textual
archives/historical
Projects
Of course
Generally gather many people from several populations
Note: this is very expensive and burns a huge amount of time!
Projects
1,000 genomes (2008-2012): 200 TB
100,000 genomes (2013-2017): 20 PB (probably more)
1,000,000 genomes (2016-2020): 0.2 EB (probably more)
eQTL: mixing many sources
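The project sizes above scale roughly linearly at about 0.2 TB per genome; a quick check (kept in integer arithmetic as 2 TB per 10 genomes):

```scala
// ~0.2 TB per genome, applied to the three project sizes on the slide.
val genomes = Seq(1000, 100000, 1000000)
val terabytes = genomes.map(n => n * 2 / 10)
// 200 TB, 20000 TB (= 20 PB), 200000 TB (= 0.2 EB)
```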
Projects
Need proper data management between entities, yet
coping with:
amount of data
heterogeneity of people
distance between actors
constraints related to data location
Projects
Distribution-friendly:
SCHEMAS + BINARY
e.g. Avro
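A sketch of why "schemas + binary" travels well between distributed teams, mimicking with plain java.io what Avro does properly: fields are written in a fixed, schema-defined order, so any reader holding the schema can decode the bytes. The record shape and sample name here are invented:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

case class Sample(id: String, depth: Int) // the "schema": (string, int)

// Encode fields in schema order as compact binary.
def encode(s: Sample): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new DataOutputStream(buf)
  out.writeUTF(s.id); out.writeInt(s.depth); out.flush()
  buf.toByteArray
}

// Any party with the schema can decode, regardless of language or location.
def decode(bytes: Array[Byte]): Sample = {
  val in = new DataInputStream(new ByteArrayInputStream(bytes))
  Sample(in.readUTF(), in.readInt())
}

val roundTripped = decode(encode(Sample("NA12878", 30)))
```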
Research
Research in medicine or health in general is
LOOOOOOO…OOOOONG
Research
Most reasons are quite obvious but must not be overlooked
Lots of measures and validation
Lots of control (including by Gov.)
Lots of actors
Research
As a matter of fact, research needs
to be conducted on data and
to produce results
But both are highly exposed to reuse, so what if we lose
either of them?
Research
However, we can get into trouble instantly without even
losing them.
What if we don’t track the processes to go from one to the
other?
In any scientific process: confrontation, replay and
enhancement are key to move forward
It is misleading to think that sharing the code is enough.
Remember: we look for data and results, not for code.
The process includes the code, the context, the sources
and so on, and all should be part of the data
discovery/validation task
Research
Assess the risk factor associated with a disease given
mutations of a certain gene.
More than 50 years of data collecting and modelling.
Hundreds of researchers, each generation with new ideas.
Replaying old processes on new data,
new processes on old data
Research
Share share
share
All these facts relate to our capacity to share our work and
to collaborate.
We need to share efficiently and accurately
• data
• process
• results
Share share
share
The challenge resides in the workflow
Share share
share
“Create” Cluster
Find sources (context, quality, semantics, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
[Diagram: each step annotated with the roles involved (ops, data, sci, web)]
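The workflow steps above can be sketched as a typed pipeline where each stage's output feeds the next; all names here are illustrative, not a real Shar3 API:

```scala
// Illustrative stage types and stages; none of this is a real API.
case class Source(uri: String)
case class Dataset(rows: Seq[Int])
case class Result(mean: Double)

def findSources(): Seq[Source]     = Seq(Source("hdfs://example/genotypes"))
def connect(s: Source): Dataset    = Dataset(Seq(1, 2, 3))            // stand-in data
def model(d: Dataset): Result      = Result(d.rows.sum.toDouble / d.rows.size)
def writeToSink(r: Result): Result = r                                // would persist

val result = writeToSink(model(connect(findSources().head)))
// result.mean == 2.0
```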
Share share
share
Streamlining the development lifecycle
for better productivity
with Shar3
Share share
share
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project Generator
Micro Service /
Binary format
Schema for output
Metadata
That’s all folks
Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas
@Shar3_Fellas
@SparkNotebook
@Xtordoir & @Noootsab
Check also @TypeSafe: http://t.co/o1Bt6dQtgH
