Sparkling Water on the Spark Notebook: Interactive Genomes Clustering
Why you must care, by Data Fellas
Xavier Tordoir
xtordoir@data-fellas.guru
@xtordoir
● Apache Spark
● Interactivity: Spark Notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling Water
● Show time
● Streamlining dev/deployment
Lineup
Can’t wait!
Data Fellas
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
Distributed computing framework
Large Scale Data Processing engine
I play BIG!
What is Apache Spark?
Distributed computing framework
Large Scale Data Processing engine
● SQL & Dataframes
● Streaming
● Graph Processing
● Machine Learning
With all
colors!
What is Apache Spark?
Distributed computing framework
Large Scale Data Processing engine
● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model
Checking in cache
If I remember...
What is Apache Spark?
Distributed computing framework
Large Scale Data Processing engine
● Interactive
● @ any scale
http://spark-notebook.io
Laurel? Hardy?
Anyone?
What is Apache Spark?
● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● easy install, no deps.
● multiple SparkContexts out of the box.
What is Apache Spark?
http://bdgenomics.org/
ADAM Project (UC Berkeley):
● Data format (schema, compact, distributed): Avro + Parquet
● API (Reads, Variants, Genotypes, …)
I, ADAM
Genomics with Spark?
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Genomics
The data
Please, don’t mind
the colors...
Genomics
The data
So… that’s what
separates us huh?
1000 genomes: http://www.1000genomes.org/
~1000 samples
Few samples => Machine Learning
Genomics
The data
Woooow, really, you
must be kidding
me… ahahahahah
1000 genomes: http://www.1000genomes.org/
~1000 samples
~30M Genotypes per sample (features)
Few samples => Machine Learning
Lots of Data => Distributed computing
Genomics
The data
Oh… damned… hum
huh
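A back-of-the-envelope check of the figures on the data slides, as a sketch: ~1000 samples times ~30M genotypes per sample is about 3 × 10^10 genotype values. The one-byte-per-genotype encoding is an assumption for illustration only (it is not the ADAM storage layout), and the object and method names are hypothetical.

```scala
// Rough size of the 1000 genomes genotype matrix (samples x features).
// Assumption: one byte per genotype call; columnar formats compress far better.
object GenotypeMatrixSize {
  def totalGenotypes(samples: Long, genotypesPerSample: Long): Long =
    samples * genotypesPerSample

  def rawGigabytes(totalValues: Long, bytesPerValue: Int): Double =
    totalValues.toDouble * bytesPerValue / 1e9

  def main(args: Array[String]): Unit = {
    val total = totalGenotypes(1000L, 30000000L)
    println(s"genotype values: $total")                          // 30000000000
    println(s"raw size at 1 byte each: ${rawGigabytes(total, 1)} GB") // 30.0 GB
  }
}
```

Few rows, very many columns: that shape is why the slides call for both machine learning and distributed computing.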
Population stratification
w/ Deep Learning? H2O
From the Spark Notebook? Sparkling Water
Genomics
The problem
Here I need some
water.
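To make "population stratification" concrete, here is a minimal sketch of grouping samples by genotype. It swaps in a plain k-means on local Scala collections, not the H2O Deep Learning model or any Spark/Sparkling Water API used in the talk. The encoding (count of alternate alleles, 0, 1 or 2, per variant), the toy data and all names are assumptions.

```scala
// Toy k-means over genotype vectors, in plain Scala collections.
// Assumption: each genotype is encoded as the count of alternate alleles (0, 1 or 2).
// This is NOT the H2O or Spark API; it only illustrates grouping samples by genotype.
object ToyStratification {
  type Sample = Vector[Double]

  def squaredDistance(a: Sample, b: Sample): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the closest centroid; ties go to the lower index.
  def nearest(centroids: Vector[Sample], s: Sample): Int =
    centroids.indices.reduceLeft { (best, i) =>
      if (squaredDistance(centroids(i), s) < squaredDistance(centroids(best), s)) i else best
    }

  // Component-wise mean of a non-empty set of samples.
  def mean(samples: Vector[Sample]): Sample =
    samples.transpose.map(col => col.sum / col.size)

  // Returns the cluster index assigned to each sample.
  def kMeans(samples: Vector[Sample], initial: Vector[Sample], iterations: Int): Vector[Int] = {
    var centroids = initial
    for (_ <- 1 to iterations) {
      val assignment = samples.map(s => nearest(centroids, s))
      centroids = centroids.indices.toVector.map { i =>
        val members = samples.zip(assignment).collect { case (s, c) if c == i => s }
        if (members.isEmpty) centroids(i) else mean(members) // leave empty clusters in place
      }
    }
    samples.map(s => nearest(centroids, s))
  }

  def main(args: Array[String]): Unit = {
    val samples = Vector(
      Vector(0.0, 0.0, 1.0, 0.0),
      Vector(0.0, 1.0, 0.0, 0.0),
      Vector(2.0, 2.0, 1.0, 2.0),
      Vector(2.0, 1.0, 2.0, 2.0)
    )
    // Seed the centroids with the first and third sample (hypothetical choice).
    val clusters = kMeans(samples, Vector(samples(0), samples(2)), iterations = 5)
    println(clusters) // Vector(0, 0, 1, 1)
  }
}
```

With ~30M features per sample, the same idea needs distributed data structures, which is where Spark and H2O come in.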
In-memory implementation of “Map-Reduce”
Highly optimised structures for the JVM
Blazing fast convergent models
H2O
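The "Map-Reduce" idea on this slide, in miniature: each in-memory chunk of data is summarised independently (map), then the partial results are merged (reduce). The sketch below uses plain Scala collections, not H2O's own API. The chunk layout, the allele-count example and all names are assumptions.

```scala
// Map-reduce in miniature: each chunk of samples is mapped to partial per-variant
// alternate-allele counts, and the partial results are reduced by element-wise sum.
// Assumption: chunks stand in for partitions held in memory on different nodes,
// and every chunk contains at least one sample.
object ToyMapReduce {
  type Genotypes = Vector[Int] // alternate-allele count (0, 1 or 2) per variant

  // "Reduce" step: merge two partial results.
  def combine(a: Vector[Int], b: Vector[Int]): Vector[Int] =
    a.zip(b).map { case (x, y) => x + y }

  // "Map" step: summarise one chunk of samples.
  def mapChunk(chunk: Vector[Genotypes]): Vector[Int] =
    chunk.reduce((a, b) => combine(a, b))

  def alleleCounts(chunks: Vector[Vector[Genotypes]]): Vector[Int] =
    chunks.map(mapChunk).reduce((a, b) => combine(a, b))

  def main(args: Array[String]): Unit = {
    val chunks = Vector(
      Vector(Vector(0, 1, 2), Vector(1, 1, 0)), // chunk held by one worker
      Vector(Vector(2, 0, 0))                   // chunk held by another
    )
    println(alleleCounts(chunks)) // Vector(3, 2, 2)
  }
}
```

Because `combine` is associative, the chunks can be summarised in any order and on any node before merging.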
Higher API
H2O
Sparkling: in-memory data exchange
I remember things
better with two
copies in memory.
http://h2o.ai/product/sparkling-water/
Showtime!
press play...
There’s a notebook for that
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
Shar3 (Data Fellas)
ops
data
ops data
sci
sci ops
sci
ops data
web ops data
web ops data sci
Shar3 (Data Fellas)
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project
Generator
Micro Service /
Binary format
Schema for output
Metadata
Spark and the Notebook are interactive and leverage distributed computing infrastructure
ADAM is an optimized storage format for massive genomic data
Spark provides tools to manipulate data and works w/ other libraries like H2O
Data scientists and application developers can work together
Summary
Wake up, we’re back!
Acknowledgements
Frank Nothaft
Matt Massie
Neil Fergusson
Vinod & Michal
Thank you for your attention!
Questions?
And now let’s talk.

H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir