by Data Fellas,
Data Enthusiasts v 4.0 (July 13th, ’15)
Scalable and Interoperable data services
Applied to Genomics
Young Belgian Startup
The Data Fellas Startup
Data Science
Xavier Tordoir
@xtordoir
Andy Petrella
@noootsab
Data Processing
Scalable Machine Learning
Micro Services oriented
Data Fellas Ecosystem
We’ve worked with
Data Fellas: Evangelizing
Training
Scala
Apache Spark (BE, in September)
http://spark4devs.data-fellas.guru/
Distributed Machine Learning Pipeline (Oakland, August)
http://bigdatascala.bythebay.io/training.html
Apache Spark (SFO with BoldRadius, August)
Talks
Scala IO, Devoxx Belgium, Devoxx France, Scala Days, KTH, KUL, Spark Meetup London, …
more to come (Italy, …)
PMC Member at Strata NY
PMC Member at Devoxx
PMC Member at Foss4G
First: Data Science
Analysis
Spark Notebook
Production
Project Generator
Mesos / C* / DCOS
Distribution
Micro Service / Binary format
Marathon
Rendering
Schema for output
GG / D3 …
Discovery
Service Metadata
Solr, …
Catalog
Spark Notebook using Services too
Share Analyses
Share Results
Share Datasets
First: Data Science
Project Code Name:
Shar3
Next: Applied to Genomics
Genomics data is pretty big
● 100,000s of genomes in 2015
● 1,000,000s …
● 100,000,000s …
● …
Next: Applied to Genomics
Genomics data is pretty big and of high dimensionality
One genome:
○ a sequence of 3 billion bases (the basic DNA component)
○ 30-60x coverage for quality
○ tens to hundreds of millions of variants (bases that vary from one individual to the next)
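To put the figures above together, here is a back-of-the-envelope sketch of how many bases a sequencer reads for one genome. It uses only the slide's numbers (3 billion bases, 30 to 60x coverage); the function and constant names are illustrative, and it counts bases only, not bytes on disk.

```python
# Rough count of bases read per genome, using the slide's figures.
GENOME_BASES = 3_000_000_000   # ~3 billion bases in one genome
COVERAGE_RANGE = (30, 60)      # each base is read 30 to 60 times for quality


def sequenced_bases(genome_bases: int, coverage: int) -> int:
    """Total number of bases read by the sequencer at a given coverage."""
    return genome_bases * coverage


low, high = (sequenced_bases(GENOME_BASES, c) for c in COVERAGE_RANGE)
print(f"{low:,} to {high:,} bases read per genome")
```

So a single genome already means on the order of a hundred billion base reads before any variant calling happens.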
Next: Applied to Genomics
e.g. the 1000 Genomes Project:
● 200TB compressed data
● organised in files/directories
● data formatted following specs in a … PDF
Data and services schemas are required
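As a minimal sketch of what a machine-checkable schema buys over a spec in a PDF, the record below validates itself on construction. The `Variant` type and its field names are hypothetical and chosen for illustration; they are not taken from bdg-formats or from the GA4GH schemas.

```python
from dataclasses import dataclass

VALID_BASES = set("ACGT")


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; the fields are illustrative only."""
    contig: str      # chromosome or contig name
    position: int    # 1-based position on the contig
    reference: str   # reference allele
    alternate: str   # alternate allele

    def __post_init__(self):
        # Reject records that a free-form file would silently accept.
        if self.position < 1:
            raise ValueError("position must be 1-based and positive")
        for allele in (self.reference, self.alternate):
            if not allele or not set(allele) <= VALID_BASES:
                raise ValueError(f"invalid allele: {allele!r}")


v = Variant(contig="1", position=10177, reference="A", alternate="AC")
print(v)
```

With the schema in code, a malformed record fails at the boundary instead of deep inside an analysis.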
What do we do with genomics data?
Lots of Querying and Learning:
e.g.
● Population structure is a fundamental basis
● Querying relationships between genomes and other
biological features
Hey… no one has all data!
Metadata
What do we do with genomics data?
Lots of Querying and Learning:
e.g.
● We do some specific Modelling on some data…
Hey… no two serve the same computations!
Service Discovery
Interoperability
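One way to picture service discovery through metadata is a small in-memory catalog. This is a sketch under assumptions: `ServiceMetadata`, `Catalog`, the service names, and the example endpoints are all hypothetical, and a real deployment would presumably back the lookup with a search index rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceMetadata:
    """Hypothetical description of what one service holds and computes."""
    name: str
    endpoint: str
    datasets: set = field(default_factory=set)
    computations: set = field(default_factory=set)


class Catalog:
    """Toy in-memory registry, searchable by dataset and/or computation."""

    def __init__(self):
        self._services = []

    def register(self, service: ServiceMetadata) -> None:
        self._services.append(service)

    def find(self, dataset=None, computation=None):
        # A criterion left as None matches every service.
        return [
            s for s in self._services
            if (dataset is None or dataset in s.datasets)
            and (computation is None or computation in s.computations)
        ]


catalog = Catalog()
catalog.register(ServiceMetadata(
    "lab-a", "http://lab-a.example/api",
    {"1000genomes"}, {"population-structure"}))
catalog.register(ServiceMetadata(
    "lab-b", "http://lab-b.example/api",
    {"cohort-x"}, {"variant-query"}))

matches = catalog.find(computation="population-structure")
print([s.name for s in matches])
```

Since no one holds all the data and no two services offer the same computations, a client asks the catalog who can answer a question before it calls anyone.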
So, no one has all data …
BUT all should be able to talk…
Interoperability (GA4GH)
Interoperable…
Analysis
Production
Distribution
Rendering
Discovery
Share Analyses
Share Results
Share Datasets
Interoperable & scalable…
GA4GH + Shar3 = Med@Scale
+ ADAM & Spark
+ In-memory optimization (Tachyon)
+ Deployment (e.g. DCOS)
Wrap-up
Follow us @DataFellas and get notified about our
+ sharing platform at scale: Shar3
+ Google Genomics At Home (^.^): Med@Scale
+ future plans: modules for Trading, Geospatial,
other medical data, …
References
ADAM: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
Spark Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Training: http://spark4devs.data-fellas.guru/
Q/A
THANKS!

Data Enthusiasts London: Scalable and Interoperable data services. Applied to Genomics