Share and analyse genomic data
at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct • 29th • 2015
Outline
● Sharp intro to Genomics data
● What are the Challenges
● Distributed Machine Learning to the rescue
● Projects: Distributed teams
● Research: Long process
● Towards Maximum Share for efficiency
Andy Petrella
Maths
Geospatial
Distributed Computing
Spark Notebook
Trainer Spark/Scala
Machine Learning
Xavier Tordoir
Physics
Bioinformatics
Distributed Computing
Scala (& Perl)
trainer Spark
Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
Analyse Genomic At Scale
Spark, Adam, Spark Notebook
➔ Sharp intro to Genomics data
➔ What are the Challenges
➔ Distributed Machine Learning to the rescue
What is genomics data?
DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response,
disease mechanisms
On the production side
Fast biotech progress…
… can IT keep up?
On the production side
Sequence {A, T, G, C}
3 billion characters (bases)
… x 30 (x 60)
Massively parallel
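As a rough sense of scale, the arithmetic behind these figures can be sketched as follows. The one-byte-per-base encoding and the helper name are assumptions made for illustration; real sequencer output also carries quality scores and read metadata, so actual files are larger.

```python
# Back-of-the-envelope size of one sequenced genome.
# Assumption (not from the slides): each base is stored as one ASCII byte.
GENOME_BASES = 3_000_000_000   # ~3 billion bases {A, T, G, C}
BYTES_PER_BASE = 1             # hypothetical: plain-text encoding


def raw_sequence_bytes(coverage: int) -> int:
    """Bytes needed for the bases alone when each position is read `coverage` times."""
    return GENOME_BASES * coverage * BYTES_PER_BASE


for coverage in (30, 60):
    gigabytes = raw_sequence_bytes(coverage) / 1e9
    print(f"{coverage}x coverage: about {gigabytes:.0f} GB of bases")
```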
Lots of data?
Lots of data!
10s of millions
1,000s
1,000,000s
...
ADAM: Spark genomics library
http://www.bdgenomics.org
Matt Massie
Frank Nothaft
ADAM: Spark genomics library
Avro schema
Parquet storage
Genomics API
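A minimal sketch of why a schema plus columnar storage suits this kind of data: once records are pivoted into columns, a query that touches one field reads only that column. The record fields below are hypothetical and much simpler than ADAM's actual Avro schemas, and plain Python lists stand in for Parquet column chunks.

```python
from typing import Dict, List

# Hypothetical, simplified read records (not ADAM's real schema).
SCHEMA = ("contig", "start", "sequence", "mapq")

rows = [
    {"contig": "chr1", "start": 100, "sequence": "ATGC", "mapq": 60},
    {"contig": "chr1", "start": 250, "sequence": "GGCA", "mapq": 12},
    {"contig": "chr2", "start": 40, "sequence": "TTAC", "mapq": 57},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per schema field."""
    return {field: [r[field] for r in records] for field in SCHEMA}


columns = to_columns(rows)

# A query on mapping quality only touches the "mapq" column;
# the (much larger) "sequence" column is never read.
high_quality = sum(1 for q in columns["mapq"] if q >= 30)
print(high_quality)
```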
So what do we do with this?
Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised
learning)
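As one illustration of a descriptive statistic on variations between populations, the sketch below computes the alternate-allele frequency of a single variant per population. The sample data, population labels and function name are made up for the example; genotypes are encoded as the count of alternate alleles (0, 1 or 2) in a diploid sample.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def alt_allele_frequency(samples: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the alternate-allele frequency per population.

    Each sample is (population, alt_count) with alt_count in {0, 1, 2}.
    """
    alt = defaultdict(int)      # alternate alleles seen per population
    total = defaultdict(int)    # chromosomes seen per population (2 per sample)
    for population, alt_count in samples:
        alt[population] += alt_count
        total[population] += 2
    return {pop: alt[pop] / total[pop] for pop in total}


# Hypothetical genotypes for one variant.
samples = [("POP_A", 0), ("POP_A", 1), ("POP_A", 2), ("POP_B", 0), ("POP_B", 0)]
frequencies = alt_allele_frequency(samples)
print(frequencies)
```

The per-population sums are a plain keyed aggregation, which is the kind of operation a distributed engine can run partition by partition.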
… and share and replay!
The Spark Notebook
… comes to the rescue.
+ Self described and consistent
+ Easily shared (code)
+ Scala (types, production quality)
+ Reactive & pluggable charts API (scala = no.js)
+ Easy install, no dependencies
+ Multiple SparkContexts
http://www.spark-notebook.io
So what do we do with this?
… and share and replay!
Code can be shared easily but we want more...
How do we share data produced by the notebook?
How do we publish the notebook as a service?
Share Genomic At Scale
Spark, Tachyon, Mesos, Shar3
➔ Projects: Distributed teams
➔ Research: Long process
➔ Towards Maximum Share for efficiency
Projects
Intrinsically involving many teams
geographically distributed in different
countries or laboratories
with different skills in
Biology, Genetics, I.T., Medicine (, legal...)
Projects
Require many types of data ranging from
bio samples
imagery
textual
archives/historical
Projects
Of course
Generally gather many people from several populations
Note: This is very expensive and burns $time as hell!
Projects
1,000 genomes (2008-2012): 200 TB
100,000 genomes (2013-2017): 20 PB (probably more)
1,000,000 genomes (2016-2020): 0.2 EB (probably more)
eQTL: mixing many sources
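The three project sizes above imply a roughly constant footprint per genome, which is easy to check. The sketch reads the sizes as decimal terabytes, petabytes and exabytes and takes the quoted totals at face value, ignoring the "probably more" caveat.

```python
# Sizes from the slide, read as decimal units (1 TB = 10**12 bytes).
TB, PB, EB = 10**12, 10**15, 10**18

projects = {
    "1,000 genomes": (1_000, 200 * TB),
    "100,000 genomes": (100_000, 20 * PB),
    "1,000,000 genomes": (1_000_000, 2 * EB // 10),   # 0.2 EB
}

per_genome_gb = {
    name: total_bytes / genomes / 10**9
    for name, (genomes, total_bytes) in projects.items()
}

for name, gigabytes in per_genome_gb.items():
    print(f"{name}: {gigabytes:.0f} GB per genome")
```

Under these assumptions every project works out to about 200 GB per genome, so the growth comes from the number of genomes, not from the size of each one.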
Projects
Need proper data management between entities, yet
coping with:
amount of data
heterogeneity of people
distance between actors
constraints related to data location
Projects
Distributed friendly
SCHEMAS + BINARY
e.g. Avro
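To illustrate what "schemas + binary" buys, here is a minimal sketch in which a shared schema drives a compact binary encoding that any team holding the schema can decode. It uses Python's struct module rather than Avro itself, and the record layout and values are hypothetical.

```python
import struct

# Hypothetical shared schema: field names plus a fixed binary layout.
# "<" little-endian, "4s" 4-byte contig name, "I" unsigned 32-bit position, "c" one base.
FIELDS = ("contig", "position", "alt_base")
LAYOUT = struct.Struct("<4sIc")


def encode(record: dict) -> bytes:
    """Serialise a record to bytes following the shared schema."""
    return LAYOUT.pack(
        record["contig"].encode("ascii"),
        record["position"],
        record["alt_base"].encode("ascii"),
    )


def decode(payload: bytes) -> dict:
    """Rebuild the record from bytes using the same schema."""
    contig, position, alt_base = LAYOUT.unpack(payload)
    values = (contig.rstrip(b"\x00").decode("ascii"), position, alt_base.decode("ascii"))
    return dict(zip(FIELDS, values))


variant = {"contig": "chr7", "position": 123456, "alt_base": "T"}   # made-up variant
payload = encode(variant)
print(len(payload), decode(payload) == variant)
```

The payload carries no field names at all; both sides rely on the schema, which is what keeps the binary form small and unambiguous.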
Research
Research in medicine or health in general is
LOOOOOOO…OOOOONG
Research
Most reasons are quite obvious and must not be overlooked
Lots of measures and validation
Lots of control (including by Gov.)
Lots of actors
Research
As a matter of fact, research needs
to be conducted on data and
to produce results
And both are extremely exposed to reuse
So what if we lose either of them?
Research
However, we can get into trouble instantly
without even losing them!
What if we don’t track the processes?
In any scientific process: confrontation, replay and
enhancement are keys to move forward
It is misleading to think that sharing the code is enough.
Remember: we are looking for data and results, not for code.
The process includes the code, the context, the sources
and so on, and all should be part of the data
discovery/validation task
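One way to make the process part of the data is to store a provenance record next to every result. The sketch below is an assumption about what such a record could contain (a fingerprint of the code, the source identifiers, the parameters); the field names and sample values are hypothetical.

```python
import hashlib
import json


def provenance_record(code: str, sources: list, params: dict, result: dict) -> dict:
    """Bundle a result with what is needed to replay or challenge it."""
    return {
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "sources": sorted(sources),
        "params": params,
        "result": result,
    }


record = provenance_record(
    code="frequencies = count_alt_alleles(variants)",   # hypothetical notebook cell
    sources=["cohort-2015.adam", "panel.tsv"],           # hypothetical dataset names
    params={"min_quality": 30},
    result={"variants_kept": 1204},
)

# Serialised with sorted keys so identical runs give identical bytes.
serialised = json.dumps(record, sort_keys=True)
print(serialised)
```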
Research
Assess the risk factor associated with a disease given
mutations of a certain gene.
More than 50 years of data collecting and modelling.
Hundreds of researchers, each generation has new ideas.
Replaying old processes on new data,
new processes on old data
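A common way to quantify such a risk factor is the odds ratio from a 2x2 table of carriers versus non-carriers and affected versus unaffected individuals. The slides do not say which measure is used, so this is only an illustration, and the counts are invented.

```python
def odds_ratio(carriers_affected: int, carriers_unaffected: int,
               noncarriers_affected: int, noncarriers_unaffected: int) -> float:
    """Odds of disease among mutation carriers divided by odds among non-carriers."""
    carrier_odds = carriers_affected / carriers_unaffected
    noncarrier_odds = noncarriers_affected / noncarriers_unaffected
    return carrier_odds / noncarrier_odds


# Invented counts, for illustration only.
ratio = odds_ratio(carriers_affected=30, carriers_unaffected=70,
                   noncarriers_affected=10, noncarriers_unaffected=90)
print(round(ratio, 2))
```

Replaying an old process on new data then amounts to calling the same function on a fresh table of counts, which only works if both the function and the original counts were kept.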
Research
Share share share
All these facts relate to our capacity to share our work and
to collaborate.
We need to share efficiently and accurately the
★ data
★ processes
★ results
Share share share
The challenge resides in the workflow
Share share share
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performance
Write results to Sinks
Access Layer
User Access
[Figure: each step is annotated with the roles it involves, growing from ops alone to web, ops, data and sci together]
Share share share
Streamlining development lifecycle
for better Productivity
with Shar3
Share share share
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project
Generator
Micro Service /
Binary format
Schema for output
Metadata
That’s all folks
Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook
@Xtordoir & @Noootsab
Building Distributed Pipelines for Data Science using
Kafka, Spark, and Cassandra (form → @DataFellas)
Check also @TypeSafe: http://t.co/o1Bt6dQtgH

More Related Content

What's hot

Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadofnothaft
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMfnothaft
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationTimothy Danford
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleDatabricks
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...Databricks
 
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB
 

What's hot (20)

Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Genome Big Data
Genome Big DataGenome Big Data
Genome Big Data
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Strata-Hadoop 2015 Presentation
Strata-Hadoop 2015 PresentationStrata-Hadoop 2015 Presentation
Strata-Hadoop 2015 Presentation
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at Scale
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
Building Genomic Data Processing and Machine Learning Workflows Using Apache ...
 
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 

Viewers also liked

Overcoming barriers for genomic data sharing yaac presentation may 23 2015
Overcoming barriers for genomic data sharing   yaac presentation may 23 2015Overcoming barriers for genomic data sharing   yaac presentation may 23 2015
Overcoming barriers for genomic data sharing yaac presentation may 23 2015Fiona Nielsen
 
Legal and regulatory challenges to data sharing for clinical genetics and ge...
Legal and regulatory challenges to  data sharing for clinical genetics and ge...Legal and regulatory challenges to  data sharing for clinical genetics and ge...
Legal and regulatory challenges to data sharing for clinical genetics and ge...Human Variome Project
 
Nci clinical genomics data sharing ncra sept 2016
Nci clinical genomics data sharing ncra sept 2016Nci clinical genomics data sharing ncra sept 2016
Nci clinical genomics data sharing ncra sept 2016Warren Kibbe
 
Data Driven Business Model: le opportunità di monetizzazione
Data Driven Business Model: le opportunità  di monetizzazioneData Driven Business Model: le opportunità  di monetizzazione
Data Driven Business Model: le opportunità di monetizzazioneData Driven Innovation
 
BigData: una nuova fonte per la ricerca storica
BigData: una nuova fonte per la ricerca storicaBigData: una nuova fonte per la ricerca storica
BigData: una nuova fonte per la ricerca storicaData Driven Innovation
 
Language Translation re-invented with Big Data
Language Translation re-invented with Big DataLanguage Translation re-invented with Big Data
Language Translation re-invented with Big DataData Driven Innovation
 
Data Driven UX - From Social networks to target audience
Data Driven UX - From Social networks to target audienceData Driven UX - From Social networks to target audience
Data Driven UX - From Social networks to target audienceData Driven Innovation
 
4th industrial revolution – impact of data on the real world
4th industrial revolution – impact of data on the real world4th industrial revolution – impact of data on the real world
4th industrial revolution – impact of data on the real worldData Driven Innovation
 
INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...
INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...
INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...Data Driven Innovation
 
Il valore delle Indicazioni Geografiche nell'economia italiana - Mauro Rosati
Il valore delle Indicazioni Geografiche nell'economia italiana - Mauro RosatiIl valore delle Indicazioni Geografiche nell'economia italiana - Mauro Rosati
Il valore delle Indicazioni Geografiche nell'economia italiana - Mauro RosatiData Driven Innovation
 
In Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. Rossi
In Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. RossiIn Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. Rossi
In Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. RossiData Driven Innovation
 
Enhanced site search with cognitive APIs - Glynn Bird
Enhanced site search with cognitive APIs - Glynn BirdEnhanced site search with cognitive APIs - Glynn Bird
Enhanced site search with cognitive APIs - Glynn BirdData Driven Innovation
 
Innovazione per la PA - Andrea D'Acunto
Innovazione per la PA - Andrea D'AcuntoInnovazione per la PA - Andrea D'Acunto
Innovazione per la PA - Andrea D'AcuntoData Driven Innovation
 
LCA as an innovation tool - Barilla - Luca Ruini
LCA as an innovation tool - Barilla - Luca RuiniLCA as an innovation tool - Barilla - Luca Ruini
LCA as an innovation tool - Barilla - Luca RuiniData Driven Innovation
 
L’etica nella società dell’intelligenza artificiale - Edmondo Grassi
L’etica nella società dell’intelligenza artificiale - Edmondo GrassiL’etica nella società dell’intelligenza artificiale - Edmondo Grassi
L’etica nella società dell’intelligenza artificiale - Edmondo GrassiData Driven Innovation
 
Holographic Data Visualization - M. Valoriani & A. Musone
Holographic Data Visualization - M. Valoriani & A. MusoneHolographic Data Visualization - M. Valoriani & A. Musone
Holographic Data Visualization - M. Valoriani & A. MusoneData Driven Innovation
 
Towards intelligent data insights in central banks: challenges and opportunit...
Towards intelligent data insights in central banks: challenges and opportunit...Towards intelligent data insights in central banks: challenges and opportunit...
Towards intelligent data insights in central banks: challenges and opportunit...Data Driven Innovation
 

Viewers also liked (20)

Overcoming barriers for genomic data sharing yaac presentation may 23 2015
Overcoming barriers for genomic data sharing   yaac presentation may 23 2015Overcoming barriers for genomic data sharing   yaac presentation may 23 2015
Overcoming barriers for genomic data sharing yaac presentation may 23 2015
 
Legal and regulatory challenges to data sharing for clinical genetics and ge...
Legal and regulatory challenges to  data sharing for clinical genetics and ge...Legal and regulatory challenges to  data sharing for clinical genetics and ge...
Legal and regulatory challenges to data sharing for clinical genetics and ge...
 
Ingesting click events for analytics
Ingesting click events for analyticsIngesting click events for analytics
Ingesting click events for analytics
 
Nci clinical genomics data sharing ncra sept 2016
Nci clinical genomics data sharing ncra sept 2016Nci clinical genomics data sharing ncra sept 2016
Nci clinical genomics data sharing ncra sept 2016
 
Genomic Data Analysis
Genomic Data AnalysisGenomic Data Analysis
Genomic Data Analysis
 
Data Driven Business Model: le opportunità di monetizzazione
Data Driven Business Model: le opportunità  di monetizzazioneData Driven Business Model: le opportunità  di monetizzazione
Data Driven Business Model: le opportunità di monetizzazione
 
Data culture
Data cultureData culture
Data culture
 
BigData: una nuova fonte per la ricerca storica
BigData: una nuova fonte per la ricerca storicaBigData: una nuova fonte per la ricerca storica
BigData: una nuova fonte per la ricerca storica
 
Language Translation re-invented with Big Data
Language Translation re-invented with Big DataLanguage Translation re-invented with Big Data
Language Translation re-invented with Big Data
 
Data Driven UX - From Social networks to target audience
Data Driven UX - From Social networks to target audienceData Driven UX - From Social networks to target audience
Data Driven UX - From Social networks to target audience
 
4th industrial revolution – impact of data on the real world
4th industrial revolution – impact of data on the real world4th industrial revolution – impact of data on the real world
4th industrial revolution – impact of data on the real world
 
INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...
INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...
INDUSTRIA 4.0 - Il trasferimento tecnologico attraverso i Digital Innovation ...
 
Il valore delle Indicazioni Geografiche nell'economia italiana - Mauro Rosati
Il valore delle Indicazioni Geografiche nell'economia italiana - Mauro RosatiIl valore delle Indicazioni Geografiche nell'economia italiana - Mauro Rosati
Il valore delle Indicazioni Geografiche nell'economia italiana - Mauro Rosati
 
In Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. Rossi
In Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. RossiIn Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. Rossi
In Codice Ratio: analisi data driven di fonti storiche - P. Merialdo & A. Rossi
 
Enhanced site search with cognitive APIs - Glynn Bird
Enhanced site search with cognitive APIs - Glynn BirdEnhanced site search with cognitive APIs - Glynn Bird
Enhanced site search with cognitive APIs - Glynn Bird
 
Innovazione per la PA - Andrea D'Acunto
Innovazione per la PA - Andrea D'AcuntoInnovazione per la PA - Andrea D'Acunto
Innovazione per la PA - Andrea D'Acunto
 
LCA as an innovation tool - Barilla - Luca Ruini
LCA as an innovation tool - Barilla - Luca RuiniLCA as an innovation tool - Barilla - Luca Ruini
LCA as an innovation tool - Barilla - Luca Ruini
 
L’etica nella società dell’intelligenza artificiale - Edmondo Grassi
L’etica nella società dell’intelligenza artificiale - Edmondo GrassiL’etica nella società dell’intelligenza artificiale - Edmondo Grassi
L’etica nella società dell’intelligenza artificiale - Edmondo Grassi
 
Holographic Data Visualization - M. Valoriani & A. Musone
Holographic Data Visualization - M. Valoriani & A. MusoneHolographic Data Visualization - M. Valoriani & A. Musone
Holographic Data Visualization - M. Valoriani & A. Musone
 
Towards intelligent data insights in central banks: challenges and opportunit...
Towards intelligent data insights in central banks: challenges and opportunit...Towards intelligent data insights in central banks: challenges and opportunit...
Towards intelligent data insights in central banks: challenges and opportunit...
 

Similar to Spark Summit Europe: Share and analyse genomic data at scale

Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...Dataconomy Media
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science Carole Goble
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009Ian Foster
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker, Inc.
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformaticsc.titus.brown
 
Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect WorldVital.AI
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EITESANGO
 
2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dcc.titus.brown
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceCarole Goble
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data EcosystemsReal-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data EcosystemsAnita de Waard
 

Similar to Spark Summit Europe: Share and analyse genomic data at scale (20)

Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics d...
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
Reproducible Research and the Cloud
Reproducible Research and the CloudReproducible Research and the Cloud
Reproducible Research and the Cloud
 
Services For Science April 2009
Services For Science April 2009Services For Science April 2009
Services For Science April 2009
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect World
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017 EiTESAL eHealth Conference 14&15 May 2017
EiTESAL eHealth Conference 14&15 May 2017
 
2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc
 
Software Sustainability: Better Software Better Science
Software Sustainability: Better Software Better ScienceSoftware Sustainability: Better Software Better Science
Software Sustainability: Better Software Better Science
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data EcosystemsReal-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
Big Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARLBig Data & DS Analytics for PAARL
Big Data & DS Analytics for PAARL
 

More from Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale

  • 1. Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook by @DataFellas, Oct • 29th • 2015
  • 2. Outline ● Sharp intro to Genomics data ● What are the Challenges ● Distributed Machine Learning to the rescue ● Projects: Distributed teams ● Research: Long process ● Towards Maximum Share for efficiency
  • 3. Andy Petrella: Maths, Geospatial, Distributed Computing, Spark Notebook, Trainer Spark/Scala, Machine Learning. Xavier Tordoir: Physics, Bioinformatics, Distributed Computing, Scala (& Perl), Trainer Spark, Machine Learning. “There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
  • 4. Analyse Genomic At Scale Spark, Adam, Spark Notebook ➔ Sharp intro to Genomics data ➔ What are the Challenges ➔ Distributed Machine Learning to the rescue
  • 5. What is genomics data? DNA? What makes us what we are… … a complex biochemical soup. With applications to medical diagnostics, drug response, disease mechanisms
  • 7. On the production side Fast biotech progress… … can IT keep up?
  • 8. On the production side Sequence {A, T, G, C} 3 billion characters (bases)
  • 9. On the production side Sequence {A, T, G, C} 3 billion characters (bases) … x 30 (x 60) Massively parallel
  • 12. Lots of data! 10’s millions 1,000s 1,000,000s ...
  • 13. ADAM: Spark genomics library http://www.bdgenomics.org Matt Massie Frank Nothaft
  • 17. ADAM: Spark genomics library Avro schema Parquet storage Genomics API
  • 18. So what do we do with this? Study variations between populations Descriptive statistics Machine Learning (Population stratification or Supervised learning) … and share and replay!
  • 19. The Spark Notebook … comes to the rescue. + Self described and consistent + Easily shared (code) + Scala (types, production quality) + Reactive & pluggable charts API (scala = no.js) + Easy install, no deps. + Multiple sparkContext http://www.spark-notebook.io
  • 23. So what do we do with this? … and share and replay! Code can be shared easily but we want more... How do we share data produced by the notebook? How do we publish the notebook as a service?
  • 24. Share Genomic At Scale Spark, Tachyon, Mesos, Shar3 ➔ Projects: Distributed teams ➔ Research: Long process ➔ Towards Maximum Share for efficiency
  • 25. Projects Intrinsically involve many teams, geographically distributed across different countries or laboratories, with different skills in Biology, Genetics, I.T., Medicine (, legal...)
  • 26. Projects Require many types of data ranging from bio samples imagery textual archives/historical
  • 27. Projects Of course, they generally gather many people from several populations. Note: this is very expensive and burns $time as hell!
  • 28. Projects 1,000 genomes (2008-2012): 200 TB; 100,000 genomes (2013-2017): 20 PB (probably more); 1,000,000 genomes (2016-2020): 0.2 EB (probably more). eQTL: mixing many sources
  • 29. Projects Need proper data management between entities, yet coping with: amount of data heterogeneity of people distance between actors constraints related to data location
  • 31. Research Research in medicine or health in general is LOOOOOOO…OOOOONG
  • 32. Research Most reasons are quite obvious and must not be overlooked Lots of measures and validation Lots of control (including by Gov.) Lots of actors
  • 33. Research As a matter of fact, research needs to be conducted on data and to produce results And both are extremely exposed to reuse So what if we lose either of them?
  • 34. Research However, we can get into trouble instantly without even losing them! What if we don’t track the processes? In any scientific process, confrontation, replay and enhancement are key to moving forward
  • 35. Research It is misleading to think that sharing the code is enough. Remember: we look for data and results, not for code. The process includes the code, the context, the sources and so on, and all should be part of the data discovery/validation task
  • 36. Research Assess the risk factor associated with a disease given mutations of a certain gene. More than 50 years of data collecting and modelling. Hundreds of researchers; each generation has new ideas. Replaying old processes on new data, new processes on old data
  • 37. Share share share All these facts relate to our capacity to share our work and to collaborate. We need to share efficiently and accurately the ★ data ★ processes ★ results
  • 38. Share share share The challenge resides in the workflow
  • 39. Share share share Workflow steps: “Create” cluster; find sources (context, quality, semantic, …); connect to sources (structure, schema/types, …); create distributed data pipeline/model; tune accuracy; tune performances; write results to sinks; access layer; user access. Role labels, in slide order: ops data ops data sci sci ops sci ops data web ops data web ops data sci
  • 41. Share share share Streamlining development lifecycle for better Productivity with Shar3
  • 43. That’s all folks Thanks for listening/staying Poke us on Twitter or via http://data-fellas.guru @DataFellas @Shar3_Fellas @SparkNotebook @Xtordoir & @Noootsab Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas) Check also @TypeSafe: http://t.co/o1Bt6dQtgH
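
The volumes quoted on slides 8, 9 and 28 can be cross-checked with a little arithmetic. The sketch below is an illustration added to this transcript, not part of the talk: it assumes decimal (SI) byte units, reads the project totals as 200 terabytes, 20 petabytes and 0.2 exabytes, and uses hypothetical names (`projects`, `per_genome_gb`, `bases_sequenced`) that appear nowhere in the slides.

```python
# Decimal (SI) byte units. This convention is an assumption;
# the slides do not say which one they use.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

# Slides 8-9: a genome is about 3 billion bases, sequenced at x30 (or x60) coverage.
GENOME_BASES = 3_000_000_000

def bases_sequenced(coverage):
    """Total number of bases read for one genome at the given coverage."""
    return GENOME_BASES * coverage

# Slide 28: (number of genomes, total storage in bytes) for each project.
projects = {
    "1,000 genomes (2008-2012)": (1_000, 200 * TB),
    "100,000 genomes (2013-2017)": (100_000, 20 * PB),
    "1,000,000 genomes (2016-2020)": (1_000_000, 2 * EB // 10),  # 0.2 EB
}

def per_genome_gb(n_genomes, total_bytes):
    """Average storage per genome, in gigabytes."""
    return total_bytes / n_genomes / GB

per_genome = {name: per_genome_gb(n, size) for name, (n, size) in projects.items()}

print(f"x30 coverage: {bases_sequenced(30):,} bases per genome")
print(f"x60 coverage: {bases_sequenced(60):,} bases per genome")
for name, gb in per_genome.items():
    print(f"{name}: ~{gb:.0f} GB per genome")
```

Under these assumptions all three projects work out to the same average of about 200 GB per genome, so the quoted totals scale linearly with the number of genomes.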