Welcome & Today’s Schedule
09:00 Registration and breakfast
10:00 How I learned to stop worrying and love version control – Dr Stephen J Newhouse and Luke Marsden
10:40 Effective computing for research reproducibility – Dr Laura Fortunato
11:20 Morning Break
11:40 A crazy little thing called reproducible science – Dr Tania Allard
12:20 Machine Learning in Production - A practical approach to continuous deployment of Machine Learning pipelines – Luca Palmieri & Christos Dimitroulas
13:00 Lunch
14:00 – 17:30 Version Control for your Model, Data and Environment – Workshop
18:00 Networking drinks
#rapids2018
@getdotmesh
Thank you!
Please tweet!
How I learned to stop worrying
and love version control
Steve Newhouse & Luke Marsden
Who am I?
Dr. Stephen J Newhouse
Lead Data Scientist & Senior Bioinformatician
⇢ KCL Department of Biostatistics and Health Informatics
⇢ NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust
⇢ UCL Institute of Health Informatics & Health Data Research (HDR) UK
Our (broad) interests...
⇢ From Bench (to Computer) to Bedside…
⇢ Collaborative & Open Research…
⇢ Data & knowledge sharing…
⇢ DevOps applied to Health Care/Research
⇢ Personalised Medicine & the Quantified Self
Provenance & Reproducibility
“Provenance is the Missing Feature for Rigorous Data Science”
Joe Doliner
Co-Founder, CEO of Pachyderm
Hazy: Making it Easier to Build and Maintain Big-data Analytics
“The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms”
“The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms In A Fully Reproducible Way And Under Complete Provenance”
Steve Newhouse, RAPIDS 2018
Provenance & Reproducibility
⇢ Provenance: “Origin of something…”
⇢ Track & document the data source and the “models”
⇢ Who, What, Where, Why, When and How at every stage of ETL, EDA, ML, Reporting
⇢ Captures dependency between data sets: enables reproducibility
⇢ Results can be traced back to their origins and recomputed from scratch:
+ Good for REPRODUCIBILITY
+ Good Practice (should be BEST PRACTICE)
+ Good for GDPR!
Why RAPIDS is important to me
HDR UK: A new national Institute for health data science
⇢ Structured and unstructured (e.g. imaging, text) data for derivation of new or deep phenotypes
⇢ Adding value at scale to existing world-leading cohorts in the UK
⇢ Demonstrating system-wide opportunities for research that improves quality of care
⇢ Enable large scale, high-throughput research that combines genomic data with electronic health records
⇢ Genomics, epigenomics, statistical and complex genetics, population genetics, cancer ‘omics’, molecular epidemiology
Actionable Health Data Analytics
Precision Medicine
HDR UK: A new national Institute for health data science
⇢ Transform Phase II – Phase IV clinical trials including ‘real world evidence’ studies
⇢ Towards prevention & early intervention
⇢ Ability to link health and administrative datasets across multiple environments
⇢ New technologies, from sensors to wearable devices to artificial intelligence
21st Century Trial Design
Modernising Public Health
Training Future Leaders in health data science
Large-scale machine learning & mixed ‘omic analysis strategies for patient care...
It’s All About The Data and #datasaveslives
[Diagram: Investigator, Monitor, Laboratories, Data Manager, Analyst, Clinician, Results, Clinical Data]
How did I learn to stop worrying
and love version control?
I made the transition from Lab
tech to Bioinformatician
The things you
discover... new ways
of working that just
made sense
And the crap we do/did that contributed to… The Reproducibility Crisis
My new world
⇢ Git, GitHub & READMEs...
⇢ R & Python Notebooks...
⇢ Meta-Data, Ontologies...
⇢ Shared Code…
⇢ Open Science, Open Data…
⇢ FAIR Principles…
⇢ Community!
My old world
⇢ I started in the lab... I was a Molecular Biologist/Geneticist
⇢ Provenance & Reproducibility == The Lab book, Publications…
⇢ Data stored on local Internal/External and University Drives/HPC
Lab books: RAPIDS the old way
⇢ Record Everything
⇢ Name, Version, Project, Date
⇢ Materials & Methods
⇢ Signed off (sometimes)
⇢ Double Checked (sometimes)
⇢ Varying level of Detail: one-liners to prose...
We (Lab folks) are kind of doing it anyway... with lab books
How we (Basic Academia) often do
Version Control
⇢ Formal Version Control & Data Provenance? Nope, not really
⇢ Documentation/Report: Excel, Word, PowerPoint & Images... cut and pasted into the lab book... heavily edited for publication...
⇢ Analysis: GUIs/SPSS, barplots in Excel... steps taken and software versions are never recorded until publication...
⇢ Data: Local HDD, HPC, Dropbox... often only the location is recorded…
⇢ Document & Data versions can often be overwritten or lost and then there is this...
There is little to no formal Data/Model provenance & Version Control: The Story We Tell
[Diagram: Raw Data, v1, v2, v3, Final; Publication 1, Publication 2, Publication 3]
There is little to no formal Data/Model provenance & Version Control: The Truth...
[Diagram: Raw Data and a tangle of ad hoc file versions (my_data, v1.xx, v2, xxx, zzz, v2xy and several files named Final) behind Publication 1, Publication 2, Publication 3]
In Academia, a lot of folks* don’t do Version Control & Provenance. If they do, it’s haphazard and under duress**. It is not standard operating procedure!
*Academics/basic researchers: Statisticians, Economists, Bio/health-informaticians, Biologists and clinicians who can do R/Python/Stata/SAS/SPSS.
**Extra work needed in keeping lab books, documenting everything, cleaning code, sharing code... not used to this way of working
Why?
Culture
Lack of awareness
Lack of education
It is not enforced in many labs
Incentives are not aligned with RAPIDS
Pressure to Publish Quickly
“[I was] completely unaware of robust solutions & common best practices from the Software/DevOps World… So were my supervisors!”
The term Provenance was
never mentioned
Replication/Reproducibility
were just catch-phrases
Plus, this was all supposed to be captured in our Lab books… and then in the publication? Right?....
I may be a bit cynical but...
Some sobering reading
The public are on to us – the ‘shoddy’ scientists
⇢ “Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis” / The Economist
⇢ How science goes wrong / The Economist
⇢ Trouble at the lab / The Economist
⇢ Is It Tough Love Time For Science? / The Big Think
⇢ Some of this is purely down to bad data management (& bad practices & lack of awareness & lack of education & so on and so forth…)
Garbage in, Garbage out (GIGO)
Bad/NO Data Management/Experimental Design/Analysis Plan
⇢ Spurious results/False positives and negatives
⇢ Translational research suffers
⇢ The patients suffer
⇢ Lies are published
⇢ Time & Money wasted (Charity, Public, Private…)
⇢ There is no real progress
⇢ Serious Legal & Ethical implications: GDPR!
The Reproducibility Crisis: It's not just the
fields of psychology & medicine...
Matthew Hutson said...
“Artificial intelligence faces reproducibility crisis” Matthew Hutson, Science 2018
AI faces a reproducibility crisis
⇢ “I think people outside the field might assume that because we have code, reproducibility is kind of guaranteed,” … “Far from it.”
⇢ The most basic problem is that researchers often don’t share their source code (and their Data)
⇢ “The exact way that you run your experiments is full of undocumented assumptions and decisions,” … “A lot of this detail never makes it into papers.”
⇢ “No time to document every hyperparameter”...
Some common issues
⇢ Common misconception: Only CS/Software Devs need to do it
⇢ Lack of awareness from the top down & bottom up: the lab lead/PI does not know about Git and/or has not signed up to OPEN SCIENCE
⇢ Personalities/Culture/Environment: Why should I share? Data Hoarding…
⇢ Fear of being judged, Fear of the unknown, Fear of the command line
⇢ Laziness? Adding extra steps to their workflow: “I have to do what now???”
There is a need for Reproducibility and
Provenance in Data Science
There is a need for Reproducibility and
Provenance in Everything
“One of the largest sources of error in [Data] Science results from computing [and publishing] results from different versions of the same data set.”
And using different versions of the same software…
And using different versions/implementations of the “same” algorithm
And failing to capture & share all the steps taken when building your ML/AI model…
Seed? Hyperparameters? Training Split? Features? Precision? Recall? Time of Day?
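One way to picture "capturing the steps" is to write the seed, hyperparameters, split and features into a single record next to the results. The sketch below is only an illustration: the function names `record_run` and `split_indices`, the parameter values and the feature names are all hypothetical and do not come from the talk.

```python
import json
import random
import time


def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices with a fixed seed so the split can be repeated exactly."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


def record_run(seed, hyperparameters, train_fraction, features, metrics):
    """Bundle what is needed to repeat a run into one JSON-serialisable dict."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "seed": seed,
        "hyperparameters": hyperparameters,
        "train_fraction": train_fraction,
        "features": sorted(features),
        "metrics": metrics,
    }


train_idx, test_idx = split_indices(n_rows=10, train_fraction=0.8, seed=42)
run = record_run(
    seed=42,
    hyperparameters={"learning_rate": 0.01, "max_depth": 3},  # hypothetical values
    train_fraction=0.8,
    features=["snp_count", "age"],  # hypothetical feature names
    metrics={"precision": None, "recall": None},  # to be filled in after evaluation
)
print(json.dumps(run, indent=2))
```

Saving a record like this with every result means the split and settings can be regenerated later instead of remembered.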
And failing to capture the state of your
Data at each iteration of the analysis...
The wider community is aware of this
There are solutions
We are getting better at it
Now over to Luke...
(No, not that one)
Who am I?
Luke Marsden
Founder & CEO of dotmesh
⇢ Hacker & entrepreneur
⇢ Developed first storage system & volume plugin system for Docker
⇢ Kubernetes SIG lead
⇢ Formerly Computer Science @ Oxford
So you want to do reproducible data
science/AI/ML?
What do you need to pin down?
Environment
Code (including parameters)
Data
How?
Pinning down environment
⇢ In the DevOps world, Docker has been a big hit.
⇢ Docker helps you pin down the execution environment that your model training (or other data work) is happening in.
⇢ What is Docker?
What is Docker?
⇢ Like tiny frozen, runnable copies of your computer's filesystem, e.g. Python libraries, Python versions
⇢ You can determine the exact version of all the dependencies of your data science code
⇢ You can build, ship & run exactly the same thing anywhere… your laptop, a cluster, or the cloud
⇢ A Dockerfile lets you declare what versions of things you want; build a Docker image from the Dockerfile and push it to a registry
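To make "the exact version of all the dependencies" concrete, here is a minimal Python sketch that lists what is installed in the current interpreter. It is not Docker and not part of the talk: it only records the Python side of the environment, using the standard library's `importlib.metadata`, and the function names `environment_snapshot` and `as_requirements` are made up for this example.

```python
import platform
from importlib import metadata


def environment_snapshot():
    """Return the interpreter version, platform and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def as_requirements(snapshot):
    """Render the snapshot as pinned 'name==version' lines."""
    return [f"{name}=={version}" for name, version in snapshot["packages"].items()]


snapshot = environment_snapshot()
print(snapshot["python"])
for line in as_requirements(snapshot)[:5]:
    print(line)
```

A list like this could feed the pinned versions declared in a Dockerfile; the image then also freezes the operating system and system libraries, which this sketch does not see.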
Pinning down code
⇢ For decades developers have been version controlling their code.
⇢ Tools like git are very popular.
What is git?
A version control system. Lets you track versions of your code and collaborate with others by commit, clone, push, pull…
Problems:
⇢ git looks kinda scary - but it's worth persisting
⇢ In data science, it's not natural to commit every time you change anything, e.g. while tuning parameters...
⇢ ...but you generate results while you're iterating
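Because results get generated between commits, one small habit is to stamp every result with the commit it came from and whether the working tree had uncommitted changes. The sketch below assumes the `git` command line tool is installed and uses only `git rev-parse HEAD` and `git status --porcelain`; the helper names and the example error rate are hypothetical.

```python
import subprocess


def _git(args, repo_dir):
    """Run a git command and return its output, or None if git or the repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def current_commit(repo_dir="."):
    """Return the checked-out commit hash, or None if it cannot be determined."""
    return _git(["rev-parse", "HEAD"], repo_dir)


def has_uncommitted_changes(repo_dir="."):
    """Return True if files differ from the last commit, None if unknown."""
    status = _git(["status", "--porcelain"], repo_dir)
    return None if status is None else bool(status)


def tag_result(result, repo_dir="."):
    """Attach the code version to a result so it can be traced back later."""
    tagged = dict(result)
    tagged["git_commit"] = current_commit(repo_dir)
    tagged["git_dirty"] = has_uncommitted_changes(repo_dir)
    return tagged


print(tag_result({"error_rate": 0.3}))  # 0.3 is a made-up number
```

A result tagged with `git_dirty` set to True is a warning that the commit alone will not reproduce it.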
Pinning down data
⇢ Method one: be very very organised (meticulous folder structure)
+ Never overwrite files… backup frequently… and get your whole team to do the same
⇢ Method two: use versioned S3 buckets
What is S3?
A scalable object store on Amazon Web Services. Store lots of data quite cheaply. Version your objects (files) so that you can solve the problem of data changing "under your feet".
Problems:
⇢ When you run an experiment, it's not natural to note down all the object versions
⇢ You generally care about the version of the whole bucket, not every single individual object (but S3 has no such notion)
⇢ You could build a system to track this, but you've got more important science to be doing...
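The "version of the whole bucket" idea can be approximated by hashing the full list of object versions into one identifier. This is a rough sketch of that idea only: it does not call S3, and the object keys and version ids below are invented placeholders for whatever a versioned bucket listing would report.

```python
import hashlib
import json


def bucket_fingerprint(object_versions):
    """Collapse a {object key: version id} mapping into one short, repeatable id."""
    canonical = json.dumps(object_versions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical keys and version ids for two states of the same bucket.
before = {"raw/cohort.csv": "v-001", "raw/genotypes.vcf": "v-007"}
after = {"raw/cohort.csv": "v-002", "raw/genotypes.vcf": "v-007"}

print(bucket_fingerprint(before))
print(bucket_fingerprint(after))
```

Any change to any object changes the fingerprint, so noting one short string per experiment stands in for noting every object version by hand.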
So you want to track provenance in data
science/AI/ML?
What do you need to pin down?
[Diagram: Data A, Data B, Data C, Code 1, Code 2, Model 1 and Model 2, linked by Input and Output arrows]
Pinning down data provenance
If you can record the graph, you can point to any artefact/model and ask "show me exactly where this came from"... the exact version of the tool which generated it, what input data that tool used. And the transitive closure thereof.
Possible tools:
⇢ CWL, Pachyderm
⇢ Require you to define the data pipeline up front
Problems:
⇢ You don't always know the data pipeline up front
⇢ Often you're figuring it out as you go along, and it's evolving...
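Recording the graph and asking for the transitive closure can be sketched in a few lines. The class below is a toy illustration, not how CWL, Pachyderm or dotscience work; the wiring between the artefacts and the version suffixes such as `@abc123` are assumptions made for the example.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Records which inputs and which code version produced each artefact."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, output, code, inputs):
        """Note that `code` read `inputs` and wrote `output`."""
        self._parents[output].add(code)
        self._parents[output].update(inputs)

    def lineage(self, artefact):
        """Return everything `artefact` depends on, directly or transitively."""
        seen = set()
        stack = [artefact]
        while stack:
            node = stack.pop()
            for parent in self._parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = ProvenanceGraph()
# Assumed wiring: Code 1 turns Data A and Data B into Data C; Code 2 trains Model 1 on Data C.
graph.record("Data C", code="Code 1@abc123", inputs=["Data A", "Data B"])
graph.record("Model 1", code="Code 2@def456", inputs=["Data C"])
print(sorted(graph.lineage("Model 1")))
```

Because each step is recorded as it runs, the graph grows with the work and does not need the pipeline to be declared up front.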
One more problem
⇢ Sad fact. People don't care about reproducibility as much as their day to day work (getting a paper published, shipping an optimised model to production, …)
⇢ Can we introduce reproducibility & provenance to people while also helping them get work done faster and more accurately? And collaborate better with their team?
The sweetener - track summary stats
We asked dozens of data scientists to describe their workflows and their pain points. One problem stood out…
⇢ How do you track the progress/performance/results of your models? Your data science team?
⇢ Answers ranged from “in a Google spreadsheet” to “in a text file”, “on a piece of paper” or even “verbally”!
⇢ Ideally, integrate summary stats tracking into a solution...
If only this was all a bit easier...
The reason we're running this event today
Introducing dotscience
Solves reproducibility:
⇢ Tracks environment with Docker
⇢ Tracks data in versioned S3 buckets + dotmesh filesystem
⇢ Tracks code versions which generate summary stats in dotmesh + diff against git
⇢ Integrates with Jupyter (RStudio & scripts coming soon)
[Diagram: Environment, Code, Data]
Introducing dotscience
Solves provenance:
⇢ Builds the provenance graph on the fly
⇢ For any dataset, see what code generated it as the output of which other code, transitively
⇢ For any model, see exactly what code generated it, and what data that model was trained on
[Diagram: Data C, Code 2, Model 1 and Model 2, linked by Input and Output arrows]
Introducing dotscience
Solves summary stats tracking:
⇢ Builds a table and chart of every run. Snapshots and keeps together:
+ versioned dataset
+ versioned model
+ all model parameters
+ compute environment
⇢ See performance not just of your work over time, but your whole team.
Who      When           Parameters       Error rate
Alice    2 minutes ago  filter_snps=150  60%
Bob      2 hours ago    filter_snps=200  30%
Charlie  12 hours ago   filter_snps=100  50%
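A table like the one above can be mimicked with a small in-memory log, which may help show what "summary stats tracking" means in practice. This is a hedged sketch and not the dotscience API: the `Run` and `RunLog` classes are invented here, and only the names, parameters and error rates are taken from the example table.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    who: str
    parameters: dict
    error_rate: float


@dataclass
class RunLog:
    runs: list = field(default_factory=list)

    def add(self, who, parameters, error_rate):
        self.runs.append(Run(who, parameters, error_rate))

    def best(self):
        """Return the run with the lowest error rate."""
        return min(self.runs, key=lambda run: run.error_rate)

    def table(self):
        """Render the log as aligned plain-text rows, best run first."""
        rows = sorted(self.runs, key=lambda run: run.error_rate)
        lines = [f"{'Who':<10}{'Parameters':<20}{'Error rate':>10}"]
        for run in rows:
            params = ", ".join(f"{k}={v}" for k, v in run.parameters.items())
            lines.append(f"{run.who:<10}{params:<20}{run.error_rate:>10.0%}")
        return "\n".join(lines)


log = RunLog()
# Values copied from the example table.
log.add("Alice", {"filter_snps": 150}, 0.60)
log.add("Bob", {"filter_snps": 200}, 0.30)
log.add("Charlie", {"filter_snps": 100}, 0.50)
print(log.table())
print("Best so far:", log.best().who)
```

Keeping parameters and metrics in one structure is what lets a whole team compare runs instead of trading numbers in spreadsheets or on paper.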
Live demo time!
You can try this yourself this afternoon!
Roadmap for dotscience
⇢ Cloud Storage
⇢ R & RStudio, scripts, 'ds run' CLI support
⇢ Cluster support - Kubernetes
⇢ Spark/HDFS, MLlib
⇢ Slice & dice
⇢ Collaboration
⇢ Search and discovery
⇢ Multi-tenant execution, 1-click cluster installer, local installers
We need
your help!
Thanks, questions?
beta.dotscience.io
slack.dotscience.io
@lmarsden
@s_j_newhouse

RAPIDS 2018 - Keynote - How I learned to stop worrying and love version control

  • 1. Welcome & Today’s Schedule Registration and breakfast How I learned to stop worrying and love version control - Dr Stephen J Newhouse and Luke Marsden Effective computing for research reproducibility - Dr Laura Fortunato Morning Break A crazy little thing called reproducible science - Dr Tania Allard Machine Learning in Production - A practical approach to continuous deployment of Machine Learning pipelines - Luca Palmieri & Christos Dimitroulas Lunch Version Control for your Model, Data and Environment - Workshop Networking drinks 09:00 10:00 13:00 14:00 - 17:30 18:00 10:40 11:20 11:40 12:20 #rapids2018 @getdotmesh
  • 4. How I learned to stop worrying and love version control Steve Newhouse & Luke Marsden
  • 5. Who am I? Dr. Stephen J Newhouse Lead Data Scientist & Senior Bioinformatician ⇢ KCL Department of Biostatistics and Health Informatics ⇢ NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust ⇢ UCL Institute of Health Informatics & Health Data Research (HDR) UK
  • 6. Our (broad) interests... ⇢ From Bench (to Computer) to Bedside… ⇢ Collaborative & Open Research… ⇢ Data & knowledge sharing… ⇢ DevOps applied to Health Care/Research ⇢ Personalised Medicine & the Quantified Self
  • 8. “Provenance is the Missing Feature for Rigorous Data Science” Joe Doliner, Co-Founder, CEO of Pachyderm
  • 9. “The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms” Hazy: Making it Easier to Build and Maintain Big-data Analytics
  • 10. “The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms In A Fully Reproducible Way And Under Complete Provenance” Steve Newhouse, RAPIDS 2018
  • 11. Provenance & Reproducibility ⇢ Provenance: “Origin of something…” ⇢ Track & Document the data source and the “models” ⇢ Who, What, Where, Why, When and How at every stage of ETL, EDA, ML, Reporting ⇢ Captures dependency between data sets: Enables reproducibility ⇢ Results can be traced back to their origins and recomputed from scratch: + Good for REPRODUCIBILITY + Good Practice (should be BEST PRACTICE) + Good for GDPR!
  • 12. Why RAPIDS is important to me
  • 13.
  • 14. HDR UK: A new national Institute for health data science ⇢ Structured and unstructured (e.g. imaging, text) data for derivation of new or deep phenotypes ⇢ Adding value at scale to existing world-leading cohorts in the UK ⇢ Demonstrating system-wide opportunities for research that improves quality of care ⇢ Enable large scale, high-throughput research that combines genomic data with electronic health records ⇢ Genomics, epigenomics, statistical and complex genetics, population genetics, cancer ‘omics’, molecular epidemiology Actionable Health Data Analytics Precision Medicine
  • 15. HDR UK: A new national Institute for health data science ⇢ Transform Phase II - Phase IV clinical trials including ‘real world evidence’ studies ⇢ Towards prevention & early intervention ⇢ Ability to link health and administrative datasets across multiple environments ⇢ New technologies, from sensors to wearable devices to artificial intelligence 21st Century Trial Design Modernising Public Health Training Future Leaders in health data science
  • 16. Large-scale machine learning & mixed ‘omic analysis strategies for patient care...
  • 17. Investigator Monitor Laboratories Data Manager Analyst Clinician Results Clinical Data It’s All About The Data and #datasaveslives
  • 18. How did I learn to stop worrying and love version control?
  • 19. I made the transition from Lab tech to Bioinformatician
  • 20. The things you discover... new ways of working that just made sense
  • 21. And the crap we do/did that contributed to… The Reproducibility Crisis
  • 22. My new world ⇢ Git, GitHub & READMEs... ⇢ R & Python Notebooks... ⇢ Meta-Data, Ontologies... ⇢ Shared Code… ⇢ Open Science, Open Data… ⇢ FAIR Principles… ⇢ Community!
  • 23. My old world ⇢ I started in the lab...I was a Molecular Biologist/Geneticist ⇢ Provenance & Reproducibility == The Lab book, Publications… ⇢ Data stored on local Internal/External and University Drives/HPC
  • 24. Lab books: RAPIDS the old way ⇢ Record Everything ⇢ Name, Version, Project, Date. ⇢ Materials & Methods.. ⇢ Signed off (sometimes) ⇢ Double Checked (sometimes) ⇢ Varying level of Detail: one liners to prose... We (Lab folks) are kind of doing it anyway...with lab books
  • 25. How we (Basic Academia) often do Version Control ⇢ Formal Version Control & Data Provenance? Nope, not really ⇢ Documentation/Report: Excel, Word, Powerpoint & Images...cut and paste into lab book...heavily edited for publication... ⇢ Analysis: GUIs/SPSS, barplots in Excel...never record steps taken or software versions until publication... ⇢ Data: Local HDD, HPC, Dropbox...often only location is recorded… ⇢ Document & Data versions can often be overwritten or lost and then there is this...
  • 27. There is little to no formal Data/Model provenance & Version Control: The Story We Tell Raw Data v1 v2 Finalv3 Publication 1 Publication 2 Publication 3
  • 28. There is little to no formal Data/Model provenance & Version Control: The Truth... Raw Data my_data v1.x x v2 Final xxx zzz Final Final v2xy Publication 1 Publication 2 Publication 3
  • 29. “In Academia, a lot of folks* don’t do Version Control & Provenance. If they do, it’s haphazard and under duress** It is not standard operating procedure!” *Academics/basic researchers: Statisticians, Economists, Bio/health-informaticians, Biologists and clinicians who can do R/Python/Stata/SAS/SPSS. **Extra work needed in keeping lab books, documenting everything, cleaning code, sharing code...not used to this way of working
  • 31. Culture Lack of awareness Lack of education
  • 32. It is not enforced in many labs Incentives are not aligned with RAPIDS Pressure to Publish Quickly
  • 33. “[I was] completely unaware of robust solutions & common best practices from the Software/DevOps World… So were my supervisors!”
  • 34. The term Provenance was never mentioned Replication/Reproducibility were just catch-phrases Plus, this was all supposed to be captured in our Lab books… and then in the publication? Right?....
  • 36. I may be a bit cynical but...
  • 37. Some sobering reading The public are on to us, the ‘shoddy’ scientists ⇢ “Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis” / The Economist ⇢ How science goes wrong / The Economist ⇢ Trouble at the lab / The Economist ⇢ Is It Tough Love Time For Science? / The Big Think ⇢ Some of this is purely down to bad data management (& bad practices & lack of awareness & lack of education & so on and so forth…)
  • 38. Garbage in, Garbage out (GIGO) Bad/NO Data Management/Experimental Design/Analysis Plan ⇢ Spurious results/False positives and negatives ⇢ Translational research suffers ⇢ The patients suffer ⇢ Lies are published ⇢ Time & Money wasted (Charity, Public, Private…) ⇢ There is no real progress ⇢ Serious Legal & Ethical implications: GDPR!
  • 39. The Reproducibility Crisis: It's not just the fields of psychology & medicine...
  • 40. Matthew Hutson said... “Artificial intelligence faces reproducibility crisis” Matthew Hutson, Science 2018
  • 41. AI faces a reproducibility crisis ⇢ “I think people outside the field might assume that because we have code, reproducibility is kind of guaranteed,” …. “Far from it.” ⇢ The most basic problem is that researchers often don’t share their source code (and their Data) ⇢ “The exact way that you run your experiments is full of undocumented assumptions and decisions,”....“A lot of this detail never makes it into papers.” ⇢ “No time to document every hyperparameter”...
  • 42. Some common issues ⇢ Common misconception: Only CS/Software Devs need to do it ⇢ Lack of awareness from the Top-down & bottom-up: the lab lead/PI does not know about GIT and/or has not signed up to OPEN SCIENCE ⇢ Personalities/Culture/Environment: Why should I share? Data Hoarding… ⇢ Fear of being judged, Fear of the unknown, Fear of the command line ⇢ Laziness? - Adding extra steps to their workflow - “I have to do what now???”
  • 43. There is a need for Reproducibility and Provenance in Data Science
  • 44. There is a need for Reproducibility and Provenance in Everything
  • 45. “One of the largest sources of error in [Data] Science results from computing [and publishing] results from different versions of the same data set.”
  • 46. And using different versions of the same software…
  • 47. And using different versions/implementations of the “same” algorithm
  • 48. And failing to capture & share all the steps taken when building your ML/AI model… Seed? Hyperparameters? Training Split? Features? Precision? Recall? Time of Day?
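The items listed on slide 48 can be captured as one record per training run. The following is a minimal sketch rather than anything shown in the talk: the function name `make_run_record`, the field names and all example values are assumptions made for illustration.

```python
import json
import platform
import random
from datetime import datetime, timezone


def make_run_record(seed, hyperparameters, train_fraction, features, metrics):
    """Bundle the details of one training run into a plain dictionary."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # "time of day"
        "python_version": platform.python_version(),
        "seed": seed,
        "hyperparameters": hyperparameters,
        "train_fraction": train_fraction,
        "features": sorted(features),
        "metrics": metrics,
    }


# Hypothetical values, for illustration only.
seed = 42
random.seed(seed)  # fix the seed so shuffles and splits can be repeated
record = make_run_record(
    seed=seed,
    hyperparameters={"learning_rate": 0.01, "max_depth": 4},
    train_fraction=0.8,
    features=["bmi", "age"],
    metrics={"precision": None, "recall": None},  # fill in after evaluation
)
record_json = json.dumps(record, indent=2, sort_keys=True)
print(record_json)
```

Writing `record_json` to a file next to the model artefact is one simple way to keep the run details and the result together.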
  • 49. And failing to capture the state of your Data at each iteration of the analysis...
  • 50. The wider community is aware of this There are solutions We are getting better at it
  • 51. Now over to Luke... (No, not that one)
  • 52. Who am I? Luke Marsden Founder & CEO of dotmesh ⇢ Hacker & entrepreneur ⇢ Developed first storage system & volume plugin system for Docker ⇢ Kubernetes SIG lead ⇢ Formerly Computer Science @ Oxford
  • 53. So you want to do reproducible data science/AI/ML? What do you need to pin down?
  • 54. So you want to do reproducible data science/AI/ML? Environment
  • 55. So you want to do reproducible data science/AI/ML? Environment Code Including parameters
  • 56. So you want to do reproducible data science/AI/ML? Environment Code Including parameters Data
  • 58. Pinning down environment ⇢ In the DevOps world, Docker has been a big hit. ⇢ Docker helps you pin down the execution environment that your model training (or other data work) is happening in. ⇢ What is Docker?
  • 59. What is Docker? ⇢ Docker images are like tiny frozen, runnable copies of your computer's filesystem - e.g. Python libraries, Python versions ⇢ You can determine the exact version of all the dependencies of your data science code ⇢ You can build, ship & run exactly the same thing anywhere… your laptop, a cluster, or the cloud ⇢ A Dockerfile lets you declare what versions of things you want; build a Docker image from a Dockerfile and push it to a registry
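Docker freezes the whole environment. As a lighter complement, the versions in use can at least be recorded from inside Python. This is a sketch under that assumption, not a replacement for an image, and the function name `capture_environment` is hypothetical.

```python
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return the interpreter version and the version of every installed package."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip broken installs with missing metadata
            packages[name] = version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()
print(env["python"], len(env["packages"]), "packages recorded")
```

This only records what was installed; it does not let anyone rebuild the environment, which is the gap a Docker image fills.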
  • 60. Pinning down code ⇢ For decades developers have been version controlling their code. ⇢ Tools like git are very popular.
  • 61. What is git? A version control system. Lets you track versions of your code and collaborate with others by commit, clone, push, pull… Problems: ⇢ git looks kinda scary - but it's worth persisting ⇢ In data science, it's not natural to commit every time you change anything, e.g. while tuning parameters... ⇢ ...but you generate results while you're iterating
  • 62. Pinning down data ⇢ Method one: be very very organised (meticulous folder structure) + Never overwrite files… backup frequently… and get your whole team to do the same ⇢ Method two: use versioned S3 buckets
  • 63. What is S3? A scalable object store on Amazon Web Services. Store lots of data quite cheaply. Version your objects (files) so that you can solve the problem of data changing "under your feet". Problems: ⇢ When you run an experiment, not natural to note down all the object versions ⇢ You generally care about the version of the whole bucket, not every single individual object (but S3 has no such notion) ⇢ You could build a system to track this, but you've got more important science to be doing...
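The idea of a single version for a whole dataset, rather than one version per object, can be illustrated on a local directory. The sketch below hashes every file and folds the results into one digest. It is an assumption-laden illustration (local files, SHA-256, hypothetical function names), not how S3 or dotmesh work internally.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 of one file, read in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def dataset_version(root):
    """One digest for a directory: it changes if any file is added, removed, renamed or edited."""
    root = Path(root)
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(file_digest(path).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


# Demonstration on a throwaway directory with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "samples.csv").write_text("id,value\n1,0.5\n")
    before = dataset_version(tmp)
    (Path(tmp) / "samples.csv").write_text("id,value\n1,0.7\n")
    after = dataset_version(tmp)

print(before != after)  # the version changes when the data changes
```

Recording this one digest alongside each experiment is enough to tell later whether the data moved "under your feet".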
  • 64. So you want to track provenance in data science/AI/ML? What do you need to pin down?
  • 65. So you want to track provenance in data science/AI/ML? Data A Data B Data C Code 1 Code 2 Model 1 Model 2 Input Input Output Output Output Input
  • 66. Pinning down data provenance If you can record the graph, you can point to any artefact/model and ask "show me exactly where this came from"... the exact version of the tool which generated it, what input data that tool used. And the transitive closure thereof. Possible tools: ⇢ CWL, Pachyderm ⇢ Require you to define the data pipeline up front Problems: ⇢ You don't always know the data pipeline up front ⇢ Often you're figuring it out as you go along, and it's evolving...
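The "transitive closure" on slide 66 can be made concrete with a small graph walk. In this sketch the artefact and code names are borrowed from the diagram on slide 65, but the wiring between them is assumed, and `lineage` is a hypothetical helper rather than part of CWL, Pachyderm or any other tool.

```python
# Each artefact maps to the code that produced it and the inputs that code read.
# Names follow the slide's diagram; the edges between them are assumed.
PROVENANCE = {
    "Data B": {"code": "Code 1", "inputs": ["Data A"]},
    "Model 1": {"code": "Code 1", "inputs": ["Data A"]},
    "Data C": {"code": "Code 2", "inputs": ["Data B"]},
    "Model 2": {"code": "Code 2", "inputs": ["Data B"]},
}


def lineage(artefact, graph):
    """Return every code step and input the artefact transitively depends on."""
    seen = set()
    stack = [artefact]
    while stack:
        node = stack.pop()
        step = graph.get(node)
        if step is None:  # raw input or code: nothing recorded upstream
            continue
        for parent in [step["code"], *step["inputs"]]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(lineage("Model 2", PROVENANCE)))
# ['Code 1', 'Code 2', 'Data A', 'Data B']
```

Asking for the lineage of `Model 2` returns both code steps and both upstream datasets, which is the "show me exactly where this came from" question from the slide.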
  • 67. One more problem ⇢ Sad fact. People don't care about reproducibility as much as their day to day work (getting a paper published, shipping an optimised model to production, …) ⇢ Can we introduce reproducibility & provenance to people while also helping them get work done faster and more accurately? And collaborate better with their team?
  • 68. The sweetener - track summary stats We asked dozens of data scientists to describe their workflows and their pain points. One problem stood out… ⇢ How do you track the progress/performance/results of your models? Your data science team? ⇢ Answers ranged from “in a google spreadsheet” to “in text file”, “on a piece of paper” or even “verbally”! ⇢ Ideally, integrate summary stats tracking into a solution...
  • 69. If only this was all a bit easier...
  • 70. The reason we're running this event today
  • 71. Introducing dotscience Solves reproducibility (Environment, Code, Data): ⇢ Tracks environment with Docker ⇢ Tracks data in versioned S3 buckets + dotmesh filesystem ⇢ Tracks code versions which generate summary stats in dotmesh + diff against git ⇢ Integrates with Jupyter (RStudio & scripts coming soon)
  • 72. Introducing dotscience Solves provenance: ⇢ Builds the provenance graph on the fly ⇢ For any dataset, see what code generated it as the output of which other code, transitively ⇢ For any model, see exactly what code generated it, and what data that model was trained on Diagram labels: Data C, Code 2, Model 1, Model 2, Input, Output, Output
  • 73. Introducing dotscience Solves summary stats tracking: ⇢ Builds a table and chart of every run. Snapshots and keeps together: + versioned dataset + versioned model + all model parameters + compute environment ⇢ See performance not just of your work over time, but your whole team.
      Who      When           Parameters       Error rate
      Alice    2 minutes ago  filter_snps=150  60%
      Bob      2 hours ago    filter_snps=200  30%
      Charlie  12 hours ago   filter_snps=100  50%
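A run table like the one on slide 73 is easy to keep in code once each run is a structured record. The sketch below uses the three rows from the slide; the `Run` class and its field names are hypothetical and are not the dotscience API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    who: str
    when: str
    parameters: dict
    error_rate: float


# The three rows shown on the slide.
runs = [
    Run("Alice", "2 minutes ago", {"filter_snps": 150}, 0.60),
    Run("Bob", "2 hours ago", {"filter_snps": 200}, 0.30),
    Run("Charlie", "12 hours ago", {"filter_snps": 100}, 0.50),
]

ranked = sorted(runs, key=lambda r: r.error_rate)  # best run first
best = ranked[0]
for r in ranked:
    params = ", ".join(f"{k}={v}" for k, v in r.parameters.items())
    print(f"{r.who:<8} {r.when:<14} {params:<16} {r.error_rate:.0%}")
print("lowest error rate:", best.who)
```

Sorting by error rate makes the comparison across the team immediate, which a spreadsheet, text file or piece of paper does not.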
  • 75. You can try this yourself this afternoon!
  • 76. Roadmap for dotscience ⇢ Cloud Storage ⇢ R & RStudio, scripts, 'ds run' CLI support ⇢ Cluster support - Kubernetes ⇢ Spark/HDFS, MLlib ⇢ Slice & dice ⇢ Collaboration ⇢ Search and discovery ⇢ Multi-tenant execution, 1-click cluster installer, local installers