RAPIDS 2018 - Keynote - How I learned to stop worrying and love version control

Welcome & Today’s Schedule
09:00  Registration and breakfast
10:00  How I learned to stop worrying and love version control – Dr Stephen J Newhouse and Luke Marsden
10:40  Effective computing for research reproducibility – Dr Laura Fortunato
11:20  Morning Break
11:40  A crazy little thing called reproducible science – Dr Tania Allard
12:20  Machine Learning in Production - A practical approach to continuous deployment of Machine Learning pipelines – Luca Palmieri & Christos Dimitroulas
13:00  Lunch
14:00 – 17:30  Version Control for your Model, Data and Environment – Workshop
18:00  Networking drinks
#rapids2018
@getdotmesh

Thank you!
Please tweet!

How I learned to stop worrying
and love version control
Steve Newhouse & Luke Marsden

Who am I?
Dr. Stephen J Newhouse
Lead Data Scientist & Senior Bioinformatician
⇢ KCL Department of Biostatistics and Health Informatics
⇢ NIHR Biomedical Research Centre at South London
and Maudsley NHS Foundation Trust
⇢ UCL Institute of Health Informatics & Health Data
Research (HDR) UK

Our (broad) interests...
⇢ From Bench (to Computer) to Bedside…
⇢ Collaborative & Open Research…
⇢ Data & knowledge sharing…
⇢ DevOps applied to Health Care/Research
⇢ Personalised Medicine & the Quantified Self

Provenance & Reproducibility

“Provenance is the Missing Feature
for Rigorous Data Science”
Joe Doliner
Co-Founder, CEO of Pachyderm

“The next breakthrough in data analysis may not be in
individual algorithms, but in the ability to rapidly
combine, deploy, and maintain existing algorithms”
Hazy: Making it Easier to Build and Maintain Big-data Analytics

“The next breakthrough in data analysis may not be in
individual algorithms, but in the ability to rapidly
combine, deploy, and maintain existing algorithms
In A Fully Reproducible Way And Under
Complete Provenance”
Steve Newhouse, RAPIDS 2018

Provenance & Reproducibility
⇢ Provenance: “Origin of something…”
⇢ Track & Document the data source and the “models”
⇢ Who, What, Where, Why, When and How at every stage of ETL, EDA, ML, Reporting
⇢ Captures dependency between data sets: Enables reproducibility
⇢ Results can be traced back to their origins and recomputed from scratch:
+ Good for REPRODUCIBILITY
+ Good Practice (should be BEST PRACTICE)
+ Good for GDPR!
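The "Who, What, Where, Why, When and How" above could be written down as one small record per processing step. This is a minimal sketch, assuming a plain-Python workflow; the function name `provenance_record`, the field values and the example file path are all hypothetical.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def provenance_record(who, what, where, why, how, data: bytes) -> dict:
    """Describe one processing step and fingerprint the data it produced."""
    return {
        "who": who,
        "what": what,
        "where": where,
        "why": why,
        "when": datetime.now(timezone.utc).isoformat(),
        "how": how,
        # The hash lets anyone check later that they hold the same bytes.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical example values, for illustration only.
record = provenance_record(
    who="analyst@example.org",
    what="filtered SNP table",
    where="raw/cohort.csv",
    why="remove low-quality SNPs before modelling",
    how=f"Python {platform.python_version()}",
    data=b"id,snp\n1,A\n2,G\n",
)
print(json.dumps(record, indent=2))
```

Appending one such record per step (ETL, EDA, ML, Reporting) gives a trail that results can be traced back along.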

Why RAPIDS is important to me

HDR UK: A new national Institute for
health data science
⇢ Structured and unstructured (e.g.
imaging, text) data for derivation of new
or deep phenotypes
⇢ Adding value at scale to existing
world-leading cohorts in the UK
⇢ Demonstrating system-wide
opportunities for research that
improves quality of care
⇢ Enable large scale, high-throughput
research that combines genomic data
with electronic health records
⇢ Genomics, epigenomics, statistical and
complex genetics, population genetics,
cancer ‘omics’, molecular epidemiology
Actionable Health Data Analytics
Precision Medicine

HDR UK: A new national Institute for
health data science
⇢ Transform Phase II – Phase IV clinical
trials including ‘real world evidence’
studies
⇢ Towards prevention & early
intervention
⇢ Ability to link health and administrative
datasets across multiple environments
⇢ New technologies, from sensors to
wearable devices to artificial
intelligence
21st Century Trial Design
Modernising Public Health
Training Future Leaders in health data science

Large-scale machine learning & mixed ‘omic
analysis strategies for patient care...

[Diagram labels: Investigator, Monitor, Laboratories, Data Manager, Analyst, Clinician, Results, Clinical Data]
It’s All About The Data
and #datasaveslives

How did I learn to stop worrying
and love version control?

I made the transition from Lab
tech to Bioinformatician

The things you
discover... new ways
of working that just
made sense

And the crap we
do/did that
contributed to…
The Reproducibility
Crisis

My new world
⇢ Git, GitHub & READMEs...
⇢ R & Python Notebooks...
⇢ Meta-Data, Ontologies...
⇢ Shared Code…
⇢ Open Science, Open Data…
⇢ FAIR Principles…
⇢ Community!

My old world
⇢ I started in the lab...I was a
Molecular Biologist/Geneticist
⇢ Provenance & Reproducibility ==
The Lab book, Publications…
⇢ Data stored on local
Internal/External and University
Drives/HPC

Lab books: RAPIDS the old way
⇢ Record Everything
⇢ Name, Version, Project, Date.
⇢ Materials & Methods...
⇢ Signed off (sometimes)
⇢ Double Checked (sometimes)
⇢ Varying levels of detail: one-liners
to prose...
We (Lab folks) are kind of doing it
anyway...with lab books

How we (Basic Academia) often do
Version Control
⇢ Formal Version Control & Data Provenance? Nope, not really
⇢ Documentation/Report: Excel, Word, Powerpoint & Images...cut and paste into lab
book...heavily edited for publication...
⇢ Analysis: GUIs/SPSS, barplots in Excel...never record steps taken or software versions
until publication...
⇢ Data: Local HDD, HPC, Dropbox...often only the location is recorded…
⇢ Document & Data versions can often be overwritten or lost and then there is this...

There is little to no formal Data/Model provenance
& Version Control: The Story We Tell
[Diagram: Raw Data, then versions v1, v2, v3 and Final, with Publication 1, Publication 2 and Publication 3]

There is little to no formal Data/Model provenance
& Version Control: The Truth...
[Diagram: Raw Data scattered across ad hoc copies labelled my_data, v1.xx, v2, Final, xxx, zzz, Final, Final and v2xy, with Publication 1, Publication 2 and Publication 3]

“In Academia, a lot of folks*
don’t do Version Control &
Provenance. If they do, it’s
haphazard and under duress**
It is not standard operating
procedure!”
*Academics/basic researchers: Statisticians, Economists, Bio/health-informaticians, Biologists and
clinicians who can do R/Python/Stata/SAS/SPSS.
**Extra work needed in keeping lab books, documenting
everything, cleaning code, sharing code...not used to this way of working

Culture
Lack of awareness
Lack of education

It is not enforced in many labs
Incentives are not aligned with RAPIDS
Pressure to Publish Quickly

“[I was] completely unaware of robust
solutions & common best practices
from the Software/DevOps World…
So were my supervisors!”

The term Provenance was
never mentioned
Replication/Reproducibility
were just catch-phrases
Plus, this was all supposed to be captured in our Lab
books… and then in the publication? Right?....

I may be a bit cynical but...

Some sobering reading
The public are on to us – the ‘shoddy’ scientists
⇢ “Too many of the findings that fill the academic ether are the result of shoddy
experiments or poor analysis” / The Economist
⇢ How science goes wrong / The Economist
⇢ Trouble at the lab / The Economist
⇢ Is It Tough Love Time For Science? / The Big Think
⇢ Some of this is purely down to bad data management (& bad practices & lack of
awareness & lack of education & so on and so forth…)

Garbage in, Garbage out (GIGO)
Bad/NO Data Management/Experimental Design/Analysis Plan
⇢ Spurious results/False positives and negatives
⇢ Translational research suffers
⇢ The patients suffer
⇢ Lies are published
⇢ Time & Money wasted (Charity, Public, Private…)
⇢ There is no real progress
⇢ Serious Legal & Ethical implications: GDPR!

The Reproducibility Crisis: It's not just the
fields of psychology & medicine...

Matthew Hutson said...
“Artificial intelligence faces reproducibility crisis” Matthew Hutson, Science 2018

AI faces a reproducibility crisis
⇢ “I think people outside the field might assume that because we have code, reproducibility
is kind of guaranteed,” ... “Far from it.”
⇢ The most basic problem is that researchers often don’t share their source code
(and their Data)
⇢ “The exact way that you run your experiments is full of undocumented assumptions and
decisions,” ... “A lot of this detail never makes it into papers.”
⇢ “No time to document every hyperparameter”...

Some common issues
⇢ Common misconception: Only CS/Software Devs need to do it
⇢ Lack of awareness from the top down & bottom up: the lab lead/PI does not know about
Git and/or has not signed up to Open Science
⇢ Personalities/Culture/Environment: Why should I share? Data Hoarding…
⇢ Fear of being judged, Fear of the unknown, Fear of the command line
⇢ Laziness? - Adding extra steps to their workflow - “I have to do what now???”

There is a need for Reproducibility and
Provenance in Data Science

There is a need for Reproducibility and
Provenance in Everything

“One of the largest sources of error in
[Data] Science results from computing
[and publishing] results from different
versions of the same data set.”

And using different versions of the
same software…

And using different versions/implementations
of the “same” algorithm

And failing to capture & share all the
steps taken when building your
ML/AI model…
Seed? Hyperparameters? Training Split? Features?
Precision? Recall? Time of Day?
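One way to stop those details getting lost is to write them down next to the results, and to derive the training split from the recorded seed. A minimal sketch, assuming a plain-Python workflow; the hyperparameter names, feature names and values are hypothetical placeholders.

```python
import json
import random


def split_indices(n_samples, test_fraction, seed):
    """Shuffle sample indices with a fixed seed and split them into train/test."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Everything needed to rerun the experiment lives in one dictionary.
run = {
    "seed": 42,
    "hyperparameters": {"learning_rate": 0.01, "max_depth": 4},
    "test_fraction": 0.2,
    "features": ["age", "snp_count"],
}

train_idx, test_idx = split_indices(100, run["test_fraction"], run["seed"])
run["n_train"] = len(train_idx)
run["n_test"] = len(test_idx)

# Serialise the record so it can be stored alongside the model and metrics.
run_log = json.dumps(run, sort_keys=True)
print(run_log)
```

Because the split is a function of the logged seed, anyone holding `run_log` can regenerate exactly the same train and test sets.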

And failing to capture the state of your
Data at each iteration of the analysis...

The wider community is aware of this
There are solutions
We are getting better at it

Now over to Luke...
(No, not that one)

Who am I?
Luke Marsden
Founder & CEO of dotmesh
⇢ Hacker & entrepreneur
⇢ Developed first storage system & volume plugin
system for Docker
⇢ Kubernetes SIG lead
⇢ Formerly Computer Science @ Oxford

So you want to do reproducible data
science/AI/ML?
What do you need to pin down?

So you want to do reproducible data
science/AI/ML?
⇢ Environment
⇢ Code (including parameters)
⇢ Data

Pinning down environment
⇢ In the DevOps world, Docker has been a big hit.
⇢ Docker helps you pin down the execution
environment that your model training (or other
data work) is happening in.
⇢ What is Docker?

What is Docker?
⇢ Docker images are like tiny, frozen, runnable copies of your
computer's filesystem - e.g. Python libraries,
Python versions
⇢ You can determine the exact version of all the
dependencies of your data science code
⇢ You can build, ship & run exactly the same thing
anywhere… your laptop, a cluster, or the cloud
⇢ A Dockerfile lets you declare what versions of
things you want; build a Docker image from the
Dockerfile and push it to a registry
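Docker freezes the whole filesystem; a lighter complement is to record what the running interpreter actually has installed, so the versions can be checked against the image later. This is a sketch using only the Python standard library, not part of Docker itself; the function name `environment_snapshot` is hypothetical.

```python
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Record the interpreter version and every installed distribution."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries that carry no package name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
print(snapshot["python"], len(snapshot["packages"]), "packages recorded")
```

Saving this snapshot with each result makes it possible to tell whether two runs really used the same library versions.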

Pinning down code
⇢ For decades developers have been version
controlling their code.
⇢ Tools like git are very popular.

What is git?
A version control system. Lets you track
versions of your code and collaborate with
others by commit, clone, push, pull…
Problems:
⇢ git looks kinda scary - but it's worth persisting
⇢ In data science, it's not natural to commit every
time you change anything, e.g. while tuning
parameters...
⇢ ...but you generate results while you're iterating
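Since results get generated between commits, one option is to stamp every result with the current commit and a flag saying whether the working tree had uncommitted changes. A minimal sketch, assuming the `git` command line tool is installed; the helper names are hypothetical, and both helpers return `None` when git or a repository is not available.

```python
import subprocess


def _git(args, repo_dir="."):
    """Run a git command and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


def current_commit(repo_dir="."):
    """Return the checked-out commit hash, or None outside a repository."""
    out = _git(["rev-parse", "HEAD"], repo_dir)
    return out.strip() if out is not None else None


def has_uncommitted_changes(repo_dir="."):
    """Return True if tracked or untracked files differ from the last commit."""
    out = _git(["status", "--porcelain"], repo_dir)
    return bool(out.strip()) if out is not None else None


stamp = {"commit": current_commit(), "dirty": has_uncommitted_changes()}
print(stamp)
```

A result stamped `dirty: True` is a warning that the committed code alone will not reproduce it.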

Pinning down data
⇢ Method one: be very very organised
(meticulous folder structure)
+ Never overwrite files… backup
frequently… and get your whole team
to do the same
⇢ Method two: use versioned S3 buckets
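Method one gets easier to police if the folder is fingerprinted: hash every file, keep the manifest, and any overwrite shows up as a changed hash. A minimal sketch using only the Python standard library; the function names and the example file `cohort.csv` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file's relative path to the SHA-256 of its contents."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


# Demonstration in a throwaway directory: overwrite a file, compare manifests.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "cohort.csv"
    data_file.write_text("id,snp\n1,A\n")
    before = build_manifest(tmp)
    data_file.write_text("id,snp\n1,G\n")
    after = build_manifest(tmp)

changed = [name for name in before if before[name] != after.get(name)]
print(changed)
```

Storing the manifest with each analysis run records which version of every file the results were computed from.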

What is S3?
A scalable object store on Amazon Web Services.
Store lots of data quite cheaply. Version your
objects (files) so that you can solve the problem
of data changing "under your feet".
Problems:
⇢ When you run an experiment, it's not natural to note
down all the object versions
⇢ You generally care about the version of the
whole bucket, not every single individual object
(but S3 has no such notion)
⇢ You could build a system to track this, but you've
got more important science to be doing...
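The missing "version of the whole bucket" can be approximated by hashing the full set of per-object versions into one identifier. This is a sketch of that idea only, not an S3 feature: the function name is hypothetical, and the keys and version IDs below are made-up placeholders standing in for whatever S3 reports for each object.

```python
import hashlib
import json


def bucket_snapshot_id(object_versions: dict) -> str:
    """Collapse a {object key: version id} mapping into a single identifier."""
    # Sort the pairs so the result does not depend on listing order.
    canonical = json.dumps(sorted(object_versions.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical listings of the same bucket on two different days.
versions_monday = {"raw/cohort.csv": "v1", "raw/labels.csv": "v1"}
versions_tuesday = {"raw/cohort.csv": "v2", "raw/labels.csv": "v1"}

monday_id = bucket_snapshot_id(versions_monday)
tuesday_id = bucket_snapshot_id(versions_tuesday)
print(monday_id != tuesday_id)
```

Noting one snapshot identifier per experiment is far less effort than noting every object version, though the mapping itself still has to be stored somewhere to restore the data.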

So you want to track provenance in data
science/AI/ML?
What do you need to pin down?

So you want to track provenance in data
science/AI/ML?
[Diagram: provenance graph connecting Data A, Data B, Data C, Code 1, Code 2, Model 1 and Model 2 through Input and Output edges]

Pinning down data provenance
If you can record the graph, you can point to any artefact/model and ask "show
me exactly where this came from"... the exact version of the tool which generated
it, what input data that tool used. And the transitive closure thereof.
Possible tools:
⇢ CWL, Pachyderm
⇢ Require you to define the data
pipeline up front
Problems:
⇢ You don't always know the data
pipeline up front
⇢ Often you're figuring it out as you go
along, and it's evolving...
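The "show me exactly where this came from" question is a graph traversal: follow the recorded inputs of an artefact, then the inputs of those inputs, until nothing new turns up. A minimal sketch in plain Python; the edges below are hypothetical, since the exact graph is not given here, and this is not how CWL or Pachyderm represent it.

```python
def lineage(artefact, produced_from):
    """Return every upstream artefact reachable from `artefact`."""
    seen = set()
    stack = [artefact]
    while stack:
        current = stack.pop()
        for parent in produced_from.get(current, ()):
            if parent not in seen:  # visit each ancestor once, even in diamonds
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical record: each artefact maps to the code and data that produced it.
produced_from = {
    "Model 1": ["Code 1", "Data A", "Data B"],
    "Data C": ["Code 1", "Data A"],
    "Model 2": ["Code 2", "Data C"],
}

print(sorted(lineage("Model 2", produced_from)))
```

Here Model 2 depends directly on Code 2 and Data C, and transitively on Code 1 and Data A, which is the transitive closure the slide refers to.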

One more problem
⇢ Sad fact. People don't care about reproducibility
as much as their day to day work (getting a paper
published, shipping an optimised model to
production, …)
⇢ Can we introduce reproducibility & provenance to
people while also helping them get work done
faster and more accurately? And collaborate
better with their team?

The sweetener - track summary stats
We asked dozens of data scientists to describe
their workflows and their pain points.
One problem stood out…
⇢ How do you track the
progress/performance/results of your models?
Your data science team?
⇢ Answers ranged from “in a Google
spreadsheet” to “in a text file”, “on a piece of
paper” or even “verbally”!
⇢ Ideally, integrate summary stats tracking
into a solution...

If only this was all a bit easier...

The reason we're running this event today

Introducing dotscience
Solves reproducibility:
⇢ Tracks environment with Docker
⇢ Tracks data in versioned S3 buckets +
dotmesh filesystem
⇢ Tracks code versions which generate
summary stats in dotmesh + diff against git
⇢ Integrates with Jupyter (RStudio & scripts
coming soon)
[Diagram labels: Environment, Code, Data]

Solves provenance:
⇢ Builds the provenance graph on the fly
⇢ For any dataset, see what code generated it
as the output of which other code,
transitively
⇢ For any model, see exactly what code
generated it, and what data that model was
trained on
[Diagram: Data C, Code 2, Model 1 and Model 2 linked by Input and Output edges]

Solves summary stats tracking:
⇢ Builds a table and chart of every run.
Snapshots and keeps together:
+ versioned dataset
+ versioned model
+ all model parameters
+ compute environment
⇢ See performance not just of your
work over time, but of your whole team.
Who      When           Parameters       Error rate
Alice    2 minutes ago  filter_snps=150  60%
Bob      2 hours ago    filter_snps=200  30%
Charlie  12 hours ago   filter_snps=100  50%
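The run table above is easy to mimic in a few lines, which shows why it beats a spreadsheet or a piece of paper. This is an illustrative sketch only, not the dotscience API; the `Run` class is hypothetical and the rows are the three from the table.

```python
from dataclasses import dataclass


@dataclass
class Run:
    who: str
    parameters: dict
    error_rate: float


# The three runs from the table, with error rates as fractions.
runs = [
    Run("Alice", {"filter_snps": 150}, 0.60),
    Run("Bob", {"filter_snps": 200}, 0.30),
    Run("Charlie", {"filter_snps": 100}, 0.50),
]

# Rank the whole team's runs, lowest error first.
leaderboard = sorted(runs, key=lambda run: run.error_rate)
best = leaderboard[0]

for run in leaderboard:
    print(f"{run.who:<8} {run.parameters} {run.error_rate:.0%}")
```

With the parameters stored next to each score, the best run (Bob's, with `filter_snps=200`) can be picked out and repeated rather than remembered.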

Live demo time!

You can try this yourself this afternoon!

Roadmap for dotscience
⇢ Cloud Storage
⇢ R & RStudio, scripts, 'ds run' CLI support
⇢ Cluster support - Kubernetes
⇢ Spark/HDFS, MLlib
⇢ Slice & dice
⇢ Collaboration
⇢ Search and discovery
⇢ Multi-tenant execution, 1-click cluster installer, local installers

We need
your help!

Thanks, questions?
beta.dotscience.io
slack.dotscience.io
@lmarsden
@s_j_newhouse
