RAPIDS 2018 - Keynote - How I learned to stop worrying and love version control

Welcome & Today’s Schedule
09:00          Registration and breakfast
10:00          How I learned to stop worrying and love version control – Dr Stephen J Newhouse and Luke Marsden
10:40          Effective computing for research reproducibility – Dr Laura Fortunato
11:20          Morning Break
11:40          A crazy little thing called reproducible science – Dr Tania Allard
12:20          Machine Learning in Production - A practical approach to continuous deployment of Machine Learning pipelines – Luca Palmieri & Christos Dimitroulas
13:00          Lunch
14:00 – 17:30  Version Control for your Model, Data and Environment – Workshop
18:00          Networking drinks
Thank you!
Please tweet!
#rapids2018
@getdotmesh
How I learned to stop worrying
and love version control
Steve Newhouse & Luke Marsden
Who am I?
Dr. Stephen J Newhouse
Lead Data Scientist & Senior Bioinformatician
⇢ KCL Department of Biostatistics and Health Informatics
⇢ NIHR Biomedical Research Centre at South London
and Maudsley NHS Foundation Trust
⇢ UCL Institute of Health Informatics & Health Data
Research (HDR) UK
Our (broad) interests...
⇢ From Bench (to Computer) to Bedside…
⇢ Collaborative & Open Research…
⇢ Data & knowledge sharing…
⇢ DevOps applied to Health Care/Research
⇢ Personalised Medicine & the Quantified Self
Provenance & Reproducibility
“Provenance is the Missing Feature for Rigorous Data Science”
Joe Doliner
Co-Founder, CEO of Pachyderm
“The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms”
Hazy: Making it Easier to Build and Maintain Big-data Analytics
“The next breakthrough in data analysis may not be in
individual algorithms, but in the ability to rapidly
combine, deploy, and maintain existing algorithms
In A Fully Reproducible Way And Under
Complete Provenance
Steve Newhouse, RAPIDS 2018
Provenance & Reproducibility
⇢ Provenance: “Origin of something…”
⇢ Track & Document the data source and the “models”
⇢ Who, What, Where, Why, When and How at every stage of ETL, EDA, ML, Reporting
⇢ Captures dependency between data sets: Enables reproducibility
⇢ Results can be traced back to their origins and recomputed from scratch:
+ Good for REPRODUCIBILITY
+ Good Practice (should be BEST PRACTICE)
+ Good for GDPR!
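To make the bullets above concrete, here is a minimal sketch (illustrative only, not any particular tool's schema) of the metadata a provenance record could capture for one step of an ETL/EDA/ML pipeline; all names and versions below are hypothetical:

```python
# Sketch of a provenance record for one pipeline step (hypothetical schema).
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    who: str       # person or service that ran the step
    what: str      # the step itself, e.g. "QC-filter genotype data"
    where: str     # execution environment, e.g. a Docker image tag
    why: str       # purpose, protocol or ticket reference
    how: str       # exact command / notebook and its version
    inputs: list   # versioned input data sets
    outputs: list  # versioned outputs or models
    when: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = ProvenanceRecord(
    who="s.newhouse",
    what="QC-filter genotype data",
    where="example/genomics-qc:1.2.0",                    # hypothetical image
    why="RAPIDS demo pipeline",
    how="python qc_filter.py --maf 0.01 (git:4f2a9c1)",   # hypothetical script/commit
    inputs=["raw_genotypes.vcf@v3"],
    outputs=["filtered_genotypes.vcf@v4"],
)
print(record)
```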
Why RAPIDS is important to me
HDR UK: A new national Institute for
health data science
⇢ Structured and unstructured (e.g.
imaging, text) data for derivation of new
or deep phenotypes
⇢ Adding value at scale to existing
world-leading cohorts in the UK
⇢ Demonstrating system-wide
opportunities for research that
improves quality of care
⇢ Enable large scale, high-throughput
research that combines genomic data
with electronic health records
⇢ Genomics, epigenomics, statistical and
complex genetics, population genetics,
cancer ‘omics’, molecular epidemiology
Themes: Actionable Health Data Analytics; Precision Medicine
HDR UK: A new national Institute for
health data science
⇢ Transform Phase II – Phase IV clinical
trials including ‘real world evidence’
studies
⇢ Towards prevention & early
intervention
⇢ Ability to link health and administrative
datasets across multiple environments
⇢ New technologies, from sensors to
wearable devices to artificial
intelligence
Themes: 21st Century Trial Design; Modernising Public Health; Training Future Leaders in Health Data Science
Large-scale machine learning & mixed ‘omic
analysis strategies for patient care...
[Diagram: clinical data flowing between Investigator, Monitor, Laboratories, Data Manager, Analyst and Clinician, from Clinical Data to Results]
It’s All About The Data and #datasaveslives
How did I learn to stop worrying
and love version control?
I made the transition from Lab
tech to Bioinformatician
The things you
discover... new ways
of working that just
made sense
And the crap we
do/did that
contributed to…
The Reproducibility
Crisis
My new world
⇢ Git, GitHub & READMEs...
⇢ R & Python Notebooks...
⇢ Meta-Data, Ontologies...
⇢ Shared Code…
⇢ Open Science, Open Data…
⇢ FAIR Principles…
⇢ Community!
My old world
⇢ I started in the lab...I was a
Molecular Biologist/Geneticist
⇢ Provenance & Reproducibility ==
The Lab book, Publications…
⇢ Data stored on local
Internal/External and University
Drives/HPC
Lab books: RAPIDS the old way
⇢ Record Everything
⇢ Name, Version, Project, Date.
⇢ Materials & Methods..
⇢ Signed off (sometimes)
⇢ Double Checked (sometimes)
⇢ Varying level of Detail: one liners
to prose...
We (Lab folks) are kind of doing it
anyway...with lab books
How we (Basic Academia) often do
Version Control
⇢ Formal Version Control & Data Provenance? Nope, not really
⇢ Documentation/Report: Excel, Word, Powerpoint & Images...cut and paste into lab
book...heavily edited for publication...
⇢ Analysis: GUIs/SPSS, barplots in Excel...never record steps taken or software versions
until publication...
⇢ Data: Local HDD, HPC, Dropbox...often only location is recorded…
⇢ Document & Data versions can often be overwritten or lost and then there is this...
There is little to no formal Data/Model provenance
& Version Control: The Story We Tell
[Diagram: Raw Data → v1 → v2 → v3 → Final, feeding Publication 1, Publication 2 and Publication 3]
There is little to no formal Data/Model provenance
& Version Control: The Truth...
[Diagram: Raw Data spawning a tangle of ad-hoc copies (my_data, v1.x, x, v2, Final, xxx, zzz, Final, Final, v2xy) feeding Publication 1, Publication 2 and Publication 3]
“In Academia, a lot of folks* don't do Version Control & Provenance. If they do, it's haphazard and under duress**
It is not standard operating procedure!
*Academics/basic researchers: Statisticians, Economists, Bio/health-informaticians, Biologists and clinicians who can do R/Python/Stata/SAS/SPSS. **Extra work is needed to keep lab books, document everything, clean code and share code...they are not used to this way of working
Why?
Culture
Lack of awareness
Lack of education
It is not enforced in many labs
Incentives are not aligned with RAPIDS
Pressure to Publish Quickly
“[I was] completely unaware of robust
solutions & common best practices
from the Software/DevOps World…
So were my supervisors!
The term Provenance was
never mentioned
Replication/Reproducibility
were just catch-phrases
Plus, this was all supposed to be captured in our Lab
books… and then in the publication? Right?....
I may be a bit cynical but...
Some sobering reading
The public are on to us – the ‘shoddy’ scientists
⇢ “Too many of the findings that fill the academic ether are the result of shoddy
experiments or poor analysis” / The Economist
⇢ How science goes wrong / The Economist
⇢ Trouble at the lab / The Economist
⇢ Is It Tough Love Time For Science? / The Big Think
⇢ Some of this is purely down to bad data management (& bad practices & lack of
awareness & lack of education & so on and so forth…)
Garbage in, Garbage out (GIGO)
Bad/NO Data Management/Experimental Design/Analysis Plan
⇢ Spurious results/False positives and negatives
⇢ Translational research suffers
⇢ The patients suffer
⇢ Lies are published
⇢ Time & Money wasted (Charity, Public, Private…)
⇢ There is no real progress
⇢ Serious Legal & Ethical implications: GDPR!
The Reproducibility Crisis: It's not just the
fields of psychology & medicine...
Matthew Hutson said...
“Artificial intelligence faces reproducibility crisis” Matthew Hutson, Science 2018
AI faces a reproducibility crisis
⇢ “I think people outside the field might assume that because we have code, reproducibility
is kind of guaranteed,” …. “Far from it.”
⇢ The most basic problem is that researchers often don’t share their source code
(and their Data)
⇢ “The exact way that you run your experiments is full of undocumented assumptions and
decisions,”....“A lot of this detail never makes it into papers.”
⇢ “No time to document every hyperparameter”...
Some common issues
⇢ Common misconception: Only CS/Software Devs need to do it
⇢ Lack of awareness from the Top-down & bottom-up: the lab lead/PI does not know about
GIT and/or has not signed up to OPEN SCIENCE
⇢ Personalities/Culture/Environment: Why should I share? Data Hoarding…
⇢ Fear of being judged, Fear of the unknown, Fear of the command line
⇢ Laziness? - Adding extra steps to their workflow - “I have to do what now???”
There is a need for Reproducibility and
Provenance in Data Science
There is a need for Reproducibility and
Provenance in Everything
“One of the largest sources of error in
[Data] Science results from computing
[and publishing] results from different
versions of the same data set.
And using different versions of the
same software…
And using different versions/implementations
of the “same” algorithm
And failing to capture & share all the
steps taken when building your
ML/AI model…
Seed? Hyperparameters? Training Split? Features?
Precision? Recall? Time of Day?
And failing to capture the state of your
Data at each iteration of the analysis...
The wider community is aware of this
There are solutions
We are getting better at it
Now over to Luke...
(No, not that one)
Who am I?
Luke Marsden
Founder & CEO of dotmesh
⇢ Hacker & entrepreneur
⇢ Developed first storage system & volume plugin
system for Docker
⇢ Kubernetes SIG lead
⇢ Formerly Computer Science @ Oxford
So you want to do reproducible data
science/AI/ML?
What do you need to pin down?
So you want to do reproducible data
science/AI/ML?
Environment
So you want to do reproducible data
science/AI/ML?
Environment
Code
Including
parameters
So you want to do reproducible data
science/AI/ML?
Environment
Code
Including
parameters
Data
How?
Pinning down environment
⇢ In the DevOps world, Docker has been a big hit.
⇢ Docker helps you pin down the execution
environment that your model training (or other
data work) is happening in.
⇢ What is Docker?
What is docker?
⇢ Like tiny frozen, runnable copies of your
computer's filesystem - e.g. Python libraries,
Python versions
⇢ You can determine the exact version of all the
dependencies of your data science code
⇢ You can build, ship & run exactly the same thing
anywhere… your laptop, a cluster, or the cloud
⇢ A Dockerfile lets you declare what versions of things you want; build a Docker image from the Dockerfile and push it to a registry
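As a rough illustration (the image name, Python version and library versions below are made up, not from the talk), a Dockerfile that pins a training environment might look like this:

```dockerfile
# Minimal sketch: pin the interpreter and library versions so the
# environment can be rebuilt identically on a laptop, a cluster or the cloud.
FROM python:3.6-slim

# Exact dependency versions for the data science code (illustrative)
RUN pip install --no-cache-dir \
      numpy==1.14.5 \
      pandas==0.23.1 \
      scikit-learn==0.19.1

# Copy in the training code and set the default command
COPY train.py /app/train.py
WORKDIR /app
CMD ["python", "train.py"]
```

You would then build the image (for example: docker build -t my-model-env:0.1 .) and push it to a registry so collaborators can pull and run exactly the same environment.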
Pinning down code
⇢ For decades developers have been version
controlling their code.
⇢ Tools like git are very popular.
What is git?
A version control system. Lets you track versions of your code and collaborate with others by commit, clone, push, pull…
Problems:
⇢ git looks kinda scary - but it's worth persisting
⇢ In data science, it's not natural to commit every time you change anything, e.g. while tuning parameters...
⇢ ...but you generate results while you're iterating
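One lightweight way to bridge that gap (a sketch, not a feature of git itself or of any tool mentioned here) is to stamp every result with the commit it came from, so results generated mid-iteration stay traceable:

```python
# Sketch: record which code version produced a result, even while iterating.
# Assumes the script runs inside a git checkout.
import subprocess

def current_commit() -> str:
    """Exact commit hash of the code being run."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def has_uncommitted_changes() -> bool:
    """True if the working tree is dirty, i.e. results may not be reproducible."""
    return bool(subprocess.check_output(["git", "status", "--porcelain"], text=True).strip())

tag = current_commit() + (" (uncommitted changes!)" if has_uncommitted_changes() else "")
print("code version for this run:", tag)
```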
Pinning down data
⇢ Method one: be very very organised
(meticulous folder structure)
+ Never overwrite files… backup
frequently… and get your whole team
to do the same
⇢ Method two: use versioned S3 buckets
What is S3?
A scalable object store on Amazon Web Services. Store lots of data quite cheaply. Version your objects (files) so that you can solve the problem of data changing "under your feet".
Problems:
⇢ When you run an experiment, it's not natural to note down all the object versions
⇢ You generally care about the version of the whole bucket, not every single individual object (but S3 has no such notion)
⇢ You could build a system to track this, but you've got more important science to be doing...
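As a rough sketch of method two (assuming the boto3 library, AWS credentials and a bucket of your own; the bucket name, key and contents below are illustrative):

```python
# Sketch: use S3 object versioning so a run can pin the exact data it read.
import boto3

s3 = boto3.client("s3")
bucket = "my-research-data"  # hypothetical bucket

# Enable versioning once per bucket
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Every upload now returns a VersionId you can record alongside your results
resp = s3.put_object(
    Bucket=bucket,
    Key="training/genotypes.csv",
    Body=b"sample_id,genotype\nS1,AA\n",  # illustrative content
)
data_version = resp["VersionId"]

# Later, re-fetch exactly the bytes that run used, even if the object changed since
obj = s3.get_object(Bucket=bucket, Key="training/genotypes.csv", VersionId=data_version)
```

As the problems above note, this pins individual objects rather than the state of the whole bucket, and you still have to record the VersionIds somewhere yourself.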
So you want to track provenance in data
science/AI/ML?
What do you need to pin down?
So you want to track provenance in data
science/AI/ML?
[Diagram: provenance graph in which Data A and Data B are inputs to Code 1, whose output is Data C; Data C is the input to Code 2, which outputs Model 1 and Model 2]
Pinning down data provenance
If you can record the graph, you can point to any artefact/model and ask "show me exactly where this came from"... the exact version of the tool which generated it, what input data that tool used. And the transitive closure thereof.
Possible tools:
⇢ CWL, Pachyderm
Problems:
⇢ Require you to define the data pipeline up front
⇢ You don't always know the data pipeline up front
⇢ Often you're figuring it out as you go along, and it's evolving...
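A minimal sketch of the "record the graph, then ask where anything came from" idea (an illustrative data structure, not CWL, Pachyderm or dotscience; all names and versions are hypothetical):

```python
# Sketch: record each artefact's producer and inputs, then walk the graph
# to answer "show me exactly where this came from" (the transitive closure).
provenance = {
    # artefact: (tool that produced it, exact tool version, inputs it consumed)
    "data_c":  ("clean.py", "git:4f2a9c1", ["data_a", "data_b"]),  # hypothetical
    "model_1": ("train.py", "git:8b77d03", ["data_c"]),
    "model_2": ("train.py", "git:8b77d03", ["data_c"]),
}

def lineage(artefact: str, depth: int = 0) -> None:
    """Print everything an artefact was derived from, recursively."""
    tool, version, inputs = provenance.get(artefact, ("raw input", "-", []))
    print("  " * depth + f"{artefact}  <-  {tool} ({version})")
    for parent in inputs:
        lineage(parent, depth + 1)

lineage("model_1")  # traces back through data_c to data_a and data_b
```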
One more problem
⇢ Sad fact. People don't care about reproducibility
as much as their day to day work (getting a paper
published, shipping an optimised model to
production, …)
⇢ Can we introduce reproducibility & provenance to
people while also helping them get work done
faster and more accurately? And collaborate
better with their team?
The sweetener - track summary stats
We asked dozens of data scientists to describe their workflows and their pain points. One problem stood out…
⇢ How do you track the progress/performance/results of your models? Your data science team?
⇢ Answers ranged from “in a google spreadsheet” to “in a text file”, “on a piece of paper” or even “verbally”!
⇢ Ideally, integrate summary stats tracking into a solution...
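For a sense of what "integrating summary stats tracking" can mean in practice, here is a do-it-yourself sketch (a generic illustration, not the dotscience API): append one row per training run to a shared log, then compare runs across the team.

```python
# Sketch: DIY run tracking - one row per training run with who/when/params/metric.
import csv
import os
from datetime import datetime, timezone

LOG = "runs.csv"  # hypothetical shared log file

def log_run(who: str, params: dict, error_rate: float) -> None:
    """Append a run to the log, writing a header the first time."""
    is_new = not os.path.exists(LOG)
    with open(LOG, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["who", "when", "parameters", "error_rate"])
        writer.writerow([
            who,
            datetime.now(timezone.utc).isoformat(),
            ";".join(f"{k}={v}" for k, v in params.items()),
            error_rate,
        ])

log_run("alice", {"filter_snps": 150}, 0.60)  # illustrative values
```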
If only this was all a bit easier...
The reason we're running this event today
Introducing dotscience
Solves reproducibility:
⇢ Tracks environment with Docker
⇢ Tracks data in versioned S3 buckets + dotmesh filesystem
⇢ Tracks code versions which generate summary stats in dotmesh + diff against git
⇢ Integrates with Jupyter (RStudio & scripts coming soon)
[Diagram: Environment, Code and Data]
Introducing dotscience
Solves provenance:
⇢ Builds the provenance graph on the fly
⇢ For any dataset, see what code generated it as the output of which other code, transitively
⇢ For any model, see exactly what code generated it, and what data that model was trained on
[Diagram: Data C is the input to Code 2, which outputs Model 1 and Model 2]
Introducing dotscience
Solves summary stats tracking:
⇢ Builds a table and chart of every run. Snapshots and keeps together:
  + versioned dataset
  + versioned model
  + all model parameters
  + compute environment
⇢ See performance not just of your work over time, but your whole team.
Who     | When           | Parameters      | Error rate
Alice   | 2 minutes ago  | filter_snps=150 | 60%
Bob     | 2 hours ago    | filter_snps=200 | 30%
Charlie | 12 hours ago   | filter_snps=100 | 50%
Live demo time!
You can try this yourself this afternoon!
Roadmap for dotscience
⇢ Cloud Storage
⇢ R & RStudio, scripts, 'ds run' CLI support
⇢ Cluster support - Kubernetes
⇢ Spark/HDFS, MLlib
⇢ Slice & dice
⇢ Collaboration
⇢ Search and discovery
⇢ Multi-tenant execution, 1-click cluster installer, local installers
We need
your help!
Thanks, questions?
beta.dotscience.io
slack.dotscience.io
@lmarsden
@s_j_newhouse
#rapids2018
@getdotmesh