The keynote talk from RAPIDS 2018 in London.
Dr Stephen J Newhouse and Luke Marsden explain why now is the moment to take Reproducibility and Provenance in Data Science (RAPIDS) seriously, and how this can be achieved with process and tooling.
Stephen shares his experiences of the challenges in the industry and Luke introduces the beta version of Dotscience, a tool for model tracking and collaboration through RAPIDS.
RAPIDS 2018 - Keynote - How I learned to stop worrying and love version control
1. Welcome & Today's Schedule
09:00  Registration and breakfast
10:00  How I learned to stop worrying and love version control - Dr Stephen J Newhouse and Luke Marsden
10:40  Effective computing for research reproducibility - Dr Laura Fortunato
11:20  Morning Break
11:40  A crazy little thing called reproducible science - Dr Tania Allard
12:20  Machine Learning in Production - A practical approach to continuous deployment of Machine Learning pipelines - Luca Palmieri & Christos Dimitroulas
13:00  Lunch
14:00 - 17:30  Version Control for your Model, Data and Environment - Workshop
18:00  Networking drinks
#rapids2018
@getdotmesh
4. How I learned to stop worrying
and love version control
Steve Newhouse & Luke Marsden
5. Who am I?
Dr. Stephen J Newhouse
Lead Data Scientist & Senior Bioinformatician
• KCL Department of Biostatistics and Health Informatics
• NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust
• UCL Institute of Health Informatics & Health Data Research (HDR) UK
6. Our (broad) interests...
• From Bench (to Computer) to Bedside…
• Collaborative & Open Research…
• Data & knowledge sharing…
• DevOps applied to Health Care/Research
• Personalised Medicine & the Quantified Self
8. “Provenance is the Missing Feature for Rigorous Data Science”
Joe Doliner
Co-Founder, CEO of Pachyderm
9. “The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms”
Hazy: Making it Easier to Build and Maintain Big-data Analytics
10. “The next breakthrough in data analysis may not be in individual algorithms, but in the ability to rapidly combine, deploy, and maintain existing algorithms
In A Fully Reproducible Way And Under Complete Provenance”
Steve Newhouse, RAPIDS 2018
11. Provenance & Reproducibility
• Provenance: “Origin of something…”
• Track & Document the data source and the “models”
• Who, What, Where, Why, When and How at every stage of ETL, EDA, ML, Reporting
• Captures dependency between data sets: Enables reproducibility
• Results can be traced back to their origins and recomputed from scratch:
+ Good for REPRODUCIBILITY
+ Good Practice (should be BEST PRACTICE)
+ Good for GDPR!
12. Why RAPIDS is important to me
13.
14. HDR UK: A new national Institute for
health data science
• Structured and unstructured (e.g. imaging, text) data for derivation of new or deep phenotypes
• Adding value at scale to existing world-leading cohorts in the UK
• Demonstrating system-wide opportunities for research that improves quality of care
• Enable large scale, high-throughput research that combines genomic data with electronic health records
• Genomics, epigenomics, statistical and complex genetics, population genetics, cancer “omics”, molecular epidemiology
Actionable Health Data Analytics | Precision Medicine
15. HDR UK: A new national Institute for
health data science
• Transform Phase II - Phase IV clinical trials including “real world evidence” studies
• Towards prevention & early intervention
• Ability to link health and administrative datasets across multiple environments
• New technologies, from sensors to wearable devices to artificial intelligence
21st Century Trial Design | Modernising Public Health | Training Future Leaders in health data science
17. It's All About The Data and #datasaveslives
[Diagram labels: Investigator, Monitor, Laboratories, Data Manager, Analyst, Clinician, Results, Clinical Data]
18. How did I learn to stop worrying
and love version control?
19. I made the transition from Lab
tech to Bioinformatician
21. And the crap we do/did that contributed to…
The Reproducibility Crisis
22. My new world
• Git, GitHub & READMEs...
• R & Python Notebooks...
• Meta-Data, Ontologies...
• Shared Code…
• Open Science, Open Data…
• FAIR Principles…
• Community!
23. My old world
• I started in the lab...I was a Molecular Biologist/Geneticist
• Provenance & Reproducibility == The Lab book, Publications…
• Data stored on local Internal/External and University Drives/HPC
24. Lab books: RAPIDS the old way
• Record Everything
• Name, Version, Project, Date.
• Materials & Methods...
• Signed off (sometimes)
• Double Checked (sometimes)
• Varying level of Detail: one liners to prose...
We (Lab folks) are kind of doing it anyway...with lab books
25. How we (Basic Academia) often do
Version Control
• Formal Version Control & Data Provenance? Nope, not really
• Documentation/Report: Excel, Word, PowerPoint & Images...cut and paste into lab book...heavily edited for publication...
• Analysis: GUIs/SPSS, barplots in Excel...never record steps taken or software versions until publication...
• Data: Local HDD, HPC, Dropbox...often only location is recorded…
• Document & Data versions can often be overwritten or lost and then there is this...
27. There is little to no formal Data/Model provenance
& Version Control: The Story We Tell
[Diagram: Raw Data, then v1, v2, v3, Final; Publication 1, Publication 2, Publication 3]
28. There is little to no formal Data/Model provenance
& Version Control: The Truth...
[Diagram: Raw Data, then a tangle of my_data, v1.xx, v2, Final, xxx, zzz, Final, Final, v2xy; Publication 1, Publication 2, Publication 3]
29. In Academia, a lot of folks* don't do Version Control & Provenance. If they do, it's haphazard and under duress.**
It is not standard operating procedure!
*Academics/basic researchers: Statisticians, Economists, Bio/health-informaticians, Biologists and clinicians who can do R/Python/Stata/SAS/SPSS.
**Extra work needed in keeping lab books, documenting everything, cleaning code, sharing code...not used to this way of working
32. It is not enforced in many labs
Incentives are not aligned with RAPIDS
Pressure to Publish Quickly
33. “[I was] completely unaware of robust solutions & common best practices from the Software/DevOps World…
So were my supervisors!”
34. The term Provenance was
never mentioned
Replication/Reproducibility
were just catch-phrases
Plus, this was all supposed to be captured in our Lab books… and then in the publication? Right?....
36. I may be a bit cynical but...
37. Some sobering reading
The public are on to us, the “shoddy” scientists
• “Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis” / The Economist
• How science goes wrong / The Economist
• Trouble at the lab / The Economist
• Is It Tough Love Time For Science? / The Big Think
• Some of this is purely down to bad data management (& bad practices & lack of awareness & lack of education & so on and so forth…)
38. Garbage in, Garbage out (GIGO)
Bad/NO Data Management/Experimental Design/Analysis Plan
• Spurious results/False positives and negatives
• Translational research suffers
• The patients suffer
• Lies are published
• Time & Money wasted (Charity, Public, Private…)
• There is no real progress
• Serious Legal & Ethical implications: GDPR!
41. AI faces a reproducibility crisis
• “I think people outside the field might assume that because we have code, reproducibility is kind of guaranteed,” … “Far from it.”
• The most basic problem is that researchers often don't share their source code (and their Data)
• “The exact way that you run your experiments is full of undocumented assumptions and decisions,” ... “A lot of this detail never makes it into papers.”
• “No time to document every hyperparameter”...
42. Some common issues
• Common misconception: Only CS/Software Devs need to do it
• Lack of awareness from the Top-down & bottom-up: the lab lead/PI does not know about Git and/or has not signed up to OPEN SCIENCE
• Personalities/Culture/Environment: Why should I share? Data Hoarding…
• Fear of being judged, Fear of the unknown, Fear of the command line
• Laziness? - Adding extra steps to their workflow - “I have to do what now???”
43. There is a need for Reproducibility and
Provenance in Data Science
44. There is a need for Reproducibility and
Provenance in Everything
45. “One of the largest sources of error in [Data] Science results from computing [and publishing] results from different versions of the same data set.”
46. And using different versions of the
same software…
47. And using different versions/implementations
of the “same” algorithm
48. And failing to capture & share all the
steps taken when building your
ML/AI model…
Seed? Hyperparameters? Training Split? Features?
Precision? Recall? Time of Day?
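One way to capture the items this slide lists is to write them to a single record at the end of every run. The sketch below is only an illustration under stated assumptions: `record_run`, its field names and the example values are all hypothetical, and the metric values are placeholders rather than real results.

```python
import json
import random
from datetime import datetime, timezone


def record_run(seed, hyperparameters, train_fraction, features, metrics):
    """Bundle everything needed to repeat one training run into a dict."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "seed": seed,
        "hyperparameters": hyperparameters,
        "train_fraction": train_fraction,
        "features": sorted(features),
        "metrics": metrics,
    }


# Hypothetical values, for illustration only.
seed = 42
random.seed(seed)  # seed the RNG before any shuffling or splitting
run = record_run(
    seed=seed,
    hyperparameters={"learning_rate": 0.01, "max_depth": 4},
    train_fraction=0.8,
    features=["bmi", "age"],
    metrics={"precision": 0.0, "recall": 0.0},  # placeholders, not real results
)
print(json.dumps(run, indent=2, sort_keys=True))
```

Writing the record as JSON keeps it easy to diff and to store next to the model artefact.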
49. And failing to capture the state of your
Data at each iteration of the analysis...
50. The wider community is aware of this
There are solutions
We are getting better at it
51. Now over to Luke...
(No, not that one)
52. Who am I?
Luke Marsden
Founder & CEO of dotmesh
• Hacker & entrepreneur
• Developed first storage system & volume plugin system for Docker
• Kubernetes SIG lead
• Formerly Computer Science @ Oxford
53. So you want to do reproducible data
science/AI/ML?
What do you need to pin down?
54. So you want to do reproducible data
science/AI/ML?
Environment
55. So you want to do reproducible data
science/AI/ML?
Environment
Code (including parameters)
56. So you want to do reproducible data
science/AI/ML?
Environment
Code (including parameters)
Data
58. Pinning down environment
• In the DevOps world, Docker has been a big hit.
• Docker helps you pin down the execution environment that your model training (or other data work) is happening in.
• What is Docker?
59. What is Docker?
• Docker images are like tiny frozen, runnable copies of your computer's filesystem - e.g. Python libraries, Python versions
• You can determine the exact version of all the dependencies of your data science code
• You can build, ship & run exactly the same thing anywhere… your laptop, a cluster, or the cloud
• A Dockerfile lets you declare what versions of things you want; build a Docker image from a Dockerfile and push it to a registry
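Docker freezes the whole environment. A lighter-weight complement, not a replacement, is to record which interpreter and package versions a run actually used. The sketch below uses only the Python standard library; `capture_environment` is a hypothetical helper name, not part of Docker or of any tool mentioned in the talk.

```python
import platform
from importlib import metadata


def capture_environment():
    """Return the interpreter, OS and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()
print(env["python"], "with", len(env["packages"]), "installed packages")
```

Storing this dictionary with each result at least tells you later which versions produced it, even when no image was built.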
60. Pinning down code
• For decades developers have been version controlling their code.
• Tools like git are very popular.
61. What is git?
A version control system. Lets you track versions of your code and collaborate with others by commit, clone, push, pull…
Problems:
• git looks kinda scary - but it's worth persisting
• In data science, it's not natural to commit every time you change anything, e.g. while tuning parameters...
• ...but you generate results while you're iterating
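One partial workaround for the "results while iterating" problem is to stamp every result with the current commit and a flag saying whether the working tree had uncommitted changes. The sketch below is a minimal illustration, assuming the `git` command-line tool is installed; `git_state` is a hypothetical helper, and it only shells out to `git rev-parse HEAD` and `git status --porcelain`.

```python
import subprocess


def git_state(repo_dir="."):
    """Return (commit_hash, is_dirty), or (None, None) outside a git repo."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout.strip()
        status = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None, None  # git missing, or not inside a repository
    # Any porcelain output means there are uncommitted changes.
    return commit, bool(status.strip())


commit, dirty = git_state()
print("commit:", commit, "uncommitted changes:", dirty)
```

A result recorded with `dirty` set to true is a warning that the commit hash alone will not reproduce it.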
62. Pinning down data
• Method one: be very very organised (meticulous folder structure)
+ Never overwrite files… backup frequently… and get your whole team to do the same
• Method two: use versioned S3 buckets
63. What is S3?
A scalable object store on Amazon Web Services. Store lots of data quite cheaply. Version your objects (files) so that you can solve the problem of data changing "under your feet".
Problems:
• When you run an experiment, it's not natural to note down all the object versions
• You generally care about the version of the whole bucket, not every single individual object (but S3 has no such notion)
• You could build a system to track this, but you've got more important science to be doing...
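The idea of one version for the whole dataset, rather than one per object, can be illustrated locally by hashing every file into a single fingerprint. This is a sketch of the concept only: it does not talk to S3, and `dataset_fingerprint` and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root):
    """Hash every file under root into one hex digest for the whole dataset."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the fingerprint too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


# Demonstration on a throwaway directory with a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "samples.csv").write_text("id,value\n1,10\n")
    before = dataset_fingerprint(tmp)
    (Path(tmp) / "samples.csv").write_text("id,value\n1,11\n")
    after = dataset_fingerprint(tmp)

print(before != after)  # the fingerprint changes when any file changes
```

Recording the fingerprint with each experiment gives a single identifier to note down. Very large files would need to be hashed in chunks rather than read whole.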
64. So you want to track provenance in data
science/AI/ML?
What do you need to pin down?
65. So you want to track provenance in data
science/AI/ML?
[Diagram: provenance graph linking Data A, Data B, Data C, Code 1, Code 2, Model 1 and Model 2 by Input and Output edges]
66. Pinning down data provenance
If you can record the graph, you can point to any artefact/model and ask "show me exactly where this came from"... the exact version of the tool which generated it, what input data that tool used. And the transitive closure thereof.
Possible tools:
• CWL, Pachyderm
• Require you to define the data pipeline up front
Problems:
• You don't always know the data pipeline up front
• Often you're figuring it out as you go along, and it's evolving...
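The "show me exactly where this came from" query is a transitive closure over a graph of what produced what. The sketch below shows that query on a plain dictionary. The node names are borrowed from the earlier diagram, but the specific edges are an assumption made for illustration, and `lineage` is a hypothetical helper rather than the API of CWL, Pachyderm or Dotscience.

```python
# Each artefact maps to the code and data that directly produced it.
# The edges are hypothetical; the slide's diagram does not fix them exactly.
PRODUCED_BY = {
    "Data C": ["Code 1", "Data A", "Data B"],
    "Model 1": ["Code 2", "Data C"],
    "Model 2": ["Code 2", "Data C"],
}


def lineage(artefact, produced_by=PRODUCED_BY):
    """Return every upstream input of an artefact (the transitive closure)."""
    seen = set()
    stack = [artefact]
    while stack:
        node = stack.pop()
        for parent in produced_by.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(lineage("Model 1")))
# ['Code 1', 'Code 2', 'Data A', 'Data B', 'Data C']
```

Because the graph is just recorded edges, it can be appended to as the pipeline evolves instead of being declared up front.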
67. One more problem
• Sad fact. People don't care about reproducibility as much as their day to day work (getting a paper published, shipping an optimised model to production, …)
• Can we introduce reproducibility & provenance to people while also helping them get work done faster and more accurately? And collaborate better with their team?
68. The sweetener - track summary stats
We asked dozens of data scientists to describe their workflows and their pain points.
One problem stood out…
• How do you track the progress/performance/results of your models? Your data science team?
• Answers ranged from “in a Google spreadsheet” to “in a text file”, “on a piece of paper” or even “verbally”!
• Ideally, integrate summary stats tracking into a solution...
69. If only this was all a bit easier...
70. The reason we're running this event today
71. Introducing dotscience
Solves reproducibility:
• Tracks environment with Docker
• Tracks data in versioned S3 buckets + dotmesh filesystem
• Tracks code versions which generate summary stats in dotmesh + diff against git
• Integrates with Jupyter (RStudio & scripts coming soon)
[Diagram: Environment, Code, Data]
72. Introducing dotscience
Solves provenance:
• Builds the provenance graph on the fly
• For any dataset, see what code generated it, and which other code produced that code's inputs, transitively
• For any model, see exactly what code generated it, and what data that model was trained on
[Diagram: Data C is the input to Code 2, which outputs Model 1 and Model 2]
73. Introducing dotscience
Solves summary stats tracking:
• Builds a table and chart of every run. Snapshots and keeps together:
+ versioned dataset
+ versioned model
+ all model parameters
+ compute environment
• See the performance not just of your own work over time, but of your whole team.

Who      When           Parameters       Error rate
Alice    2 minutes ago  filter_snps=150  60%
Bob      2 hours ago    filter_snps=200  30%
Charlie  12 hours ago   filter_snps=100  50%
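A run table like the one on this slide can be kept as plain structured data rather than in a spreadsheet or on paper. The sketch below reuses the slide's three example rows, with error rates written as fractions. `best_run` and `to_csv` are hypothetical helpers for illustration and say nothing about how Dotscience stores its runs.

```python
import csv
import io

# Rows copied from the slide's example table; error rates as fractions.
RUNS = [
    {"who": "Alice", "when": "2 minutes ago", "filter_snps": 150, "error_rate": 0.60},
    {"who": "Bob", "when": "2 hours ago", "filter_snps": 200, "error_rate": 0.30},
    {"who": "Charlie", "when": "12 hours ago", "filter_snps": 100, "error_rate": 0.50},
]


def best_run(runs):
    """Return the run with the lowest error rate."""
    return min(runs, key=lambda run: run["error_rate"])


def to_csv(runs):
    """Serialise the runs as CSV text so the table can be shared or diffed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(runs[0]))
    writer.writeheader()
    writer.writerows(runs)
    return buffer.getvalue()


print(best_run(RUNS)["who"])  # Bob
```

Keeping parameters and metrics in the same row is what makes a question like "which setting of filter_snps worked best?" answerable later.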