Taming the reproducibility crisis

www.scling.com
Nordic data science and machine learning summit
2019-10-16
Lars Albertsson, founder @ Scling

A typical data science journey, phase one
● Data scientists in a corner
○ Weak engineering support
○ Weak product connection
● Manually exported data
● Single version of datasets
○ Good reproducibility
● Reuse data until overfit
[Diagram: The lab]

Tell us a story
● Notebooks are a great tool for
○ Displaying
○ Data storytelling
○ Personal playground
● Less suitable for
○ Scientific experimentation
○ Collaboration
○ Production
● Hidden state, out-of-order execution, difficult to reuse, weak IDE, no lint, no modularity, scaling, …
○ Joel Grus: “I don’t like notebooks”, https://youtu.be/7jiPeIFXb6U
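The hidden state and out-of-order execution problems listed above can be illustrated with a minimal sketch. It does not use a real notebook; three hypothetical "cells" are modelled as functions sharing one namespace, the way notebook cells share one kernel. The cell contents are invented for illustration.

```python
# Each "cell" mutates a shared namespace, as notebook cells share one kernel.
def cell_1(ns):
    ns["scale"] = 2


def cell_2(ns):
    ns["values"] = [v * ns["scale"] for v in [1, 2, 3]]


def cell_3(ns):
    ns["scale"] = 10  # a later cell that overwrites earlier state


def run(cells):
    """Execute cells in the given order against a fresh namespace."""
    ns = {}
    for cell in cells:
        cell(ns)
    return ns


top_to_bottom = run([cell_1, cell_2, cell_3])
out_of_order = run([cell_1, cell_3, cell_2])

print(top_to_bottom["values"])  # [2, 4, 6]
print(out_of_order["values"])   # [10, 20, 30]
```

The same code gives different results depending on the execution order, and the saved notebook does not record which order was used.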

A typical data science journey, next step
[Diagram: The lab; Data lake; Flowing experiment]

A typical data science journey, phase two
● Data scientists + data engineers
○ Pipelines, fresh data
● Historical data. Whoosh!
● Train on all data until now
○ Changes every day
● Evaluate model on new data
○ Avoid manual overfit. Swell!
[Diagram: Data lake; Flowing experiment]

Changing data, volatile workflows
● Data scientists + data engineers
○ Pipelines, fresh data
● Historical data. Whee!
● Train on all data until now
● Evaluate model on new data
○ Avoid manual overfit. Swell!

Data unscience

Data unengineering

Data unengineering
“How do I test the pipeline?”
“You temporarily change the output path and run manually. See if the output looks good.”
“What if I forget to change path?”
“Don’t do that.”
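One way out of the dialogue above is to make the output path an argument, so that an automated test can point the job at a throwaway directory. The following is a minimal sketch under that assumption; `run_pipeline`, the CSV layout, and the filtering rule are hypothetical and stand in for a real job.

```python
import csv
import tempfile
from pathlib import Path


def run_pipeline(input_path: Path, output_path: Path) -> None:
    """Hypothetical job: keep rows with a positive 'amount' column."""
    with input_path.open(newline="") as src, output_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row["amount"]) > 0:
                writer.writerow(row)


def test_pipeline_drops_non_positive_amounts() -> list:
    # The test owns its input and output; production paths are never touched.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.csv"
        dst = Path(tmp) / "out.csv"
        src.write_text("user,amount\na,5\nb,-1\nc,0\n")
        run_pipeline(src, dst)
        with dst.open(newline="") as f:
            rows = list(csv.DictReader(f))
    assert rows == [{"user": "a", "amount": "5"}]
    return rows


test_pipeline_drops_non_positive_amounts()
```

Because the paths are injected, there is no path to forget to change back, and "see if the output looks good" becomes an assertion.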

Typical data science journey, phase three
[Diagram: The lab; Flowing experiment; Integrated, iterative; Data lake]

Reproducibility starts to matter
● Initially large strides
● Diminishing returns → Precision measurement → Reproducible experiment
● Tame reproducibility or slow innovation
[Diagram: Integrated, iterative; Data lake]

Reproducibility, technical view
● Train on known data
○ Batch only, never streaming
○ Explicitly enumerated datasets
○ No live sources
○ Not “all data that we have” or “latest state”
● Mastering workflow orchestration is key
● Lineage & provenance will become focus
○ Current tooling inadequate
● Random == not reproducible
○ But necessary for training
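The points above about explicitly enumerated datasets and randomness can be sketched together. This is a minimal illustration, not a prescribed method: the `events/<date>` partition naming, the manifest layout, and the function names are assumptions made for the example. Note that the fingerprint covers partition names and the seed only, not the data contents.

```python
import hashlib
import json
import random
from datetime import date, timedelta


def enumerate_partitions(first: date, last: date) -> list:
    """List dataset partitions explicitly, one per day, both ends inclusive."""
    days = (last - first).days
    return [f"events/{(first + timedelta(d)).isoformat()}" for d in range(days + 1)]


def experiment_manifest(partitions: list, seed: int) -> dict:
    """Record what decides the outcome: the named inputs and the seed."""
    payload = json.dumps({"partitions": partitions, "seed": seed}, sort_keys=True)
    return {
        "partitions": partitions,
        "seed": seed,
        "fingerprint": hashlib.sha256(payload.encode()).hexdigest(),
    }


def shuffled_training_order(manifest: dict) -> list:
    """Randomness stays, but it is drawn from a seeded, local generator."""
    rng = random.Random(manifest["seed"])
    order = list(manifest["partitions"])
    rng.shuffle(order)
    return order


# Not "all data that we have": a fixed, hypothetical week of partitions.
manifest = experiment_manifest(
    enumerate_partitions(date(2019, 10, 1), date(2019, 10, 7)), seed=42
)
```

Running `shuffled_training_order(manifest)` twice returns the same order, so a rerun of the experiment sees the same inputs in the same sequence.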

Heathen data science - 2019
● Please send me a copy of the latest model.
● 20 steps from model to production.
● 6 months to production.
● Different data science / development / QA teams.
● Which data were used to build this model?
● Bash script for building model from data.
● No feedback loop from operations to data scientists.
● Which hyperparameters should we use?
● The model can only be applied in this environment.
● Code represented as notebooks.

Heathen software engineering - 1999
● Please send me a copy of the latest source code.
● 20 steps from source to production.
● Different development / QA / operations teams.
● Which files were used to build this version?
● Bash script for building artifacts from source.
● No feedback loop from operations to development.
● Which compiler flags should we use?
● 6 month release cycle.
● The build only works on this machine.
● Code represented as UML models.

Two decades of software engineering later
● Team formation along value streams - DevOps.
● Everything as configuration (or code), in version control.
● Immutable artifacts, (hermetically) rebuilt from source.
● Continuous delivery (and continuous integration) with swift quality feedback.

Value stream team formation - DevOps
[Diagram: Development, QA, Operations; value streams for Product A, Product B, Product C]

Value stream team formation - DataOps
[Diagram: Data science, Data engineering, Operations; value streams for Product A, Product B, Product C]

Immutable artifacts from source
[Diagram: Source code; Build system; artifacts (ELF, WAR, Container image); Test; Deploy; CI / CD pipeline]
● Nobody wants to work without
○ But some still do
● Strong workflow from source
○ Not yet hermetic
● Immutable code artifacts

Immutable models from raw data
[Diagram: Cold store data; Data pipeline; Workflow orchestration; Container image + stored model; Eval; Deploy]
● Nobody wants to work without
○ Most are not aware
● Strong workflow from raw data
○ Hermetic?
● Immutable data artifacts
● Key component: workflow orchestrator
● Train in batch
○ Reproducible
○ Infer in batch/stream/online
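The idea of immutable model artifacts rebuilt from raw data can be sketched as follows. This is a toy under stated assumptions, not a real orchestrator: `train` is a stand-in that only computes a mean, and `build_model`, the `code_version` string, and the file naming are hypothetical. The artifact name is derived from the raw data and the code version, so a given input always maps to the same, never-overwritten output.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def train(rows: list) -> dict:
    """Stand-in for training: a 'model' that is just the mean of the inputs."""
    return {"mean": sum(rows) / len(rows)}


def build_model(raw_data: Path, code_version: str, store: Path) -> Path:
    """Build a model artifact whose name is derived from its inputs."""
    raw_bytes = raw_data.read_bytes()
    key = hashlib.sha256(code_version.encode() + b"\0" + raw_bytes).hexdigest()[:16]
    target = store / f"model-{key}.json"
    if target.exists():
        return target  # already built; artifacts are never overwritten
    model = train([float(value) for value in raw_bytes.decode().split()])
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(model, sort_keys=True))
    tmp.rename(target)  # publish in one step so readers never see partial output
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    raw = Path(tmp_dir) / "raw.txt"
    raw.write_text("1\n2\n3\n")
    first = build_model(raw, "v1", Path(tmp_dir))
    rebuilt = build_model(raw, "v1", Path(tmp_dir))       # same inputs, same artifact
    changed_code = build_model(raw, "v2", Path(tmp_dir))  # new code, new artifact
    model_mean = json.loads(first.read_text())["mean"]
```

Workflow orchestrators typically apply the same "skip if the target exists" rule across many dependent tasks; the sketch shows it for a single step.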

Everything as config (or code)
● Business logic
● Test code + test data
● Application configuration
● Deployment & dev tooling
● Infrastructure
● Monitoring, alert, other ops
Some in weak languages (YAML, HCL).
Prefer code.

Everything as config (or code)
[Figure legend: Size = effort, Colour = code complexity. Credits: “Hidden Technical Debt in Machine Learning Systems”, Google, NIPS 2015]
● Model code
● Test code + test data
○ Fuzzy testing - solved problem
● Hyperparameters
● Deployment & dev tooling
● Infrastructure
● Monitoring, alert, other ops
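As one illustration of treating hyperparameters as code in version control, the sketch below uses a frozen dataclass. The class, its field names, and the default values are hypothetical and not tied to any particular library.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Hyperparameters:
    """Hypothetical settings; the field names are illustrative only."""

    learning_rate: float = 0.01
    max_depth: int = 6
    n_estimators: int = 200

    def __post_init__(self) -> None:
        # Invalid settings fail when the object is constructed, e.g. in a unit test.
        if not 0 < self.learning_rate <= 1:
            raise ValueError("learning_rate must be in (0, 1]")
        if self.max_depth < 1 or self.n_estimators < 1:
            raise ValueError("max_depth and n_estimators must be positive")


BASELINE = Hyperparameters()
# An experiment becomes a reviewable, versioned change to the baseline.
DEEPER_TREES = replace(BASELINE, max_depth=10)

print(asdict(DEEPER_TREES))
```

Compared with a YAML file, the settings are type-annotated, validated on construction, and can be refactored and tested with ordinary tooling.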
[Diagram: ML; Configuration; Data collection; Data verification; Feature extraction; Machine resource management; Analysis tools; Process management tools; Serving infrastructure; Monitoring]

Continuous delivery (+ CI) with swift feedback
● Short time from code to production feedback
● There is no tradeoff between speed and reliability
[Diagram: Integrated, iterative]

Swift feedback for machine learning
● Siloed: 6+ months
○ Cultural work
● Autonomous: 1 month
○ Technical work, reproducibility
● Coordinated: days
[Diagram: Integrated, iterative; Data lake]

Skip to phase three
[Diagram: The lab; Flowing experiment; Integrated, iterative; Data lake]

Mix data scientists with developers, QA, ops
● Expect conflicts in work methods
● Facilitate mutual learning
● Limit scope of weak tools & workflows
○ But don’t force clunky tools on data scientists
Reproducibility is a technical problem with a human solution
Mark Coleman: Inextricably Linked: Reproducibility & Productivity in Data Science & AI
https://youtu.be/eORATxPx1Bw

Who’s talking?
Lars Albertsson, @lalleal
Ex: Google, Spotify, Schibsted, freelance
Founder: Scling - data-value-as-a-service
● Siloed: 6+ months
● Autonomous: 1 month
● Coordinated: days
[Diagram: Integrated, iterative; Data lake]
