Why Data Science is not Software Development
and
How experiment management helps bring
science back to data science
kuba@neptune.ml
@NeptuneML
https://medium.com/neptune-ml
Jakub Czakon
● Why Data Science is not Software Development
● What is Experiment Management and how to use it
Agenda
Data Science vs Software Development
Development:
● Feature scoping
● Extensive testing
● Code review and refactoring
● Extensive monitoring
● Exception handling
● Almost entire codebase used
● ...

Science:
● Explore
● Iteratively try ideas
● Communicate/question/analyse results
● Almost entire codebase dropped
● ...
Data Science vs Software Development
“Render unto Software Development the things that are Software Development’s, and unto Data Science the things that are Data Science’s.”
Data Science Project

Development:
● Data access/preprocessing
● Feature extraction
● Model inference pipelines/REST API
● Exception handling
● Results monitoring
● Model retraining
● Resource provisioning
● ...

Science:
● Data exploration
● Hypothesis testing
● Feature prototyping/development
● Model prototyping/development
● Pipeline comparison
● Results exploration
● Problem understanding
● ...
Data Science Project

Development:
● When you know what to do
● When things will be done more than once
● ...

Science:
● When you don’t yet know
● When things could end up being done just once
● ...
Managing Data Science just like
Software Development is wrong
link
Managing Data Science just like
Software Development is wrong
Infamous quotes:
● “What insights will you be able to derive from this?”
● “How long will the data exploration take?”
● “When can we expect to have a working model?”
● “When will we be able to improve by 20%?”
● “What is the MVP accuracy for this problem?” “100%.”
“What is your current human accuracy?” “I don’t know, maybe 70%.”
Data Science is not just
machine learning
link to Quora
Data Science is
really close to business
[Diagram: a linear release history (release-0 → release-1 → release-2) next to a branching tree of experiments (exp-1.0, exp-1.1, exp-1.1.1, exp-1.1.2, exp-1.1.2.1, exp-1.1.3, exp-2.0, exp-2.1, exp-2.1.0, exp-2.1.1)]
development → .git
science → .how?
Data science projects are
experimental by design
In data science using just .git
makes keeping track of things .hard
[Diagram repeated: the linear release history (release-0 → release-2) fits .git, but the branching experiment tree (exp-1.0 … exp-2.1.1) is marked with a “?”]
To re-run you need more than just code
Experiment management
“Experiment management is a process of tracking experiment
metadata, organizing it in a meaningful way, and making it available
to access and collaborate on within your organization.” – me
What is experiment management?
No framework was hurt during
the production of this talk...
...but hopefully some will be after.
Track
Code:
● Version scripts
● Version notebooks:
○ nbdime, jupytext, neptune-notebooks
● Magic numbers -> hyperparameters
● Make sure your notebook runs top-to-bottom
jupyter nbconvert --to script nb.ipynb; python nb.py
Hyperparameters:
● Everything goes into config
● If passed via command line -> automagically goes into config
● If passed via a script -> automagically goes into config
Bonus: hyperparameter optimization for free
HPO blog post series
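The “command line -> config” bullet can be sketched with `argparse` alone; `parse_config` and the parameter names here are hypothetical illustrations, not the talk's actual tooling:

```python
import argparse

def parse_config(argv=None):
    """Declare every hyperparameter once, here; nothing stays a magic number."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--n_epochs', type=int, default=10)
    # vars() turns the parsed namespace into a plain dict: one config object
    # that can be logged with the experiment and reused to re-run it.
    return vars(parser.parse_args(argv))

config = parse_config(['--lr', '0.001'])
```

Because the config is a plain dict of named parameters, plugging in a hyperparameter optimization library later means sampling over the same keys, which is the “for free” part.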
Environment:
● Reproducible environment, preferably in config
● Good options:
○ Docker
○ Conda
○ Makefiles
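“Environment in config” can be as small as a conda file checked into the repo; the project name and version pins below are illustrative only:

```yaml
# environment.yml -- minimal sketch; pin versions so a re-run months
# later resolves to the same environment.
name: my-project          # hypothetical project name
dependencies:
  - python=3.7
  - pandas=0.25.3
  - scikit-learn=0.22
  - pip
```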
Metrics:
● Good validation >> insert smth
● Always be logging
● The more metrics the better
● Track training/validation/test errors to estimate generalization
score = evaluate(model, valid_data)
exp.log('valid_auc', score)
youtube link
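The `exp` object in the snippet above comes from whatever tracking tool you use; as a minimal stdlib-only sketch of what it has to do (the `Experiment` class here is illustrative, not any particular library's API):

```python
from collections import defaultdict

class Experiment:
    """Tiny framework-agnostic tracker mimicking the exp.log(...) calls above."""
    def __init__(self):
        self.channels = defaultdict(list)

    def log(self, name, value):
        # Append-only channels: every logged value is kept, never overwritten.
        self.channels[name].append(value)

exp = Experiment()
for epoch in range(3):
    # Stand-in numbers in place of real training; always be logging both curves.
    train_err, valid_err = 0.5 / (epoch + 1), 0.6 / (epoch + 1)
    exp.log('train_error', train_err)
    exp.log('valid_error', valid_err)

# Comparing the two curves estimates the generalization gap.
gap = exp.channels['valid_error'][-1] - exp.channels['train_error'][-1]
```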
Data version:
● Storage is cheap(ish) >> keep old versions
● Log data path
● Log data hash
train = pd.read_csv(TRAIN_PATH)
exp.log('data_path', TRAIN_PATH)
md5 = md5_from_file(TRAIN_PATH)
exp.log('data_version', md5)
Results exploration:
● Confusion matrix heatmap
● Predictions distributions
● Best/worst predictions
● In-train predictions
dist_fig = plot_dist(predictions)
exp.log('figure', dist_fig)

for pred_img in predictions:
    exp.log('image_preds', pred_img)
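The counts behind a confusion-matrix heatmap can be built with the standard library alone; `confusion_counts` and the toy labels below are illustrative, not from the talk:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Counts per (true, predicted) pair -- the numbers behind the heatmap."""
    return Counter(zip(y_true, y_pred))

y_true = ['cat', 'cat', 'dog', 'dog', 'dog']
y_pred = ['cat', 'dog', 'dog', 'dog', 'cat']
cm = confusion_counts(y_true, y_pred)
# Off-diagonal cells such as ('dog', 'cat') are exactly the
# "worst predictions" worth pulling out and looking at by hand.
```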
Organize
Work in creative iterations
time, budget, business_goal = business_specification()
creative_idea = initial_research(business_goal)
while time and budget and not business_goal:
    solution = develop(creative_idea)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()
Why explore results first?
Explore current results → Identify problems → Prioritize problems → Implement new idea that solves the problem
Why explore results first?

Good:
Explore current results → Identify problems → Prioritize problems → Implement new idea that solves the problem

Bad:
Find awesome ideas on Twitter/Medium/Conf → Choose the coolest method → Implement it → Hope it solves all the problems
How to explore results?
● Internal
○ Diagnostic charts
○ Model comparisons
○ Permutation importance
○ Worst/best predictions
○ Shap values/eli5
● External
○ User/stakeholder feedback
Tools: Yellowbrick, scikit-plot, Altair, shap, eli5, Flask, BentoML
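Permutation importance from the list above can be sketched in pure Python: shuffle one feature column at a time and record how much the metric drops. All names here (`permutation_importance`, the toy model, the data) are hypothetical illustrations:

```python
import random

def permutation_importance(predict, X, y, metric, n_features, seed=0):
    """Metric drop after shuffling each column; a bigger drop means the feature matters more."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(n_features):
        shuffled_col = [row[j] for row in X]
        rng.shuffle(shuffled_col)
        # Rebuild X with column j replaced by its shuffled version.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled_col)]
        importances.append(base - metric(y, [predict(row) for row in X_perm]))
    return importances

# Toy model that only ever looks at feature 0, so feature 1 should score ~0.
predict = lambda row: 1 if row[0] > 0 else 0
accuracy = lambda y, p: sum(a == b for a, b in zip(y, p)) / len(y)
X = [[1, 5], [-1, 5], [1, -5], [-1, -5], [1, 0], [-1, 0]]
y = [1, 0, 1, 0, 1, 0]
imp = permutation_importance(predict, X, y, accuracy, n_features=2)
```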
Why explore results at all?
● Know where your model fails
● Validate whether your model improved where you wanted it to
● Formulate your next steps
● Cherry-pick good/bad/funny results
Connecting the dots
time, budget, business_goal = business_specification()
creative_idea = initial_research(business_goal)
while time and budget and not business_goal:
    solution = develop(creative_idea, training_data)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()
● Tag with creative idea
● Log train and valid data versions
● Log metrics
● Log model and valid predictions
● Version results exploration notebook
● Version code (.git)
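Taken together, each run in the loop above can be captured as one record; every id, hash, and file name below is a placeholder, not real project data:

```python
# One experiment = one record bundling the bullets above:
# idea tag, code version, data versions, params, metrics, artifacts.
experiments = [
    {
        'tag': 'baseline',                        # the creative idea being tested
        'code_version': '<git sha>',              # e.g. from `git rev-parse HEAD`
        'data_version': {'train': '<md5>', 'valid': '<md5>'},
        'params': {'lr': 0.01},
        'metrics': {'valid_auc': 0.78},
        'artifacts': ['model.pkl', 'valid_predictions.csv', 'exploration.ipynb'],
    },
    {
        'tag': 'heavier-augmentation',
        'code_version': '<git sha>',
        'data_version': {'train': '<md5>', 'valid': '<md5>'},
        'params': {'lr': 0.001},
        'metrics': {'valid_auc': 0.84},
        'artifacts': ['model.pkl', 'valid_predictions.csv', 'exploration.ipynb'],
    },
]

# With everything in one place, comparing runs is a one-liner.
best = max(experiments, key=lambda e: e['metrics']['valid_auc'])
```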
Experiments are organized
Exploratory analysis is versioned
Collaborate
Central hub facilitates collaboration
Central Data Science project hub
Work is accessible
● Slides link on Twitter @NeptuneML and LinkedIn @neptune.ml
● My blog post on experiment management
● Example project with experiment management:
○ Code
○ Parameters
○ Environment
○ Data
○ … more
Materials
Data science collaboration hub
Track | Organize | Collaborate
kuba@neptune.ml
@NeptuneML
https://medium.com/neptune-ml
Jakub Czakon
