Why Data Science is not Software Development
and
How experiment management helps bring
science back to data science
kuba@neptune.ml
@NeptuneML
https://medium.com/neptune-ml
Jakub Czakon
● Why Data Science is not Software Development
● What is Experiment Management and how to use it
Agenda
Data Science vs Software Development
Development:
● Feature scoping
● Extensive testing
● Code review and refactoring
● Extensive monitoring
● Exception handling
● Almost entire codebase used
● ...

Science:
● Explore
● Iteratively try ideas
● Communicate/question/analyse results
● Almost entire codebase dropped
● ...
Data Science vs Software Development
“Render unto Software Development the things that are Software Development’s, and unto Data Science the things that are Data Science’s.”
Data Science Project

Development:
● Data access/preprocessing
● Feature extraction
● Model inference pipelines/REST API
● Exception handling
● Results monitoring
● Model retraining
● Resource provisioning
● ...

Science:
● Data exploration
● Hypothesis testing
● Feature prototyping/development
● Model prototyping/development
● Pipeline comparison
● Results exploration
● Problem understanding
● ...
Data Science Project

Development:
● When you know what to do
● When things will be done more than once
● ...

Science:
● When you don’t yet know
● When things could end up being done just once
● ...
Managing Data Science just like
Software Development is wrong
link
Managing Data Science just like
Software Development is wrong
Infamous quotes:
● “What insights will you be able to derive from this?”
● “How long will the data exploration take?”
● “When can we expect to have a working model?”
● “When will we be able to improve by 20%?”
● “What is the MVP accuracy for this problem?” “100%.”
“What is your current human accuracy?” “I don’t know, maybe 70%.”
Data Science is not just
machine learning
link to Quora
Data Science is
really close to business
[Diagram: a linear release history (release-0 → release-1 → release-2) next to a branching tree of experiments (exp-1.0, exp-1.1, exp-1.1.1, exp-1.1.2, exp-1.1.2.1, exp-1.1.3, exp-2.0, exp-2.1, exp-2.1.0, exp-2.1.1)]
development → .git
science → .how?
Data science projects are
experimental by design
In data science using just .git
makes keeping track of things .hard
[Diagram repeated: the linear release history (release-0 → release-2) fits .git, but the branching experiment tree (exp-1.0 … exp-2.1.1) is marked with a “?”]
To re-run you need more than just code
Experiment management
“Experiment management is a process of tracking experiment
metadata, organizing it in a meaningful way, and making it available
to access and collaborate on within your organization.” – me
What is experiment management?
No framework was hurt during
the production of this talk...
...but hopefully some will be after.
Track
Code:
● Version scripts
● Version notebooks:
○ nbdime, jupytext, neptune-notebooks
● Magic numbers -> hyperparameters
● Make sure your notebook runs top-to-bottom
jupyter nbconvert --to script nb.ipynb; python nb.py
Hyperparameters:
● Everything goes into config
● If passed via command line -> automagically goes into config
● If passed via a script -> automagically goes into config
Bonus: hyperparameter optimization for free
HPO blog post series
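The “command line -> config” bullet can be sketched with `argparse` alone; `parse_config` and the parameter names here are hypothetical illustrations, not the talk's actual tooling:

```python
import argparse

def parse_config(argv=None):
    """Declare every hyperparameter once, here; nothing stays a magic number."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--n_epochs', type=int, default=10)
    # vars() turns the parsed namespace into a plain dict: one config object
    # that can be logged with the experiment and reused to re-run it.
    return vars(parser.parse_args(argv))

config = parse_config(['--lr', '0.001'])
```

Because the config is a plain dict of named parameters, plugging in a hyperparameter optimization library later means sampling over the same keys, which is the “for free” part.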
Environment:
● Reproducible environment, preferably in config
● Good options:
○ Docker
○ Conda
○ Makefiles
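“Environment in config” can be as small as a conda file checked into the repo; the project name and version pins below are illustrative only:

```yaml
# environment.yml -- minimal sketch; pin versions so a re-run months
# later resolves to the same environment.
name: my-project          # hypothetical project name
dependencies:
  - python=3.7
  - pandas=0.25.3
  - scikit-learn=0.22
  - pip
```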
Metrics:
● Good validation >> insert smth
● Always be logging
● The more metrics the better
● Track training/validation/test errors to estimate generalization
score = evaluate(model, valid_data)
exp.log('valid_auc', score)
youtube link
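The `exp` object in the snippet above comes from whatever tracking tool you use; as a minimal stdlib-only sketch of what it has to do (the `Experiment` class here is illustrative, not any particular library's API):

```python
from collections import defaultdict

class Experiment:
    """Tiny framework-agnostic tracker mimicking the exp.log(...) calls above."""
    def __init__(self):
        self.channels = defaultdict(list)

    def log(self, name, value):
        # Append-only channels: every logged value is kept, never overwritten.
        self.channels[name].append(value)

exp = Experiment()
for epoch in range(3):
    # Stand-in numbers in place of real training; always be logging both curves.
    train_err, valid_err = 0.5 / (epoch + 1), 0.6 / (epoch + 1)
    exp.log('train_error', train_err)
    exp.log('valid_error', valid_err)

# Comparing the two curves estimates the generalization gap.
gap = exp.channels['valid_error'][-1] - exp.channels['train_error'][-1]
```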
Data version:
● Storage is cheap(ish) >> keep old versions
● Log data path
● Log data hash
train = pd.read_csv(TRAIN_PATH)
exp.log('data_path', TRAIN_PATH)
md5 = md5_from_file(TRAIN_PATH)
exp.log('data_version', md5)
Results exploration:
● Confusion matrix heatmap
● Predictions distributions
● Best/worst predictions
● In-train predictions
dist_fig = plot_dist(predictions)
exp.log('figure', dist_fig)

for pred_img in predictions:
    exp.log('image_preds', pred_img)
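The counts behind a confusion-matrix heatmap can be built with the standard library alone; `confusion_counts` and the toy labels below are illustrative, not from the talk:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Counts per (true, predicted) pair -- the numbers behind the heatmap."""
    return Counter(zip(y_true, y_pred))

y_true = ['cat', 'cat', 'dog', 'dog', 'dog']
y_pred = ['cat', 'dog', 'dog', 'dog', 'cat']
cm = confusion_counts(y_true, y_pred)
# Off-diagonal cells such as ('dog', 'cat') are exactly the
# "worst predictions" worth pulling out and looking at by hand.
```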
Organize
Work in creative iterations
time, budget, business_goal = business_specification()
creative_idea = initial_research(business_goal)
while time and budget and not business_goal:
    solution = develop(creative_idea)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()
Why explore results first?
Explore current results → Identify problems → Prioritize problems → Implement new idea that solves the problem
Why explore results first?

Good:
Explore current results → Identify problems → Prioritize problems → Implement new idea that solves the problem

Bad:
Find awesome ideas on Twitter/Medium/Conf → Choose the coolest method → Implement it → Hope it solves all the problems
How to explore results?
● Internal
○ Diagnostic charts
○ Model comparisons
○ Permutation importance
○ Worst/best predictions
○ Shap values/eli5
● External
○ User/stakeholder feedback
Tools: Yellowbrick, scikit-plot, Altair, shap, eli5, Flask, BentoML
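Permutation importance from the list above can be sketched in pure Python: shuffle one feature column at a time and record how much the metric drops. All names here (`permutation_importance`, the toy model, the data) are hypothetical illustrations:

```python
import random

def permutation_importance(predict, X, y, metric, n_features, seed=0):
    """Metric drop after shuffling each column; a bigger drop means the feature matters more."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(n_features):
        shuffled_col = [row[j] for row in X]
        rng.shuffle(shuffled_col)
        # Rebuild X with column j replaced by its shuffled version.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled_col)]
        importances.append(base - metric(y, [predict(row) for row in X_perm]))
    return importances

# Toy model that only ever looks at feature 0, so feature 1 should score ~0.
predict = lambda row: 1 if row[0] > 0 else 0
accuracy = lambda y, p: sum(a == b for a, b in zip(y, p)) / len(y)
X = [[1, 5], [-1, 5], [1, -5], [-1, -5], [1, 0], [-1, 0]]
y = [1, 0, 1, 0, 1, 0]
imp = permutation_importance(predict, X, y, accuracy, n_features=2)
```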
Why explore results at all?
● Know where your model fails
● Validate whether your model improved where you wanted it to
● Formulate your next steps
● Cherry-pick good/bad/funny results
Connecting the dots
time, budget, business_goal = business_specification()
creative_idea = initial_research(business_goal)
while time and budget and not business_goal:
    solution = develop(creative_idea, training_data)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()
● Tag with creative idea
● Log train and valid data versions
● Log metrics
● Log model and valid predictions
● Version results exploration notebook
● Version code (.git)
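Taken together, each run in the loop above can be captured as one record; every id, hash, and file name below is a placeholder, not real project data:

```python
# One experiment = one record bundling the bullets above:
# idea tag, code version, data versions, params, metrics, artifacts.
experiments = [
    {
        'tag': 'baseline',                        # the creative idea being tested
        'code_version': '<git sha>',              # e.g. from `git rev-parse HEAD`
        'data_version': {'train': '<md5>', 'valid': '<md5>'},
        'params': {'lr': 0.01},
        'metrics': {'valid_auc': 0.78},
        'artifacts': ['model.pkl', 'valid_predictions.csv', 'exploration.ipynb'],
    },
    {
        'tag': 'heavier-augmentation',
        'code_version': '<git sha>',
        'data_version': {'train': '<md5>', 'valid': '<md5>'},
        'params': {'lr': 0.001},
        'metrics': {'valid_auc': 0.84},
        'artifacts': ['model.pkl', 'valid_predictions.csv', 'exploration.ipynb'],
    },
]

# With everything in one place, comparing runs is a one-liner.
best = max(experiments, key=lambda e: e['metrics']['valid_auc'])
```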
Experiments are organized
Exploratory analysis is versioned
Collaborate
Central hub facilitates collaboration
Central Data Science project hub
Work is accessible
● Slides link on Twitter @NeptuneML and LinkedIn @neptune.ml
● My blog post on experiment management
● Example project with experiment management:
○ Code
○ Parameters
○ Environment
○ Data
○ … more
Materials
Data science collaboration hub
Track | Organize | Collaborate
kuba@neptune.ml
@NeptuneML
https://medium.com/neptune-ml
Jakub Czakon
