Working on data science projects that are run as if they were software development can sometimes feel like trying to fit a square peg into a round hole. In this talk, I will explain why that happens and what people do to fix it. Lately, in the context of machine learning, the concept of experiment management, which treats ML experiments as first-class citizens, has been gaining a lot of traction. I will discuss what it is, what the benefits of using it are, and how you can apply it in your work to run your projects more efficiently.
Data science is not Software Development and how Experiment Management can make things better.
1. Why Data Science is not Software Development
and
How experiment management helps bring
science back to data science
kuba@neptune.ml
@NeptuneML
https://medium.com/neptune-ml
Jakub Czakon
2. ● Why Data Science is not Software Development
● What is Experiment Management and how to use it
Agenda
4. Data Science vs Software Development

Development:
● Feature scoping
● Extensive testing
● Code review and refactoring
● Extensive monitoring
● Exception handling
● Almost entire codebase used
● ...

Science:
● Explore
● Iteratively try ideas
● Communicate/question/analyse results
● Almost entire codebase dropped
● ...
5. “Render unto Software Development the things that are Software Development’s, and unto Data Science the things that are Data Science’s”
6. Data Science Project

Development:
● Data access/preprocessing
● Feature extraction
● Model inference pipelines/REST API
● Exception handling
● Results monitoring
● Model retraining
● Resource provisioning
● ...

Science:
● Data exploration
● Hypothesis testing
● Feature prototyping/development
● Model prototyping/development
● Pipeline comparison
● Results exploration
● Problem understanding
● ...
7. Data Science Project

Development:
● When you know what to do
● When things will be done more than once
● ...

Science:
● When you don’t yet know
● When things could end up being done just once
● ...
9. Managing Data Science just like
Software Development is wrong

Infamous quotes:
● “What insights will you be able to derive from this?”
● “How long will the data exploration take?”
● “When can we expect to have a working model?”
● “When will we be able to improve by 20%?”
● “What is the MVP accuracy for this problem? 100%?”
  “What is your current human accuracy?” “I don’t know, maybe 70%.”
13. In data science using just .git
makes keeping track of things .hard
[Branch diagram: release branches release-0, release-1, release-2, surrounded by a sprawl of experiment branches (exp-1.0, exp-1.1, exp-1.1.1, exp-1.1.2, exp-1.1.2.1, exp-1.1.3, exp-2.0, exp-2.1, exp-2.1.0, exp-2.1.1) and a “?” marking where the next one goes.]
16. “Experiment management is a process of tracking experiment
metadata, organizing it in a meaningful way, and making it available
to access and collaborate on within your organization.”- me
What is experiment management?
17. No framework was hurt during
the production of this talk...
...but hopefully some will be
after
19. Code:
● Version scripts
● Version notebooks:
○ nbdime, jupytext, neptune-notebooks
● Magic numbers -> hyperparameters
● Make sure your notebook runs top-to-bottom
jupyter nbconvert --to script nb.ipynb; python nb.py
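The “magic numbers -> hyperparameters” bullet amounts to pulling every hard-coded constant into one named config that can be logged per experiment. A minimal sketch, with illustrative names and values:

```python
# Before: a magic number buried in the code
# label = int(score > 0.62)

# After: every tunable value lives in one named config
HYPERPARAMS = {
    'n_estimators': 100,
    'max_depth': 8,
    'threshold': 0.62,
}

def predict_label(score, params=HYPERPARAMS):
    # the cut-off is now a tracked hyperparameter, not an anonymous constant
    return int(score > params['threshold'])
```

Once the constants live in one dict, the whole dict can be handed to the experiment tracker in a single call.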
20. Hyperparameters:
● Everything goes into config
● If passed via command line -> automagically goes into config
● If passed via script -> automagically goes into config
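One way to get the “passed via command line -> automagically goes into config” behaviour is to turn the parsed arguments straight into a plain dict, so nothing passed on the command line can escape being recorded. A sketch with made-up argument names:

```python
import argparse

def parse_config(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--epochs', type=int, default=10)
    args = parser.parse_args(argv)
    # vars() turns the Namespace into a plain dict: this *is* the config
    return vars(args)

config = parse_config(['--lr', '0.001'])
# exp.log('hyperparameters', config)  # hand the whole dict to your tracker
```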
23. Metrics:
● Good validation >> insert smth
● Always be logging
● The more metrics the better
● Track training/validation/test errors to estimate generalization

score = evaluate(model, valid_data)
exp.log('valid_auc', score)
24. Data version:
● Storage is cheap(ish) >> keep old versions
● Log data path
● Log data hash

train = pd.read_csv(TRAIN_PATH)
exp.log('data_path', TRAIN_PATH)

md5 = md5_from_file(TRAIN_PATH)
exp.log('data_version', md5)
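`md5_from_file` is not spelled out on the slide; a minimal version with `hashlib` could look like this, reading in chunks so large training files never need to fit in memory (the 1 MB chunk size is an arbitrary choice):

```python
import hashlib

def md5_from_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so large datasets never load fully into RAM."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```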
27. Work in creative iterations

time, budget, business_goal = business_specification()
creative_idea = initial_research(business_goal)

while time and budget and not business_goal:
    solution = develop(creative_idea)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()
28. Why explore results first?

Explore current results → Identify problems → Prioritize problems →
Implement new idea that solves the problem
29. Why explore results first?

Good:
Explore current results → Identify problems → Prioritize problems →
Implement new idea that solves the problem

Bad:
Find awesome ideas on Twitter/Medium/Conf → Choose the coolest method →
Hope it solves all the problems → Implement it
31. Why explore results at all?
● Know where your model fails
● Validate whether your model improved
where you wanted it to
● Formulate your next steps
● Cherry-pick good/bad/funny results
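“Know where your model fails” usually starts with ranking validation examples by their individual error and inspecting the worst offenders first. A minimal sketch; the helper name and the toy data are illustrative:

```python
def worst_examples(examples, y_true, y_pred, k=3):
    """Return the k examples with the largest absolute error, worst first."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    ranked = sorted(zip(errors, examples), key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in ranked[:k]]

y_true = [1, 0, 1, 0, 1]
y_pred = [0.9, 0.8, 0.1, 0.1, 0.95]
# 'c' is missed badly (true 1, predicted 0.1), 'b' is a confident false positive
print(worst_examples(['a', 'b', 'c', 'd', 'e'], y_true, y_pred, k=2))  # ['c', 'b']
```

The same ranking also gives you the cherry-picked good/bad/funny results for the next stakeholder meeting.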
32. Connecting the dots

time, budget, business_goal = business_specification()
creative_idea = initial_research(business_goal)

while time and budget and not business_goal:
    solution = develop(creative_idea, training_data)
    metrics = evaluate(solution, validation_data)
    if metrics > best_metrics:
        best_metrics = metrics
        best_solution = solution
    creative_idea = explore_results(best_solution)
    time.update()
    budget.update()

● Tag with creative idea
● Log train and valid data versions
● Log metrics
● Log model and valid predictions
● Version results exploration notebook
● Version code with .git
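Taken together, the checklist next to the loop amounts to a few log calls per iteration. A sketch with a toy in-memory tracker; the `Experiment` class here is made up, but real trackers expose similar tag/metric/artifact calls:

```python
class Experiment:
    """Toy stand-in for an experiment tracker: everything lands in one dict."""
    def __init__(self, tags=()):
        self.record = {'tags': list(tags), 'metrics': {}, 'artifacts': {}}

    def log(self, key, value):
        self.record['metrics'][key] = value

    def log_artifact(self, name, path):
        self.record['artifacts'][name] = path

# one experiment per creative idea, tagged so runs stay searchable
exp = Experiment(tags=['remove-duplicates-idea'])
exp.log('train_data_version', 'md5:3f2a...')      # illustrative hash value
exp.log('valid_auc', 0.87)                        # illustrative metric value
exp.log_artifact('valid_predictions', 'preds/exp-2.1.csv')  # illustrative path
```

With every run recorded this way, comparing experiments stops depending on memory and branch names.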
39. ● Slides link on Twitter @NeptuneML and LinkedIn @neptune.ml
● My blog post on experiment management
● Example project with experiment management:
○ Code
○ Parameters
○ Environment
○ Data
○ … more
Materials
40. Data science collaboration hub
Track | Organize | Collaborate
kuba@neptune.ml
@NeptuneML
https://medium.com/neptune-ml
Jakub Czakon