(Py)Testing the Limits of Machine Learning
Rebecca Bilbro ⩓ Daniel Sollis ⩓ Patrick Deziel
01. Introduction: Why test ML?
02. DIY Testing API: Building blocks of a good ML test suite
03. Non-Determinism: Keeping your head when the models act up
04. Experiment with Care: ML diagnostics for experimental robustness
05. Conclusion: Level up your ML game with these testing tips & tricks
Why test ML?
01
Do we need to test ML code?
“Testing is for software, not data science.”
“It’s a waste of time to test experimental research code.”
“We follow hypothesis-driven development, not test-driven development.”
Can we test ML code?
“Machine learning algorithms are non-deterministic, so there’s no way to test them.”
“Our Jupyter notebooks don’t support test runners.”
“Machine learning has too many parameters to test them all.”
Bottom Line
If it’s going into a product, it needs to be tested.
Building blocks of a good ML test suite
02
Estimators and Transformers
Inheriting from scikit-learn's base classes (BaseEstimator, TransformerMixin) lets you override existing methods such as fit(), transform(), and predict().
It also lets you handle different models and transformations in scikit-learn through one general interface.
As a result, pipelines can be used consistently across both preprocessing and modeling.
Transformer: fit(X, y), transform(X) → X′
Estimator: fit(X, y), predict(X) → ŷ
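As a minimal sketch of the inheritance idea above, the following defines a custom transformer on top of scikit-learn's BaseEstimator and TransformerMixin. The class name TextLengthTransformer and the toy inputs are hypothetical, chosen only for illustration; they do not come from the slides.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TextLengthTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each document to its character count."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the transformer chainable.
        return self

    def transform(self, X):
        # One row per document, one column holding its length.
        return np.array([[len(doc)] for doc in X])


# fit_transform() is inherited from TransformerMixin.
lengths = TextLengthTransformer().fit_transform(["muffin", "chihuahua"])
```

Because the class follows the fit()/transform() contract, it can be dropped into a Pipeline or FeatureUnion like any built-in transformer.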
Creating a Wrapper
Inheriting & Overloading
[Diagram: a ModelWrapper exposing fit(), transform(), and predict() wraps either a Transformer or an Estimator]
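One way such a wrapper might look is sketched below. The slides only name ModelWrapper and its three methods, so the constructor, the delegation logic, and the toy data here are assumptions rather than the authors' implementation.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


class ModelWrapper(BaseEstimator):
    """Hypothetical wrapper exposing one interface over any sklearn object."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        self.model.fit(X, y)
        return self

    def transform(self, X):
        # Only valid when the wrapped object is a transformer.
        return self.model.transform(X)

    def predict(self, X):
        # Only valid when the wrapped object is an estimator with predict().
        return self.model.predict(X)


# Toy data, invented for this example.
docs = ["muffin muffin", "dog dog", "muffin", "dog"]
labels = ["muffin", "dog", "muffin", "dog"]

vectorizer = ModelWrapper(CountVectorizer()).fit(docs)
X = vectorizer.transform(docs)
classifier = ModelWrapper(MultinomialNB()).fit(X, labels)
predictions = classifier.predict(X)
```

A single wrapper gives the test suite one place to add shared behavior (logging, timing, input checks) regardless of which model sits underneath.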
Pipelines and FeatureUnions
The Pipeline and FeatureUnion classes in scikit-learn let you organize preprocessing and modeling so you can iterate through experiments quickly.
Pipelines chain steps sequentially, which suits simple modeling, while FeatureUnions apply several transformers to the same input in parallel and concatenate their outputs.
With a wrapper class, using these features becomes even easier.
[Diagram: Data Loader → Transformer → Transformer → Estimator, driven by fit() and predict()]
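from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# EssayExtractor is a custom transformer defined elsewhere.
pipeline = Pipeline([
    ('extract_essays', EssayExtractor()),
    ('counts', CountVectorizer()),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
# The final step is a classifier, so call fit() rather than fit_transform().
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)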
Pipeline Example
Create a pipeline that loads data from a file on disk, extracts each instance as an individual essay, then applies text feature extraction before a text classification model.
[Diagram: extract_essays → counts → tf_idf → classifier]
http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
Feature Union
[Diagram: extract_essays → feature_union → classifier; feature_union branches: counts → tf_idf, document, meta, concepts, DictVectorizer]
pipeline = Pipeline([
    ('extract_essays', EssayExtractor()),
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', CountVectorizer()),
            ('tf_idf', TfidfTransformer())
        ])),
        ('essay_length', LengthTransformer()),
        ('misspellings', MispellingCountTransformer())
    ])),
    ('classifier', MultinomialNB())
])
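The pipeline above depends on custom classes that are not shown. A self-contained sketch of the same shape follows, swapping in scikit-learn's FunctionTransformer for the custom length step and omitting the extraction and misspelling steps. The essay_length helper and the toy essays are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def essay_length(essays):
    # Hypothetical feature: one column holding each essay's word count.
    return np.array([[len(essay.split())] for essay in essays])


pipeline = Pipeline([
    ("features", FeatureUnion([
        ("ngram_tf_idf", Pipeline([
            ("counts", CountVectorizer()),
            ("tf_idf", TfidfTransformer()),
        ])),
        ("essay_length", FunctionTransformer(essay_length)),
    ])),
    ("classifier", MultinomialNB()),
])

# Toy data, invented for this example.
essays = [
    "muffins are tasty",
    "dogs bark",
    "tasty blueberry muffins",
    "dogs bark loudly",
]
labels = ["muffin", "dog", "muffin", "dog"]

pipeline.fit(essays, labels)
# The union concatenates the TF-IDF columns with the length column.
features = pipeline.named_steps["features"].transform(essays)
y_pred = pipeline.predict(essays)
```

Each branch of the FeatureUnion sees the same raw essays, and their outputs are stacked side by side before reaching the classifier.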
Coding Style and Enforcement
Part of keeping our standards high is enforcing an agreed-upon coding style and sticking to it.
We use pre-commit in addition to Black to ensure that our repository stays clean and unified across commits.
The Double-Edged Sword of Black
python -m black ./file.py
CI/CD With Jenkins
Using Jenkins for build testing helps keep the whole team on the same page and enforces the team's testing standards.
Automating builds in addition to local testing helps ensure that code works across different environments and machines.
[Diagram: CI/CD flow with stages Push, Pre-Commit, Black, Jenkins, Build/Testing]
Dealing with Non-Determinism
03
Testing an ML Pipeline
● How do we handle non-determinism in our pipeline?
● How do we test multiple parameters in our pipeline?
● How do we handle small variations in our pipeline?
[Figure: Scikit-learn Pipeline]
https://www.freecodecamp.org/news/chihuahua-or-muffin-my-search-for-the-best-computer-vision-api-cbda4d6b425d/
Different Data, Different Results
[Figure: Scikit-learn Pipeline with the Train and Test sets swapped; outputs labeled Muffin and Dog]
Different Executions, Different Results
[Figure: Scikit-learn Pipeline run twice on the same Train/Test split; outputs labeled Muffin and Dog]
Ensuring Reproducibility
● Fixing the random seed can ensure reproducibility across executions of the same code.
● Scikit-learn provides a random_state parameter for each non-deterministic function, which allows the user to fix the random seed.
class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100,),
activation='relu', *, solver='adam', alpha=0.0001, batch_size='auto',
learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200,
shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False,
momentum=0.9, nesterovs_momentum=True, early_stopping=False,
validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08,
n_iter_no_change=10, max_fun=15000)
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
Using random_state
● Our function will now produce the same results on different executions if we pass it the same data.
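A minimal sketch of what fixing random_state could look like in practice is shown below. The function train_and_predict, the seed value, the small network size, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def train_and_predict(X, y, seed=42):
    # The same seed drives both the split and the weight initialization.
    X_train, X_test, y_train, _ = train_test_split(X, y, random_state=seed)
    model = MLPClassifier(
        hidden_layer_sizes=(10,), max_iter=500, random_state=seed
    )
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)


# Synthetic data stands in for the real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
first_run = train_and_predict(X, y)
second_run = train_and_predict(X, y)
```

With the seed fixed, the two runs should yield matching probabilities, which is what makes an assertion on the output meaningful.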
(Py)Testing Our Function
● ML comes with an abundance of options.
● How do we test multiple parameters without turning our test code into spaghetti?
Using pytest.mark.parametrize
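The sketch below shows one way pytest's parametrize marker can run the same test body against several models. The test name, the chosen models, and the synthetic data are hypothetical and stand in for the function under test in the talk.

```python
import pytest
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier


@pytest.mark.parametrize(
    "model",
    [
        SGDClassifier(random_state=38),
        RandomForestClassifier(n_estimators=10, random_state=38),
    ],
)
def test_model_predicts_known_classes(model):
    # Synthetic data keeps the test fast and self-contained.
    X, y = make_classification(n_samples=100, n_features=5, random_state=38)
    model.fit(X, y)
    predictions = model.predict(X)
    # One prediction per sample, drawn only from the training labels.
    assert predictions.shape == y.shape
    assert set(predictions) <= set(y)
```

pytest reports each parameter as its own test case, so a failure points directly at the model that caused it.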
Dealing With Inevitable Variations
● With floating point arithmetic, things can get...strange.
● In order to correctly test ML, we need a better way to compare floating point results.
● We need a method of handling results that are “close enough”.
○ E.g., Training time
Using pytest.approx
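As a small illustration of "close enough" comparisons, the following sketch uses pytest.approx with its default relative tolerance and with an explicit absolute tolerance. The score values are invented for the example.

```python
import pytest


def test_float_sum_is_close_enough():
    total = 0.1 + 0.2
    # Exact equality fails because of binary floating point rounding.
    assert total != 0.3
    # approx() compares within a small default relative tolerance.
    assert total == pytest.approx(0.3)


def test_scores_within_tolerance():
    # Hypothetical cross-validation scores that vary slightly between runs.
    scores = [0.912, 0.908, 0.915]
    # abs=0.01 accepts any score within 0.01 of the expected value.
    assert scores == pytest.approx([0.91, 0.91, 0.91], abs=0.01)
```

The tolerance is part of the test's contract, so it is worth choosing it deliberately rather than loosening it until the test passes.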
Diagnostics for Machine Learning
04
Engineering vs. Experimentation
What if it’s a false dichotomy?
[Diagram: Data Loader → Transformer(s) → Feature Visualization, using fit(), transform(), draw()]
[Diagram: Data Loader → Transformer(s) → Estimator → Evaluation Visualization, using fit(), predict(), score(), draw()]
The Yellowbrick API
[Figure: images labeled dog and muffin]
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassificationReport
from sklearn.model_selection import train_test_split as tts

def muffins_or_dogs(X, y, model, classes=["dog", "muffin"]):
    fig, ax = plt.subplots()
    # Fix the seed so the split is identical on every run.
    X_train, X_test, y_train, y_test = tts(X, y, random_state=38)
    visualizer = ClassificationReport(
        model, classes=classes, cmap="Greys", ax=ax,
        support=True, show=False
    )
    visualizer.fit(X_train, y_train)
    score = visualizer.score(X_test, y_test)
    # Save the report under the wrapped model's class name, e.g. "SGDClassifier.png".
    image_path = visualizer.estimator.__class__.__name__ + ".png"
    visualizer.show(outpath=image_path)
    return visualizer.estimator.predict(X_test)
Tips & Tricks
Leverage an ML API: Systematize tests by wrapping open source ML frameworks
Pipeline ML Steps: Chain ML steps to support accuracy & reproducibility
Drill into Fuzziness: Use parameterization & approximation to deal with non-determinism
Embrace Consistency: Adopt a team-wide coding style to facilitate collaboration
Befriend Small Robots: CI/CD helps flag test regressions & dependency changes
Experiment with Care: Use diagnostic tools that don’t interfere with testability
Thank you!
Template by SlidesGo
Icons by Flaticon
Images by Freepik
