Do Your Homework!
Writing Tests for Data Science and Natural Language Processing
David Waterman
github.com/drwaterman/pydatadctesting
Agenda
1. Why test your code?
2. Our problem: text analysis, Natural Language Processing
3. Practical Implementation
How Does Homework Work?
Why give students homework problems and the matching solutions?
If you know where you are going, you can tell if you are moving in the right direction.
Many of our daily tasks are like homework problems.
Test-Driven Development Cycle
Why Test Your Code?
For others:
• Reduces bugs
• Improves feedback loops
• Speeds up iteration
• Makes code more reusable
• Increases confidence in the system
For you:
• Earns trust and confidence in you and your work
• Earns respect from devs and engineers (for whom it’s not optional)
• Allows you to submit to open source projects
• Probably required for acceptance into an external code base
Agenda
1. Why test your code?
2. Our problem: text analysis, Natural Language Processing
3. Practical Implementation
Supply Chain Insight
Who makes what?
Identifying Relationships in Text
Challenges
The structure of text is domain specific. Regular people don’t talk like this:
• Per the long-term agreement, WABCO will supply its single-piston air disc brake (ADB) technology, MAXX, for the manufacturing of Hyundai’s new medium-duty trucks, which are expected to start from August 2019.
• Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.
• Dana Corp. has begun supplying Ford Motor Co. with its thermoplastic cylinder-head-cover modules for the automaker's 3.0-liter Duratec V-6 engine.
Our Approach: “Gold” Tests
• A human reads the text
• Identifies the relationship in the text
• Puts the relationship into a machine-friendly format (JSON, YAML)
• Writes a test for the relationship
• Writes and rewrites code until the test passes
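A minimal sketch of what one gold test might look like, using the Timken sentence from the Challenges slide. The JSON field names (`supplier`, `customer`, `part`) and the `extract_relationship` function are hypothetical stand-ins, not the code in the talk's repo, and the extractor is a deliberately naive regular expression that handles only this one sentence pattern.

```python
import json
import re

# Gold record: a human read the sentence and wrote down the relationship.
# The field names are illustrative, not taken from the talk's repo.
GOLD_JSON = """
{
  "text": "Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.",
  "expected": {
    "supplier": "Timken",
    "customer": "Volkswagen Transmission",
    "part": "needle roller bearings"
  }
}
"""

# Naive pattern for "<supplier> has become the sole supplier of <part> to <customer>."
PATTERN = re.compile(
    r"^(?P<supplier>.+?) has become the sole supplier of (?P<part>.+?) to (?P<customer>.+?)\.$"
)


def extract_relationship(text):
    """Hypothetical extractor: return a dict of roles, or None if nothing matches."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


def test_gold_timken():
    # The test compares the extractor's output with the human-written answer.
    gold = json.loads(GOLD_JSON)
    assert extract_relationship(gold["text"]) == gold["expected"]
```

Pytest collects any function named `test_*`, so this file needs no pytest import. Each new gold record would add one more case for the extractor to satisfy.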
Agenda
1. Why test your code?
2. Our problem: text analysis, Natural Language Processing
3. Practical Implementation
Recommendation: Pytest as your framework
➕ More Pythonic
➕ Easy to write fast, with less boilerplate
➕ Can still run unittest, doctest, and (before pytest 8) nose tests
➕ Readable, pretty output (including HTML reports via the pytest-html plugin)
➕ Great documentation & guides
➖ It’s not part of the standard library
What to Test
Code
• Expected output
• Invalid input
• Edge cases
Data
• Data is valid
• Types are correct
• Missing values are handled correctly
• Format is correct
Models
• Produces expected results
• Can be used to benchmark
• Monitor for model drift
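A rough sketch of one or two checks per column. Everything here is illustrative: `normalize_company`, `predict_is_supply_sentence`, the record fields, the two negative sentences, and the 0.75 accuracy threshold are assumptions made up for the example, and the "model" is only a keyword rule so the sketch can run on its own.

```python
def normalize_company(name):
    """Hypothetical helper: collapse whitespace and drop a trailing 'Corp.' or 'Co.'."""
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    cleaned = " ".join(name.split())
    for suffix in (" Corp.", " Co."):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
    return cleaned


# Toy records; the second one has a missing value on purpose.
RECORDS = [
    {"supplier": "Timken", "customer": "Volkswagen Transmission", "part": "needle roller bearings"},
    {"supplier": "WABCO", "customer": "Hyundai", "part": None},
]

# Toy labeled sentences; the two negatives are invented for the example.
LABELED = [
    ("Timken has become the sole supplier of needle roller bearings.", True),
    ("Dana Corp. has begun supplying Ford Motor Co.", True),
    ("The company reported quarterly earnings on Tuesday.", False),
    ("The plant will close for maintenance in July.", False),
]


def predict_is_supply_sentence(text):
    """Stand-in model: a keyword rule, used only to make the sketch runnable."""
    return "suppl" in text.lower()


# --- Code: expected output, invalid input, edge cases ---
def test_expected_output():
    assert normalize_company("Dana Corp.") == "Dana"
    assert normalize_company("Ford  Motor Co.") == "Ford Motor"


def test_invalid_input():
    # With pytest imported, `with pytest.raises(TypeError):` is the usual idiom.
    try:
        normalize_company(None)
    except TypeError:
        return
    raise AssertionError("expected TypeError for non-string input")


def test_edge_case_empty_string():
    assert normalize_company("") == ""


# --- Data: types are correct, missing values are explicit ---
def test_types_are_correct():
    for record in RECORDS:
        assert isinstance(record["supplier"], str)
        assert isinstance(record["customer"], str)


def test_missing_values_are_explicit():
    for record in RECORDS:
        assert record["part"] is None or isinstance(record["part"], str)


# --- Models: benchmark against labeled examples ---
def test_model_meets_benchmark():
    correct = sum(predict_is_supply_sentence(text) == label for text, label in LABELED)
    assert correct / len(LABELED) >= 0.75
```

Re-running the benchmark test on freshly labeled examples over time is one simple way to notice model drift: the threshold stays fixed while the data changes.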
EXAMPLES
Repo: https://github.com/drwaterman/pydatadctesting
Pytest features useful for Data Science
▪ Fixtures – For when you need something repeatedly over multiple tests (loading test data, making a connection, preprocessing data)
▪ Skip and Xfail – For when you know what to test for but the code doesn’t pass yet
▪ Comparing images/plots – Available in matplotlib
▪ Benchmarking a model
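The first two features might be used roughly as follows. This is a sketch, not code from the talk's repo: `load_gold_records`, `find_supplier`, the inline JSON, and the passive-voice sentence are hypothetical, while `pytest.fixture`, `pytest.mark.xfail`, and `pytest.mark.skip` are the standard pytest decorators.

```python
import json

import pytest

# Inline stand-in for a gold data file (illustrative content).
GOLD_JSON = (
    '[{"text": "Timken has become the sole supplier of needle roller '
    'bearings to Volkswagen Transmission.", "supplier": "Timken"}]'
)


def load_gold_records():
    """Parse the gold data; a real suite would likely read a file instead."""
    return json.loads(GOLD_JSON)


@pytest.fixture(scope="module")
def gold_records():
    # Loaded once per test module and shared by every test that asks for it.
    return load_gold_records()


def find_supplier(text):
    """Hypothetical extractor: treat the text before ' has ' as the supplier."""
    head, sep, _ = text.partition(" has ")
    return head if sep else None


def test_supplier_matches_gold(gold_records):
    # pytest passes the fixture's return value in by matching the argument name.
    for record in gold_records:
        assert find_supplier(record["text"]) == record["supplier"]


@pytest.mark.xfail(reason="passive-voice sentences are not handled yet")
def test_passive_voice():
    # Known gap: reported as an expected failure rather than breaking the run.
    text = "Needle roller bearings are supplied to Volkswagen by Timken."
    assert find_supplier(text) == "Timken"


@pytest.mark.skip(reason="needs a database connection that is not set up here")
def test_supplier_saved_to_database():
    ...
```

An `xfail` test still runs, so pytest reports it as unexpectedly passing once the code catches up; a `skip` test is not run at all.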
Some Nice Pytest Options
Save your pytest command line arguments in a shell script
pytest --html=test-logs/testreport.html --self-contained-html --cov=my_module --cov-report term-missing -r aPp test
▪ --html: Where to save the HTML test report (requires the pytest-html plugin)
▪ --self-contained-html: Save everything in one HTML file (no external CSS, etc.)
▪ --cov=: Which modules to include in the coverage report (requires the pytest-cov plugin)
▪ --cov-report term-missing: Terminal report with the line numbers that are not covered
▪ -r aPp: Display a test results summary at the end (a: everything except passed, P: passed with captured output, p: passed)
▪ test: The location in which to run the tests
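The slide suggests a shell script; as a swapped-in alternative, the same arguments can be kept in a small Python runner that calls `pytest.main`. The `run_tests` name and the idea of a runner file are assumptions for this sketch, and the options still require the pytest-html and pytest-cov plugins to be installed.

```python
# The same options as the command line above, one argument per list element.
PYTEST_ARGS = [
    "--html=test-logs/testreport.html",  # pytest-html: where to write the report
    "--self-contained-html",             # pytest-html: single file, no external CSS
    "--cov=my_module",                   # pytest-cov: module to measure
    "--cov-report", "term-missing",      # pytest-cov: show uncovered line numbers
    "-r", "aPp",                         # summary of test results at the end
    "test",                              # directory containing the tests
]


def run_tests(extra_args=()):
    """Run pytest with the saved options plus any extras; return its exit code."""
    import pytest  # imported lazily so the list can be reused without pytest

    return pytest.main(PYTEST_ARGS + list(extra_args))
```

Saved as, say, `run_tests.py` with `raise SystemExit(run_tests())` as the last line, the script can be run with `python run_tests.py` in place of the long command.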
CONCLUSION
• Pytest is easy!
• Start now
• It will earn you trust and respect
• It is possible to use it even if your code is stochastic
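Two common ways to test stochastic code can be sketched as follows: fix the random seed so results are reproducible, or assert on a statistical property within a tolerance. The `simulate_daily_orders` function and its parameters are invented for illustration and are not from the talk.

```python
import random
import statistics


def simulate_daily_orders(n_days, mean=100.0, spread=15.0, rng=None):
    """Hypothetical stochastic function: draw n_days normally distributed values."""
    if rng is None:
        rng = random.Random()
    return [rng.gauss(mean, spread) for _ in range(n_days)]


def test_same_seed_gives_same_result():
    # Strategy 1: inject a seeded generator so the output is reproducible.
    first = simulate_daily_orders(50, rng=random.Random(42))
    second = simulate_daily_orders(50, rng=random.Random(42))
    assert first == second


def test_mean_is_within_tolerance():
    # Strategy 2: assert on a statistical property with a tolerance,
    # not on exact values. With 10,000 draws the standard error of the
    # mean is 15 / sqrt(10000) = 0.15, so a tolerance of 1.0 is generous.
    values = simulate_daily_orders(10_000, rng=random.Random(0))
    assert abs(statistics.mean(values) - 100.0) < 1.0
```

Passing the generator in as an argument, instead of seeding the global `random` module, keeps each test independent of the order in which tests run.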
Time for questions!

Do Your Homework! Writing tests for Data Science and Stochastic Code - David Waterman
