Open Source Big Graph Analytics on Neo4j with Apache Spark - Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You’ll learn how to analyze big data graphs that are exported from Neo4j and consequently updated from the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
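As a rough illustration of the first of those analyses, the sketch below runs the same fixed-point iteration that a PageRank job computes, in plain Python over a toy adjacency list. The graph, damping factor, and iteration count are illustrative assumptions, not taken from the talk or from GraphX's API:

```python
# Toy PageRank: repeatedly redistribute each node's rank to its out-neighbors,
# damped toward a uniform baseline, until the ranks stabilize.
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for m in neighbors:
                    new_rank[m] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new_rank[m] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

# "c" is linked to by both "a" and "b", so it ends up with the highest rank.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In GraphX the same computation is a one-liner on a `Graph` object; the point of the sketch is only what the algorithm does with the edges.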
Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro... - Databricks
The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data.
Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.
Open Data Science Conference Agile Data - DataKitchen
To rephrase an old saying: ‘It takes a village to raise an Analyst.’ Data Analysts and Scientists are working in teams delivering insight and analysis on an ongoing basis. So how do you get the team to support experimentation and insight delivery without ending up in an IT Engineer vs Analyst vs Data Governance war? We present 5 shocking steps to get these teams of people working together with practical, doable steps that can help you achieve data agility.
Consolidating MLOps at One of Europe’s Biggest Airports - Databricks
At Schiphol airport we run a lot of mission critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now in times of Covid it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics and above all to monitor them actively to see if they still fit the current state of affairs.
To meet those needs we rely on MLflow, but we have also integrated it with many of our other systems: we have written Airflow operators for MLflow to ease the retraining of our models, integrated MLflow deeply with our CI pipelines, and connected it to our model monitoring tooling.
In this talk we will take you through the way we rely on MLFlow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we are achieving the same benefits and speed as you have with a traditional software CI pipeline.
This talk will focus on techniques, metrics and different tests (code, models, infra and features/data) that help developers of machine learning systems achieve CD.
Reproducible data science: review of Pachyderm, Data Version Control and GIT ... - Josh Levy-Kramer
The advances in machine learning are great, yet, in order to have real value within a company, data scientists must be able to go from a research project to a reproducible process. A common problem is that the code is intrinsically linked to the data it was developed against. Hence it is critically important to track, trace and validate the input data used to train and test the algorithm. This talk will review several tools for data versioning and processing.
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da... - Databricks
PixieDust is a new open source library that helps data scientists and developers working in Jupyter Notebooks and Apache Spark be more efficient. PixieDust speeds up data manipulation and display with features like: auto-visualization of Spark DataFrames, real-time Spark job progress monitoring, automated local install of Python and Scala kernels running with Spark, and much more.
Come along and learn how you can use this tool in your own projects to visualize and explore data effortlessly with no coding. Oh, and if you prefer working with a Scala Notebook, this session is also for you, as PixieDust can also run on a Scala Kernel. Imagine being able to visualize your favorite Python chart engines from a Scala Notebook!
We’ll finish the session with a demo combining Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations–all running within a Notebook.
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail of the problem, build/buy/adopt analysis and Lyft's solution - Amundsen, along with thoughts on the future.
Warehousing Your Hits - The Why and How of Owning Your Data - Scott Arbeitman
These are the slides from my recent presentation at Melbourne's Web Analytics Wednesdays. I talk about transitioning from collecting your data in primary digital analytics systems to storing it in a data warehouse or data lake.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Amundsen: From discovering data to securing data - markgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Delivered by Josh Katz (Graphics Editor, The New York Times) at the 2016 New York R Conference on April 8th and 9th at Work-Bench. See the rest of the conference videos & presentations at http://www.rstats.nyc.
Many of us data science and business analytics practitioners perform research and analysis for decision makers on a regular basis. The deliverable of such analysis often results in a Power Point presentation, and/or a model that needs to be productionalized. The code used to produce the analysis also needs to be considered a deliverable.
Many of us perform analysis without reproducibility in mind. With the increasing democratization of data, it is becoming more and more important for people that may not have scientific training to be able to create analysis that can be picked up by somebody else who can then reproduce your results. That, and creating reproducible research is just solid science.
We are going to spend an evening walking through the various tools available to create reproducible research on Big Data. You will get introduced to the Tidyverse of R packages and how to use them. We will discuss the ins and outs of various notebook technologies like Jupyter and Zeppelin. You will have an opportunity to learn how to get up and running with R and Spark and the various options you have to learn on real clusters instead of just your local environment. There will also be a quick introduction to source control and the various options you have around using Git.
The theme of the evening will be “getting started”. We will go over various training resources and show you the optimal path to go from zero to master. Some commentary will be provided around the current state of the job market and intel from the front lines of the data science language wars. This is a large topic and the evening will be fairly dynamic and responsive to the needs of the audience.
Bob Wakefield has spent the better part of 16 years building data systems for many organizations across various industries. He has been running Hadoop in a lab environment for 3 years. He is the principal of Mass Street Analytics, LLC, a boutique data consultancy. Mass Street is a Hortonworks Consultant Partner and Confluent Partner.
In his spare time, he likes to work on an equity investment application that combines various sources of information to automatically arrive at investing decisions. When he is not doing that, you’ll find him flying his A-10 simulator. Full CV can be found here: https://www.linkedin.com/in/bobwakefieldmba/
Product Management in the Era of Data Science - Mandar Parikh
My slide-deck from a webinar on the same topic for the Institute of Product Leadership, April 4th, 2017
What does it take to build killer products in the “AI-first” era? What makes for a great Data Science-driven product and how do great Product Managers leverage Data Science to drive value for customers? Find out how to avoid the pitfalls of hype-chasing Data Science tactics. Learn how to work with Data Science and Engineering to build a compelling product and solve real problems.
Mandar takes a practitioner’s approach to present his recipe for success for building Data Science-driven products that drive enduring value for customers.
How to effectively deliver Data Science projects. This presentation is aimed at improving collaboration and communication between data science and data engineering. With Agile discipline, we could further improve the process of incremental value delivery.
JavaZone 2018 - A Practical(ish) Introduction to Data Science - Mark West
Code: https://github.com/markwest1972/titanic
Video: https://vimeo.com/289705893
Data Science has been described as the sexiest job of the 21st Century. But what is Data Science? And what has Machine Learning got to do with all of this?
In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
1. I’ll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
2. Next up we’ll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples for use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
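As a hedged sketch of what step 3 of such a talk typically starts from, here is a minimal scikit-learn workflow: load a bundled dataset, split it, fit a model, and score it on held-out data. The dataset, model choice, and split parameters are illustrative assumptions, not the talk's actual demo (that code lives in the linked GitHub repo):

```python
# Minimal scikit-learn "getting started" workflow: data -> split -> fit -> score.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small bundled dataset (150 iris flowers, 3 species).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so we can measure generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a simple, interpretable classifier on the training rows only.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on rows the model has never seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
```

The same four-step shape (load, split, fit, score) carries over to essentially every estimator in the library, which is what makes it a good first demo.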
Lean Analytics is a set of rules to make data science more streamlined and productive. It touches on many aspects of what a data scientist should be and how a data science project should be defined to be successful. During this presentation Richard will present where data science projects go wrong, how you should think of data science projects, what constitutes success in data science and how you can measure progress. This session will be loaded with terms, stories and descriptions of project successes and failures. If you're wondering whether you're getting value out of data science, how to get more value out of it and even whether you need it then this talk is for you!
What you will take away from this session
Learn how to make your data science projects successful
Evaluate how to track progress and report on the efficacy of data science solutions
Understand the role of engineering and data scientists
Understand your options for processes and software
NDC Oslo: A Practical Introduction to Data Science - Mark West
Data Science has been described as the sexiest job of the 21st Century. But what is Data Science? And what has Machine Learning got to do with all this?
In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
(1) I’ll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
(2) Next up we’ll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples for use cases where these algorithms can be applied.
(3) The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
Highlights and summary of long-running programmatic research on data science: practices, roles, tools, skills, organization models, workflow, outlook, etc. Profiles and persona definition for the data scientist model. Landscape of org models for data science and drivers for capability planning. Secondary research materials.
There's a new breed of digital marketers who are employing data science practices to achieve better results more efficiently. Whether it's SEO, email marketing, marketing automation, response conversion, funnel optimization, or web analytics, these hybrid data scientists / data-driven digital marketers are using advanced mathematics, machine learning, predictive analytics, statistical modeling, and data mining in conjunction with marketing savvy to drive better results and grow companies faster. We'll discuss some case studies, best practices, and simple pseudo-data-science actions that even the non-data scientist can put to use immediately.
Data science is a multidisciplinary field that combines scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves analyzing, interpreting, and deriving actionable information from large and complex datasets to support decision-making and solve problems in various domains.
Key components of data science include:
Data Collection and Preparation: Data scientists gather and collect data from various sources, which may include databases, websites, sensors, social media, or other digital platforms. They clean, transform, and preprocess the data to ensure its quality and suitability for analysis.
Data Exploration and Visualization: Data scientists explore and visualize the data using statistical techniques and visualization tools. They look for patterns, trends, and relationships within the data to gain a deeper understanding of the underlying insights and potential correlations.
Machine Learning and Predictive Modeling: Data scientists apply machine learning algorithms and predictive modeling techniques to build models that can make predictions or classifications based on the available data. This involves training models on historical data and evaluating their performance on new or unseen data.
Statistical Analysis: Statistical analysis is a fundamental aspect of data science. Data scientists use statistical methods to analyze data, test hypotheses, identify significant variables, and quantify uncertainties to make informed decisions.
Data Interpretation and Communication: Data scientists interpret the results of their analysis and communicate their findings to stakeholders in a clear and meaningful way. They use data visualization techniques, storytelling, and data-driven insights to convey complex information and facilitate decision-making.
Domain Knowledge: Data scientists often work in specific domains or industries and require domain knowledge to understand the context and interpret the results effectively. This allows them to identify relevant variables, apply appropriate techniques, and generate actionable insights.
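The train-on-historical, evaluate-on-unseen loop described under "Machine Learning and Predictive Modeling" can be sketched in a few lines of plain Python; the labels and the majority-class baseline below are invented purely for illustration:

```python
# Sketch of the train/evaluate split: "fit" a trivial majority-class model
# on historical labels, then measure its accuracy on labels it never saw.
from collections import Counter

historical = ["spam", "ham", "spam", "spam", "ham", "spam"]   # training labels
unseen     = ["spam", "spam", "ham", "spam"]                  # held-out labels

# "Training": the majority-class baseline just memorises the most common label.
majority_label = Counter(historical).most_common(1)[0][0]

# "Evaluation": fraction of held-out labels the baseline predicts correctly.
accuracy = sum(label == majority_label for label in unseen) / len(unseen)
```

Real models replace the majority-class rule with something learned from features, but the discipline is the same: the score that matters is the one computed on data excluded from training.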
Data science has applications across various sectors, including finance, healthcare, marketing, retail, telecommunications, and more. It helps organizations gain a competitive advantage, optimize processes, identify trends, improve customer experiences, and drive data-informed decision-making.
To work in data science, proficiency in programming languages (such as Python or R), statistical knowledge, data manipulation skills, and experience with machine learning algorithms are typically required. Data scientists also need critical thinking, problem-solving abilities, and effective communication skills to effectively analyze data and communicate insights to both technical and non-technical stakeholders.
Data Security and Protection in DevOps - Karen Lopez
Presentation to the London #WinOps event, Sept 2019, focusing on data security, privacy, and protection in DevOps efforts. Includes data masking, dev and test data, Always Encrypted, and more.
Modernizing, Migrating & Mitigating - Moving to Modern Cloud & API Web Apps W... - Security Innovation
This talk will help you, as a decision maker or architect, to understand the risks of migrating a thick client or traditional web application to the modern web. In this talk I’ll give you tools and techniques to make the migration to the modern web painless and secure so you can mitigate common pitfalls without having to make the mistakes first. I’ll be doing demos, and telling lots of stories throughout.
Making some good architectural decisions up front can help you:
- Minimize the risk of data breach
- Protect your users' privacy
- Make secure choices the easy default for your developers
- Understand the cloud security model
- Create defaults, policies, wrappers, and guidance for developers
- Detect when developers have bypassed security controls
The Next Generational Shift In Enterprise Infrastructure Has Arrived. The report is also available at https://www.scribd.com/document/352452857/2017-Enterprise-Almanac
AI to Enable Next Generation of People Managers - Work-Bench
In our work with hundreds of top fast growth startups and globally-distributed Fortune 1000 corporations in our enterprise tech ecosystem here in NYC, the most common refrain we hear is: "managing people is hard."
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, AI, big data, real-time systems, robots, and Milvus.
A lively discussion with NJ Gen AI Meetup lead Prasad and Procure.FYI's Co-Founder.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated through institutional investment rotating out of offices and into work from home (“WFH”) arrangements, while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/