Assumptions: Check yo'self before you wreck yourself

Assumptions:
Check yo self, before
you wreck yo self.
Erin Shellman @erinshellman
Seattle Software Craftsmanship
August 28, 2014

Assumptions:
Making an ass out of you
and me.

I’m Erin, and I’m a
data scientist.

…and when?
What about these?

Price optimization
1. Git yer Big Data!
2. Forecast demand: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
3. Optimize price: $\max \sum \text{revenue}$
4. Profit!!!!!

Do the easiest thing
•Subset the data and focus on one category of
product.
• e.g. Alpine ski bindings.
• Prototype & validate in R.
$\text{Units Sold}_i = \alpha + \beta_1(\text{price}_i) + \epsilon_i$

The error term $\epsilon_i$ is the residual.
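
A minimal sketch of the prototype step, assuming simulated data because the real sales data is not shown. The data frame `bindings`, its coefficients, and the noise level are invented for illustration; only the formula `units_sold ~ price` comes from the slides.

```r
# Simulated stand-in for one product category (all numbers are made up).
set.seed(42)
bindings <- data.frame(price = runif(200, min = 100, max = 500))
bindings$units_sold <- 2000 - 3 * bindings$price + rnorm(200, sd = 150)

# Simple linear regression: units_sold_i = alpha + beta_1 * price_i + epsilon_i
fit <- lm(units_sold ~ price, data = bindings)

coef(fit)              # estimated alpha (intercept) and beta_1 (slope)
head(residuals(fit))   # epsilon_i: observed minus fitted units sold
```

Everything that follows in the talk is about whether those residuals behave the way the model assumes.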

Assumptions of SLR
•We assume that the residuals:
1. Are normally distributed, with mean zero.
2. Are not autocorrelated.
3. Are unrelated to the predictors.

Checking assumptions is
hard
•…and boring!
•For statistical methods, assumption
testing traditionally relies on
visually inspecting plots (and let's
be real, most people don’t even
do that).
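
For reference, the visual inspection mentioned above usually means the four default diagnostic plots for an `lm` fit. A small sketch on simulated data (the data frame `sales` and its numbers are assumptions for illustration), writing the plots to a temporary PDF so it also runs non-interactively:

```r
# Simulated data; the real sales data is not part of the slides.
set.seed(1)
sales <- data.frame(price = runif(150, min = 100, max = 500))
sales$units_sold <- 2000 - 3 * sales$price + rnorm(150, sd = 150)
fit <- lm(units_sold ~ price, data = sales)

# plot() on an lm object draws four diagnostics by default: residuals vs
# fitted, a normal Q-Q plot, scale-location, and residuals vs leverage.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
par(mfrow = c(2, 2))   # arrange the four plots in a 2 x 2 grid
plot(fit)
invisible(dev.off())   # close the device so the file is written
```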

[Figure: the four diagnostic plots for the fitted model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).]

Of all the practices you can
leverage to assist your
craftsmanship, you will get
the most benefit from testing.
Stephen Vance

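test_that assumption!
# tests/test_slr.R
library(testthat)

context("Check assumptions of SLR")

# model_object is the fitted lm; data$covariates holds the predictor values.
test_that("The residuals are normally distributed", {
  # Shapiro-Wilk: a p-value above 0.05 means normality is not rejected.
  # (Current testthat spells this check expect_gt().)
  expect_that(shapiro.test(model_object$residuals)$p.value, is_more_than(0.05))
})

test_that("There is no autocorrelation", {
  # Breusch-Godfrey test from the lmtest package
  expect_that(lmtest::bgtest(model_object)$p.value, is_more_than(0.05))
})

test_that("The residuals are unrelated to the predictor", {
  expect_that(cor(model_object$residuals, data$covariates), equals(0))
})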
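
The same three checks can also be sketched as a plain function that returns pass/fail flags, which is handy for logging outside a test runner. This is an illustration, not the talk's code: `check_slr_assumptions` is a hypothetical helper, the data is simulated, and the Ljung-Box test (`Box.test` from base R) stands in for the Breusch-Godfrey test so that the sketch needs no extra packages.

```r
# Hypothetical helper: TRUE means the assumption was not rejected.
check_slr_assumptions <- function(fit, covariate, alpha = 0.05, tol = 1e-6) {
  errors <- residuals(fit)
  c(
    normal_residuals   = shapiro.test(errors)$p.value > alpha,
    # Ljung-Box swapped in for Breusch-Godfrey (assumption of this sketch)
    no_autocorrelation = Box.test(errors, type = "Ljung-Box")$p.value > alpha,
    unrelated_to_x     = abs(cor(errors, covariate)) < tol
  )
}

# Simulated data, made up for illustration.
set.seed(7)
sales <- data.frame(price = runif(100, min = 100, max = 500))
sales$units_sold <- 2000 - 3 * sales$price + rnorm(100, sd = 150)

fit <- lm(units_sold ~ price, data = sales)
checks <- check_slr_assumptions(fit, sales$price)
print(checks)
```

With an intercept in the model, least squares makes the correlation between the residuals and the fitted predictor term zero by construction, so the third check says the most when the covariate passed in is not the exact term in the formula, for example raw price against a model fit on `log(price + 1)`.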

Tests pass!
> test_file("./tests/test_slr.R")
Check assumptions of SLR : [1] "units_sold ~ price"
...

Psych.
> test_file("./tests/test_slr.R")
1..

1. Failure(@test_slr.R#12): The residuals are normally distributed
------------------------
shapiro.test(model_object$residuals)$p.value not more than 0.05. Difference: 0.05

Linear? Eh.
•We assumed the functional form was
linear, but there are several common forms
that might better fit the data.
[Figure: scatter plot of Units Sold (0 to 2500) against Price ($), 100 to 500.]

[Figure: Units Sold vs. Price ($) under four functional forms: Linear, Log-log, Linear-log, Log-linear.]

[Figure: four panels of Units Sold vs. Price ($), captioned in order:]
• Linear response to change in price.
• Much more sensitive to change in price.
• More gradual response to changes in price.
• Sensitive initially, then gradual.

# Automagically explore SLR with common functional forms
library(testthat)

candidate_models = list(linear    = 'units_sold ~ price',
                        loglog    = 'log(units_sold + 1) ~ log(price + 1)',
                        linearlog = 'units_sold ~ log(price + 1)',
                        loglinear = 'log(units_sold + 1) ~ price')

# generate_forecast(), optimizer() and plot_price() are not shown here.
run = function(candidate_models, input_data) {
  forecasts = list()
  test_input = data.frame(price = 0:1000)

  # Forecast
  for (model in candidate_models) {
    test_environment = new.env()

    # Generate the forecast
    forecasts[[model]] = generate_forecast(model, input_data)

    # Save off current value of things for testing
    assign("model", forecasts[[model]], envir = test_environment)
    assign("errors", forecasts[[model]]$residuals, envir = test_environment)
    assign("covariate", input_data$price, envir = test_environment)
    assign("label", model, envir = test_environment)

    save(test_environment, file = 'env_to_test.Rda')

    # Run assumption tests
    test_file("./tests/test_slr.R")

    #### OPTIMIZE PRICE!!! ####
    opt_results = optimizer(forecasts[[model]], test_input)

    # Multiply the predicted demand by the price for expected revenue
    # (test_input is the price grid defined above)
    opt_results$expected_revenue = test_input$price * opt_results$predicted_units_sold

    pdf(paste(model, ".pdf", sep = ''))
    plot_price(opt_results)
    dev.off()  # close the PDF device so the file is written
  }

  return(forecasts)
}
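
`generate_forecast` and `optimizer` are not shown in the slides. One way they could look, purely as an assumption for illustration: fit each formula string with `lm`, predict demand over the price grid, and undo the `log(y + 1)` transform where the formula uses it. The column name `predicted_units_sold` matches the `run` function; everything else, including the simulated `slr_data`, is hypothetical.

```r
# Hypothetical helper: fit one candidate formula (given as a string).
generate_forecast <- function(model, input_data) {
  lm(as.formula(model), data = input_data)
}

# Hypothetical helper: predict demand on a price grid, on the original scale.
optimizer <- function(fit, test_input) {
  predicted <- predict(fit, newdata = test_input)
  # If the response was log(units_sold + 1), invert it with exp(.) - 1.
  response <- deparse(formula(fit)[[2]])
  if (grepl("^log\\(", response)) predicted <- exp(predicted) - 1
  data.frame(price = test_input$price,
             predicted_units_sold = as.numeric(predicted))
}

# Simulated data, made up for illustration.
set.seed(3)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- pmax(0, 2000 - 3 * slr_data$price + rnorm(200, sd = 150))

fit <- generate_forecast("log(units_sold + 1) ~ price", slr_data)
opt_results <- optimizer(fit, data.frame(price = 0:1000))
opt_results$expected_revenue <- opt_results$price * opt_results$predicted_units_sold

# Price on the grid with the highest expected revenue
opt_results$price[which.max(opt_results$expected_revenue)]
```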

rut roh…
> run(candidate_models, slr_data)
1..

1. Failure(@test_slr.R#12): The residuals are normally distributed ---------------------------------
shapiro.test(linear$residuals)$p.value not more than 0.05. Difference: 0.05

Check assumptions of SLR : [1] "log(units_sold + 1) ~ log(price + 1)"
1.2

2. Failure(@test_slr.R#24): The residuals are unrelated to the predictor ---------------------------
cor(test_environment$errors, test_environment$covariate) not equal to 0
Mean absolute difference: 0.05545615

Check assumptions of SLR : [1] "units_sold ~ log(price + 1)"
1.2

2. Failure(@test_slr.R#24): The residuals are unrelated to the predictor ---------------------------
cor(test_environment$errors, test_environment$covariate) not equal to 0
Mean absolute difference: 0.04201906

Check assumptions of SLR : [1] "log(units_sold + 1) ~ price"
1..

[Figure: Expected Revenue vs. Price ($), $0 to $1000, one panel per candidate model. Annotations: Optimal Price = $322; Optimal Price > $1000; Optimal Price = $∞; Optimal Price = $779.]
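
For the linear form the optimum has a closed form, which gives a quick sanity check on any numerical optimizer. If demand is q(p) = a + b·p with b < 0, revenue p·q(p) peaks at p* = −a / (2b). A sketch with made-up coefficients (`a` and `b` below are assumptions, not the fitted values from the talk):

```r
a <- 2000   # hypothetical intercept: units sold at a price of $0
b <- -3     # hypothetical slope: units lost per extra dollar

# Expected revenue under linear demand
revenue <- function(price) price * (a + b * price)

closed_form <- -a / (2 * b)
numeric_opt <- optimize(revenue, interval = c(0, 1000), maximum = TRUE)$maximum

c(closed_form = closed_form, numeric = numeric_opt)
```

When a fitted revenue curve is still rising at the edge of the price grid, a bounded search simply returns the boundary, which says more about how the model extrapolates than about pricing.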

[Figure: histogram of Price ($), 100 to 400, with Counts on the y-axis. Mean = 185.]

In conclusion, these
forecasts suck.
We are just
getting
warmed up!

[Figure: Units Sold (0 to 2000) vs. Price ($), 0 to 500, in three panels by ability level: Beginner-Intermediate, Intermediate-Advanced, Advanced-Expert.]

[Figure: Units Sold by Date, 2011-06-01 through 2014-02-01.]

TIME?!

Try something a little
smarter
$\text{Units Sold}_i = \alpha + \beta_1(\text{price}_i) + \beta_2(\text{ability}_i) + \beta_3(\text{month}_i) + \epsilon_i$
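
A sketch of how this richer model might be fit in R, on simulated data. Treating ability and month as factors (one offset per level rather than a single slope) is an assumption of this example, as are the data frame `skis` and every number in it.

```r
# Simulated data, made up for illustration.
set.seed(11)
n <- 600
month_num <- sample(1:12, n, replace = TRUE)
ability_levels <- c("Beginner-Intermediate", "Intermediate-Advanced",
                    "Advanced-Expert")

skis <- data.frame(
  price   = runif(n, min = 100, max = 500),
  ability = factor(sample(ability_levels, n, replace = TRUE),
                   levels = ability_levels),
  month   = factor(month_num, levels = 1:12)
)

# Made-up demand: a price effect plus a winter bump (Nov through Feb).
winter <- month_num %in% c(11, 12, 1, 2)
skis$units_sold <- 1500 - 2 * skis$price + 400 * winter + rnorm(n, sd = 100)

# ability and month enter as factors, so each level gets its own offset.
fit <- lm(units_sold ~ price + ability + month, data = skis)

# intercept + price + 2 ability dummies + 11 month dummies
length(coef(fit))
```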

[Figure: Expected Revenue (0 to 15000) vs. Price ($), $0 to $1000, faceted by ability level (Beginner-Intermediate, Intermediate-Advanced, Advanced-Expert) and by month (1 to 12).]

Yeah, but who cares?
•Do we need to throw everything out
just because some assumptions are
violated?
•What is our goal?
•Is it still better than what we did
previously?

Wrap it up.
1. Do the easiest thing first, and do it well.
It’s how you’re going to learn the domain,
and it’s your benchmark for improvement.
2. Test your assumptions, and invest time in
building the tools needed to do that
effectively.
3. Be cool, stay in school.

Thanks bros!!
Nathan Decker, Brian Pratt & the Evo crew 
Jason Gowans & Bryan Mayer 
Elissa “Downtown” Brown, forecasting genius 
John Foreman, MailChimp 
#nordstromdatalab 

Click-bait!
1. Data Carpentry: http://mimno.infosci.cornell.edu/b/articles/carpentry/
2. Getting started with testthat: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf
3. Clean Code: http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882/
4. Quality Code: http://www.amazon.com/Quality-Code-Software-Principles-Practices/dp/0321832981
5. Revenue Management: http://www.amazon.com/Practice-Management-International-Operations-Research/dp/0387243763/
6. Pricing and Revenue Optimization: http://www.amazon.com/Pricing-Revenue-Optimization-Robert-Phillips-ebook/dp/B005JTDOVE/
7. Original G, Rob Hyndman: https://www.otexts.org/fpp and http://robjhyndman.com/hyndsight/
