Mistakes I've Made
PyData Seattle 2015
Cam Davidson-Pilon
Who am I?
Cam Davidson-Pilon
- Lead on the Data Team at Shopify
- Open source contributor
- Author of Bayesian Methods for Hackers
(in print soon!)
Ottawa?
Case Study 1
We needed to predict mail return
rates based on census data.
Sample Data (simplified):
Well, I'm predicting the rate, so I'll build a model for exactly that:
Don't need the margin of error...
...then do "data science"
Outcome: failure
What went wrong? At the time, ¯\_(ツ)_/¯
(highly, highly recommended!)
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
"The std. deviation of the sample mean is
equal to the std. deviation of the
population over square-root n"
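A minimal sketch of why this bites with aggregate data (all numbers hypothetical): give every region the same true rate but very different sample sizes, and the most extreme observed rates come from the smallest samples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical census-style setup: every region has the same true
# return rate, but regions differ wildly in how many households
# were sampled.
true_rate = 0.3
sizes = rng.integers(10, 5000, size=1000)
observed = rng.binomial(sizes, true_rate) / sizes

# The most extreme observed rates come from the smallest samples --
# exactly the regions a naive analysis would flag as "interesting".
extremes = np.argsort(np.abs(observed - true_rate))[-10:]
print(sizes[extremes])     # almost all tiny
print(observed[extremes])  # far from 0.3 purely by chance
```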
What I learned
1. Sample sizes are so important when dealing with aggregate-level data.
2. It was only an issue because the sample sizes were
different, too.
3. Use the Margin of Error, don't ignore it - it's there for a
reason.
4. I got burned so bad here, I became a Bayesian soon after.
Case Study 2
An intra-day time series of the S&P, Dow, Nasdaq, and FTSE (a UK index)
Suppose you are interested in doing some day trading.
Your target: UK stocks - futures on the FTSE in particular.
Post Backtesting Results
Push to Production -
investing real money
What happened?
Data Leakage happened
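Data leakage is easiest to see in preprocessing. A minimal sketch (hypothetical price series; real backtests leak in subtler ways): scaling the whole series before splitting lets the training window "see" statistics of the future.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
prices = rng.normal(0, 1, 1_000).cumsum().reshape(-1, 1)  # toy price series
split = 800

# LEAKY: mean/std are computed from the full series, so the
# "past" has already seen the test period.
leaky_train = StandardScaler().fit_transform(prices)[:split]

# CLEAN: fit preprocessing on the training window only, then
# apply it unchanged to the held-out future.
scaler = StandardScaler().fit(prices[:split])
train = scaler.transform(prices[:split])
test = scaler.transform(prices[split:])
```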
What I learned
1. Your backtesting / cross-validation results will always be at least as optimistic as reality - plan for that.
2. Understand where your data comes from, from start to
finish.
Case Study 3
What I learned
1. When developing statistical software that already exists in the wild, write tests against the output of that software (see the sketch below).
2. Be responsible for your software.
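For point 1, a minimal sketch of such a test (the function is a hypothetical stand-in; the reference number is R's actual output for `sd()` on this data):

```python
import numpy as np

def my_sample_std(x):
    # Statistic under development; ddof=1 matches R's sd().
    x = np.asarray(x, dtype=float)
    return np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1))

def test_matches_reference_implementation():
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    # Reference value computed once in R: sd(c(2,4,4,4,5,5,7,9))
    r_output = 2.1380899
    assert abs(my_sample_std(data) - r_output) < 1e-6
```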
Case Study 4
It was my first A/B test at
Shopify...
Control group: 4%
Experiment group: 5%
Bayesian A/B testing told me there was a statistically significant difference between the groups...
Upper management wanted
to know the relative increase...
(5% - 4%) / 4% = 25%
No.
We forgot sample size
again.
What I learned
1. Don't naively compute stats on top of stats - this only compounds the uncertainty (see the sketch below).
2. Better to underestimate than to overestimate.
3. Visualizing uncertainty is the role of a statistician.
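A minimal sketch of carrying the uncertainty through to the relative lift (the sample sizes are hypothetical; the observed rates match the 4% and 5% above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical counts giving the observed 4% vs 5% conversion rates.
n_c, k_c = 1_000, 40   # control
n_e, k_e = 1_000, 50   # experiment

# Beta posteriors under flat priors, as in Bayesian A/B testing.
p_c = rng.beta(k_c + 1, n_c - k_c + 1, size=100_000)
p_e = rng.beta(k_e + 1, n_e - k_e + 1, size=100_000)

relative_lift = (p_e - p_c) / p_c
print(np.percentile(relative_lift, [2.5, 50, 97.5]))
# The median is near +25%, but the 95% interval is enormous and
# includes negative lift at these sample sizes.
```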
Machine Learning
counterexamples
Sparsifying the solution naively
Coefficients after linear regression*:
*Assume data has been normalized too,
i.e. mean 0 and standard deviation 1
Decide to drop a variable:
Suppose this is the true model...
Okay, our regression got the coefficients right, but...
So actually, together, these variables have
very little contribution to Y!
Solution:
Any form of regularization will solve this.
For example, using ridge regression with even the slightest penalizer gives:
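A minimal sketch of both the trap and the fix (synthetic data; the coefficients are chosen to mimic the slide's setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two nearly collinear predictors, roughly standardized.
x = rng.normal(size=10_000)
w = x + 0.001 * rng.normal(size=10_000)
X = np.column_stack([x, w])

# True model: huge opposing coefficients that almost cancel, so
# together the variables contribute only ~0.2*x to Y.
y = 10 * x - 9.8 * w

ols = LinearRegression().fit(X, y)
print(ols.coef_)   # ~[10, -9.8]; dropping w would leave 10*x,
                   # fifty times the true joint contribution.

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_) # small coefficients summing to ~0.2, the
                   # variables' actual joint effect.
```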
PCA before
Regression
PCA is great at many things, but it can
actually significantly hurt regression if
used as a preprocessing step. How?
Suppose we wish to regress Y onto X and W.
The true model of Y is Y = X - W. We don't know this
yet.
Suppose further there is a positive
correlation between X and W, say 0.5.
Applying PCA to [X, W], we get a new matrix:
$$\left[\tfrac{1}{\sqrt{2}}(X + W),\ \tfrac{1}{\sqrt{2}}(X - W)\right]$$
Textbook analysis tells you to drop the second (lower-variance) dimension of this new matrix:
$$\left[\tfrac{1}{\sqrt{2}}(X + W)\right]$$
So now we are regressing Y onto this single column; i.e., we find values α and β to fit the model:

$$Y = \alpha + \beta\,(X + W)$$
But there are no good values for these
unknowns!
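Why not? A quick covariance check (assuming, as above, unit variances for X and W) shows the retained component is uncorrelated with Y:

$$\operatorname{Cov}(Y,\, X + W) = \operatorname{Cov}(X - W,\, X + W) = \operatorname{Var}(X) - \operatorname{Var}(W) = 0$$

so the best least-squares β is 0: the kept component carries no information about Y, while the dropped one was (a rescaling of) Y itself.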
Quick IPython Demo
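The demo itself isn't in the deck, but a minimal reconstruction of its point (synthetic data, correlation 0.5 as above) might look like:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# X and W positively correlated (rho = 0.5); true model Y = X - W.
cov = [[1.0, 0.5], [0.5, 1.0]]
XW = rng.multivariate_normal([0.0, 0.0], cov, size=5_000)
y = XW[:, 0] - XW[:, 1]

# Regressing on both columns recovers the model perfectly.
print(LinearRegression().fit(XW, y).score(XW, y))  # R^2 = 1.0

# Keep only the first principal component, as the textbook says...
Z = PCA(n_components=1).fit_transform(XW)
print(LinearRegression().fit(Z, y).score(Z, y))    # R^2 ~ 0.0
```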
Solution:
Don't naively use PCA before regression - you are losing information. Try something like supervised PCA, or just don't do it.
Thanks for listening :)
@cmrn_dp
