Mistakes I've Made
PyData Seattle 2015
Cam Davidson-Pilon
Who am I?
Cam Davidson-Pilon
- Lead on the Data Team at Shopify
- Open source contributor
- Author of Bayesian Methods for Hackers
(in print soon!)
Ottawa?
Case Study 1
We needed to predict mail return
rates based on census data.
Sample Data (simplified):
Well, I'm predicting the rate, so I'll build a model for exactly that:
Don't need the margin of error...
...then do "data science"
Outcome: failure
What went wrong? At the time, ¯\_(ツ)_/¯
(highly, highly recommended!)
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
"The std. deviation of the sample mean is
equal to the std. deviation of the
population over square-root n"
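A minimal sketch of why this bites with aggregate data (all numbers hypothetical): give every region the same true rate but very different sample sizes, and the most extreme observed rates come from the smallest samples.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical census-style setup: every region has the same true
# return rate, but regions differ wildly in how many households
# were sampled.
true_rate = 0.3
sizes = rng.integers(10, 5000, size=1000)
observed = rng.binomial(sizes, true_rate) / sizes

# The most extreme observed rates come from the smallest samples --
# exactly the regions a naive analysis would flag as "interesting".
extremes = np.argsort(np.abs(observed - true_rate))[-10:]
print(sizes[extremes])     # almost all tiny
print(observed[extremes])  # far from 0.3 purely by chance
```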
What I learned
1. Sample sizes are so important when dealing with aggregate-level data.
2. It was only an issue because the sample sizes were
different, too.
3. Use the Margin of Error, don't ignore it - it's there for a
reason.
4. I got burned so bad here, I became a Bayesian soon after.
Case Study 2
An intra-day time series of the S&P, Dow, Nasdaq, and FTSE (a UK index)
Suppose you are interested in doing some day trading.
Your target: UK stocks - futures on the FTSE in particular.
Post Backtesting Results
Push to Production -
investing real money
What happened?
Data Leakage happened
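Data leakage is easiest to see in preprocessing. A minimal sketch (hypothetical price series; real backtests leak in subtler ways): scaling the whole series before splitting lets the training window "see" statistics of the future.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
prices = rng.normal(0, 1, 1_000).cumsum().reshape(-1, 1)  # toy price series
split = 800

# LEAKY: mean/std are computed from the full series, so the
# "past" has already seen the test period.
leaky_train = StandardScaler().fit_transform(prices)[:split]

# CLEAN: fit preprocessing on the training window only, then
# apply it unchanged to the held-out future.
scaler = StandardScaler().fit(prices[:split])
train = scaler.transform(prices[:split])
test = scaler.transform(prices[split:])
```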
What I learned
1. Your backtesting / cross-validation results will always be at least as optimistic as reality - plan for that.
2. Understand where your data comes from, from start to
finish.
Case Study 3
What I learned
1. When developing statistical software that already exists in the wild, write tests against the output of that software (see the sketch below).
2. Be responsible for your software.
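For point 1, a minimal sketch of such a test (the function is a hypothetical stand-in; the reference number is R's actual output for `sd()` on this data):

```python
import numpy as np

def my_sample_std(x):
    # Statistic under development; ddof=1 matches R's sd().
    x = np.asarray(x, dtype=float)
    return np.sqrt(((x - x.mean()) ** 2).sum() / (len(x) - 1))

def test_matches_reference_implementation():
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    # Reference value computed once in R: sd(c(2,4,4,4,5,5,7,9))
    r_output = 2.1380899
    assert abs(my_sample_std(data) - r_output) < 1e-6
```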
Case Study 4
It was my first A/B test at
Shopify...
Control group: 4%
Experiment group: 5%
Bayesian A/B testing told me there was a statistically significant difference between the groups...
Upper management wanted
to know the relative increase...
(5% - 4%) / 4% = 25%
No.
We forgot sample size
again.
What I learned
1. Don't naively compute stats on top of stats - this only compounds the uncertainty (see the sketch below).
2. Better to underestimate than to overestimate.
3. Visualizing uncertainty is the role of a statistician.
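A minimal sketch of carrying the uncertainty through to the relative lift (the sample sizes are hypothetical; the observed rates match the 4% and 5% above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical counts giving the observed 4% vs 5% conversion rates.
n_c, k_c = 1_000, 40   # control
n_e, k_e = 1_000, 50   # experiment

# Beta posteriors under flat priors, as in Bayesian A/B testing.
p_c = rng.beta(k_c + 1, n_c - k_c + 1, size=100_000)
p_e = rng.beta(k_e + 1, n_e - k_e + 1, size=100_000)

relative_lift = (p_e - p_c) / p_c
print(np.percentile(relative_lift, [2.5, 50, 97.5]))
# The median is near +25%, but the 95% interval is enormous and
# includes negative lift at these sample sizes.
```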
Machine Learning
counterexamples
Sparsifying the solution naively
Coefficients after linear regression*:
*Assume data has been normalized too,
i.e. mean 0 and standard deviation 1
Decide to drop a variable:
Suppose this is the true model...
Okay, our regression got the coefficients right, but...
So actually, together, these variables have
very little contribution to Y!
Solution:
Any form of regularization will solve this.
For example, using ridge regression with even the slightest penalizer gives:
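A minimal sketch of both the trap and the fix (synthetic data; the coefficients are chosen to mimic the slide's setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two nearly collinear predictors, roughly standardized.
x = rng.normal(size=10_000)
w = x + 0.001 * rng.normal(size=10_000)
X = np.column_stack([x, w])

# True model: huge opposing coefficients that almost cancel, so
# together the variables contribute only ~0.2*x to Y.
y = 10 * x - 9.8 * w

ols = LinearRegression().fit(X, y)
print(ols.coef_)   # ~[10, -9.8]; dropping w would leave 10*x,
                   # fifty times the true joint contribution.

ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_) # small coefficients summing to ~0.2, the
                   # variables' actual joint effect.
```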
PCA before
Regression
PCA is great at many things, but it can
actually significantly hurt regression if
used as a preprocessing step. How?
Suppose we wish to regress Y onto X and W.
The true model of Y is Y = X - W. We don't know this
yet.
Suppose further there is a positive
correlation between X and W, say 0.5.
Applying PCA to [X, W], we get a new matrix:
$$\left[\tfrac{1}{\sqrt{2}}(X + W),\ \tfrac{1}{\sqrt{2}}(X - W)\right]$$
Textbook analysis tells you to drop the second (lower-variance) dimension of this new matrix:
$$\left[\tfrac{1}{\sqrt{2}}(X + W)\right]$$
So now we are regressing Y onto this single column; i.e., we find values α and β to fit the model:

$$Y = \alpha + \beta\,(X + W)$$
But there are no good values for these
unknowns!
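Why not? A quick covariance check (assuming, as above, unit variances for X and W) shows the retained component is uncorrelated with Y:

$$\operatorname{Cov}(Y,\, X + W) = \operatorname{Cov}(X - W,\, X + W) = \operatorname{Var}(X) - \operatorname{Var}(W) = 0$$

so the best least-squares β is 0: the kept component carries no information about Y, while the dropped one was (a rescaling of) Y itself.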
Quick IPython Demo
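The demo itself isn't in the deck, but a minimal reconstruction of its point (synthetic data, correlation 0.5 as above) might look like:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# X and W positively correlated (rho = 0.5); true model Y = X - W.
cov = [[1.0, 0.5], [0.5, 1.0]]
XW = rng.multivariate_normal([0.0, 0.0], cov, size=5_000)
y = XW[:, 0] - XW[:, 1]

# Regressing on both columns recovers the model perfectly.
print(LinearRegression().fit(XW, y).score(XW, y))  # R^2 = 1.0

# Keep only the first principal component, as the textbook says...
Z = PCA(n_components=1).fit_transform(XW)
print(LinearRegression().fit(Z, y).score(Z, y))    # R^2 ~ 0.0
```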
Solution:
Don't naively use PCA before regression - you are losing information. Try something like supervised PCA, or just don't do it.
Thanks for listening :)
@cmrn_dp
