Fall 2009
Statistics 622
Module 8
Avoiding Over-Confidence
OVERVIEW
FROM PREVIOUS CLASSES…
OVER-FITTING
AN EXAMPLE OF OVER-FITTING (NYSE_2003.JMP)
VISUALIZATION
COMMON SENSE TEST
EXAMPLE OF PREDICTING STOCKS
WHAT ARE THOSE OTHER PREDICTORS?
PROTECTION FROM OVER-FITTING
BONFERRONI = RIGHT ANSWER + ADDED BONUS
OTHER APPLICATIONS OF THE BONFERRONI RULE
DETECTING OVER-FITTING WITH A VALIDATION SAMPLE
CONTROLLING STEPWISE WITH A VALIDATION SAMPLE (BLOCK.JMP)
BACK TO BUSINESS
APPENDIX: BONFERRONI METHOD
THE BONFERRONI INEQUALITY
USE IN MODEL SELECTION
BONFERRONI RULE FOR P-VALUES
IT’S REALLY PRETTY GOOD
Copyright Robert A Stine
Revised 10/8/09
Overview
Stepwise models
Select the most predictive features from a list of candidate
features that you provide, incrementally improving the fit of
the model by as much as possible at each step.
When automated, the search continues so long as the
feature improves the model enough as gauged by its p-
value.
Over-fitting1
If the search is allowed to choose predictors too “easily”,
stepwise selection will identify predictors that ought not be
in the model, producing an artificially good fit when in fact
the model has been getting worse and worse.
Bonferroni rule
The Bonferroni rule lets us halt the search without having
to set aside a validation sample, allowing us to use all the
data for finding a predictive model rather than a subset.
Though automatic, you should still use your knowledge of
the context to offer more informed choices of features to
consider for the modeling.
1 For another example of over-fitting when modeling stock returns, see BAUR pages 220-227.
From previous classes…
Cost of uncertainty
An accurate estimate of mean demand improves profits.
Suggests that we should use more predictors in models,
including more combinations of features that capture
synergies among the features (interactions).
Stepwise regression
Automates the tedious process of working through the
various interactions and other candidate features.
Problem: Over-confidence
The combination of
Desire for more accurate predictions
+
Automated searches that maximize fitted R2
Creates the possibility that our predictions are not so
accurate as we think.
Over-fitting results when the modeling process leads us to
build a model that captures random patterns in the data that
will not be present in predicting new cases. The fit of the
model looks better on paper than in reality.
Other situations with over-confidence
Subjective confidence intervals
Winner’s curse in auctions
Two methods for recognizing and avoiding over-fitting
Bonferroni p-values, which do not require the use of a
validation sample in order to test the model
Cross-validation, which requires setting aside data to test
the fit of a model.
Over-fitting
False optimism
Is your model as good as it claims? Or, has your hard work
to improve its fit to the data exaggerated its accuracy?
“Optimization capitalizes on chance”
When we use the same data to both fit and evaluate a
model, we get an “optimistic” impression of how well the
model predicts. This process that leads to an exaggerated
sense of accuracy is known as over-fitting.
When a model has been over-fit, predictors that appear
significant from the output do not in fact improve the
model’s ability to predict the new cases.
Perhaps many of the predictors that are in a model have
arrived by chance alone because we have considered so
many possible models.
Over-fitting
Adding features to a model that improve its fit to the
observed data, but that degrade the ability of a model to
predict new cases.
Iterative refinement of a model (either manually or by an
automated algorithm) in order to improve the usual
summaries (e.g., R2 and p-values) typically generates a
better fit to the observed data used to pick the predictors
than will be obtained when predicting new data.
No good deed goes unpunished!
It’s the process, not the model
Over-fitting does not happen if we pick a large group of
predictors and simply fit one big model, without iteratively
trying to improve its fit.
An Example of Over-fitting (nyse_2003.jmp)
Stock market analysis
Over-fitting is common in domains in which there is a lot
of pressure to obtain accurate predictions, as in the case of
predicting the direction of the stock market.
Data: daily returns on the NYSE composite index in
October and November 2003.
Objective: Build a model to predict what will happen in
December 2003, using a battery of 12 trading rules (labeled
X1 to X12).
These are a few very basic technical trading rules.
Model selection criteria
Many numerical criteria have been proposed that can be
used as an alternative to maximizing R2 when judging the
quality of a model.
This table lists several well-known criteria. To use these in
forward stepwise, control the forward search by using these
“Prob-to-enter” values.
Name         Prob-to-Enter   Approximate t-stat for inclusion   Idea
Adjusted R2  .33             |t| > 1                            Decrease RMSE
AIC, Cp      .16             |t| > √2                           Unbiased estimate of prediction accuracy
BIC          Depends on n    |t| > ½ log n                      Bayesian probability
Bonferroni   1/m             |t| > √(2 log m)                   Minimize worst-case, family-wide error rate
Search domain for the example
Consider interactions among 12 exogenous features. The
total number of features available to stepwise is then
m = 12 + 12 + 12 × 11/2 = 24 + 66 = 90
Wide data set
There are 42 trading days in October and November. With
interactions, we have more features than cases to use.
m = 90 > n = 42
Hence we cannot fit the saturated model with all features.2
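To make the feature count concrete, here is a minimal Python sketch (added for illustration; it is not part of the original JMP workflow) that tallies the raw features, squares, and pairwise interactions and compares the total to the number of trading days.

```python
# Count the candidate features offered to stepwise:
# k raw features, k squared terms, and k*(k-1)/2 pairwise interactions.
k = 12                      # the "technical trading rules" X1..X12
n = 42                      # trading days in October-November 2003
raw = k
squares = k
interactions = k * (k - 1) // 2
m = raw + squares + interactions
print(m)                    # 90 candidate features
print(m > n)                # True: more features than cases (a "wide" data set)
```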
AIC criterion for forward search
Set “Prob to Enter” = 0.16 and run the search forward.
The stepwise search never stops!
A greedy search becomes gluttonous when offered so many
choices relative to the number of cases that are available.
2 You can show that often the best model is the so-called “saturated” model that has every feature included as a predictor in the fit. But, you can only do this when you have more cases than features, typically at least 3 per predictor (a crude rule of thumb for the ratio n/m).
To avoid the cascade, make it harder to add a predictor;
reducing the “Prob to enter” to 0.10 gives this result:
The search stops after adding 20 predictors.
Optionally, following a common convention, we can “clean up”
the fit and make it appear more impressive by stepping
backward to remove collinear predictors that are redundant.
The backward elimination removes 3 predictors.
Make the model and obtain the usual summary.
This “Summary of Fit” suggests a great model.
Any diagnostic procedure that ignores how we chose the
features to include in this model finds no problem. All
conclude that this is a great-fitting model, one that is highly
statistically significant.
Look at all of the predictors whose p-value < 0.0001.
These easily meet the Bonferroni threshold, when applied
after the fact.
Summary of Fit
RSquare 0.949
Root Mean Square Error 0.191
Mean of Response 0.177
Observations (or Sum Wgts) 42
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      17   16.214437        0.953790      26.1361   <.0001
Error      24   0.875838         0.036493
C. Total   41   17.090274
Parameter Estimates
Term Est Std Err t Ratio Prob>|t|
Intercept -0.090 0.058 -1.56 0.1317
Exogenous 6 0.093 0.036 2.60 0.0156
Exogenous 9 0.256 0.046 5.59 <.0001
Exogenous 10 0.326 0.058 5.62 <.0001
(Exogenous 2-0.19088)*(Exogenous 3+0.07326) 0.192 0.035 5.52 <.0001
(Exogenous 2-0.19088)*(Exogenous 5-0.11786) 0.181 0.043 4.19 0.0003
(Exogenous 3+0.07326)*(Exogenous 5-0.11786) -0.209 0.038 -5.45 <.0001
(Exogenous 5-0.11786)*(Exogenous 6-0.07955) 0.178 0.030 5.88 <.0001
(Exogenous 8+0.13772)*(Exogenous 8+0.13772) 0.087 0.031 2.78 0.0105
(Exogenous 1+0.21142)*(Exogenous 9-0.32728) -0.412 0.048 -8.66 <.0001
(Exogenous 2-0.19088)*(Exogenous 9-0.32728) 0.198 0.044 4.51 0.0001
(Exogenous 5-0.11786)*(Exogenous 9-0.32728) 0.384 0.062 6.18 <.0001
(Exogenous 6-0.07955)*(Exogenous 10+0.03726) 0.183 0.036 5.05 <.0001
(Exogenous 7-0.23689)*(Exogenous 10+0.03726) 0.252 0.057 4.45 0.0002
(Exogenous 10+0.03726)*(Exogenous 10+0.03726) 0.202 0.027 7.38 <.0001
(Exogenous 2-0.19088)*(Exogenous 11+0.04288) -0.115 0.047 -2.46 0.0215
(Exogenous 6-0.07955)*(Exogenous 11+0.04288) 0.132 0.057 2.30 0.0304
(Exogenous 10+0.03726)*(Exogenous 12+0.18472) 0.263 0.046 5.69 <.0001
Visualization
The surface contour shows that there’s a lot of curvature in the
fit of the model, but unlike the curvature seen in several prior
examples, the data do not seem to show visual evidence of the
curvature.
No pair of predictors appears particularly predictive,
although the overall model is.
This plot shows the curvature of the prediction formula
using predictors 8 and 10 along the bottom.3
3 Save the prediction formula from your regression model. Then select Graphics > Surface Plot and fill the dialog for the variables with the prediction formula as well as the column that holds the response data. To produce such a plot, you need a recent version of JMP.
Common Sense Test: Hold back some data
Question
Is this fit an example of the ability of multiple regression to
find “hidden effects” that simpler models miss?
There’s no real substance to rely upon to find an
explanation for the model. We have more explanatory
variables than we can sensibly interpret.
Simple idea (cross-validation)
Reserve some data in order to test the model, such as the
next month of returns.
Fit model to a training/estimation sample, then predict
cases in test/validation sample.
Catch-22
How much to reserve, or set aside, for checking the model?
No clear-cut answer.
Save a little. This choice leaves too much variation in your
measure of how well the model has done. A model
might look good simply by chance. If we were to only
reserve, say, 5 cases to test the model, then it might
“get lucky” and predict these 5 well, simply by
chance.
Save a lot. This choice leaves too few cases available to
find good predictors. We end up with a good estimate
of the performance of a poor model. When trying to
improve a model or find complex effects, we’ll do
better with more data to identify the effects.
Example of Predicting Stocks
What happens in December?
The model that looks so good on paper flops miserably
when put to this simple test. The fitted equation predicts the
estimation cases remarkably well, but produces large
prediction errors when extended out-of-sample to the next
month.
Plot of the prediction errors.
Left: in-sample errors, residuals from the fitted model.
Right: out-of-sample errors in the forecast period.4
The residuals are small during the estimation period
(October – November), in contrast to the size of the errors
when the model is used to predict the returns on the NYSE
during December.
[Figure: prediction errors (vertical axis “Prediction Error”, roughly −4 to 5) plotted against Cal_Date from 20031001 through 20040101, with October, November, and December marked.]
This model has been over-fit, producing poor forecasts for
December. The usual summary statistics conceal the selection
process that was used to identify the model.
4 The horizontal gaps between the dots are the weekends or holidays.
What are those other predictors?
Random noise!
The 12 basic features X1, X2, … X12 that were called
“technical trading rules” are in fact columns of simulated
samples from normal distributions.5
Any model that uses these as predictors over-fits the data.
But the final model looks so good!
True, but the out-of-sample predictions show how poor it
is. A better prediction would be to use the average of the
historical data instead.
In this example, we know (because the “exogenous rules”
are simulated random noise) that the true coefficients for
these variables are all zero.
Why doesn’t the final overall F-ratio find the problem?
The standard test statistics work “once”, as if you
postulated one model before you saw the data.
Stepwise tries hundreds of variables before choosing these.
Finding a p-value less than 0.05 is not unusual if you look
at, say, 100 possible features. Among these, you’d expect
to find 5 whose p-value < 0.05 by chance alone.
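A small simulation makes the “5 out of 100” arithmetic concrete. The sketch below is my own illustration (the notes use JMP, not Python): it regresses a pure-noise response on each of 100 unrelated noise columns, one at a time, and counts how many p-values fall below 0.05 by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 42, 100                      # cases and candidate noise features
y = rng.normal(size=n)              # response: pure noise
X = rng.normal(size=(n, m))         # 100 unrelated noise features

# One-at-a-time simple regressions: pearsonr's p-value matches the
# t-test on the slope of a simple linear regression.
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(m)])
print((pvals < 0.05).sum())         # typically about 100 * 0.05 = 5 "significant" features
print((pvals < 0.05 / m).sum())     # with the Bonferroni threshold, usually 0
```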
Cannot let stepwise procedure add such variables
In this example, the first step picks the worst variable: one
that actually adds nothing but claims to do a lot.
The effect of adding this spurious predictor is to bias the
estimate of error variation. That is, the RMSE is now
smaller than it should be.
The bias inflates the t-statistics for every other feature.
5 Thereby giving away my opinion of many technical trading rules.
Source of the cascade
Suppose stepwise selection incorrectly picks a predictor
that it should not have, one for which β = 0.
The reason that it picks the wrong predictor is that, by
chance, this predictor explains a lot of variation (has a large
correlation with the response, here stock returns). The
predictor is useless out-of-sample but looks good within the
estimation sample.
As a result, the model looks better while at the same time
actually performing worse. The result is a biased estimate
of the amount of unexplained variation. RMSE gets
smaller when in fact the model fits worse; it should be
larger – not smaller – after adding this feature.
The biased RMSE, being too small, makes all of the other
features look better; t-statistics of features that are not in
the model suddenly get larger than they should be.
These inflated t-stats make it easier to add other useless
features to the model, forming a cascade as more spurious
predictors join the model. The EverReady bunny.
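The cascade is easy to reproduce in a simulation. The sketch below is an added illustration, not the course’s JMP analysis: it greedily adds whichever noise column most reduces the residual sum of squares, and the RMSE drifts downward even though every true coefficient is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, steps = 42, 90, 10
y = rng.normal(size=n)              # true model: y is noise with sigma = 1
X = rng.normal(size=(n, m))         # 90 candidate noise features

chosen = []
for step in range(steps):
    best_j, best_rss = None, np.inf
    for j in range(m):
        if j in chosen:
            continue
        # design matrix: intercept + previously chosen features + candidate j
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in chosen + [j]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = float(np.sum((y - A @ beta) ** 2))
        if rss < best_rss:
            best_j, best_rss = j, rss
    chosen.append(best_j)
    rmse = np.sqrt(best_rss / (n - len(chosen) - 1))
    print(f"step {step + 1:2d}: add X{best_j + 1:<3d} RMSE = {rmse:.3f}")
# RMSE typically falls well below 1 even though no feature is related to y:
# each spurious addition makes the remaining candidates look better still.
```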
Protection from Over-fitting
Many have been “burned” by using a method like stepwise
regression and over-fitting. A frequently-heard complaint:
“The model looked fine when we built it, but when we
rolled it out in the field it failed completely. Statistics is
useless. Lies, damn lies, statistics.”
Protections from over-fitting include the following:
(a) Avoid automatic methods
Sure, and why not use an abacus, slide rule, and normal
table while you’re at it? It’s not the computer per se, but
rather the shoddy way that we have used the automatic
search. The same concerns apply to tedious manual
searches as well.
(b) Arrogant: Stick to substantively-motivated predictors
Are you so confident that you know all there is to know
about which factors affect the response?
Particularly troubling when it comes to interactions.
Even so, you can use stepwise selection after picking a
model as a diagnostic. That is, use stepwise to learn
whether a substantively motivated model has missed
structure.
Start with a non-trivial substantively motivated model. It
should include the predictors that your knowledge of the
domain tells you belong. Then run stepwise to see whether
it finds other things that might be relevant.
(c) Cautious: Use a more stringent threshold
Add a feature only when the results are convincing that the
feature has a real effect, not a coincidence.
We can do this by using the Bonferroni rule. If you have a
list of m candidate features, then set “Prob to enter” =
0.05/m.
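For concreteness, here is a short sketch (my own, using a normal approximation rather than JMP’s internals) that turns a list of m candidate features into the Bonferroni “Prob to enter” and the corresponding rough cutoff on the t-ratio.

```python
import math
from scipy import stats

m = 90                                       # candidate features offered to stepwise
prob_to_enter = 0.05 / m                     # Bonferroni threshold, about 0.00056
# Two-sided cutoff on the t-ratio, using the normal approximation.
t_cutoff = stats.norm.ppf(1 - prob_to_enter / 2)
rule_of_thumb = math.sqrt(2 * math.log(m))   # the criteria table's rough guide
print(round(prob_to_enter, 5), round(t_cutoff, 2), round(rule_of_thumb, 2))
# ~0.00056, ~3.5, and 3.0: the rule of thumb is in the same general range.
```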
Bonferroni = Right Answer + Added Bonus
What happens in the stock example?
Set the Prob-to-enter threshold to 0.05 divided by m, the
number of features being considered.
In this example, the number of considered features is
12 “raw” + 12 “squares” + 12×11/2 “interactions” = 90
“Prob to enter” = 0.05/90 = .00056
Remove all of the predictors from the stepwise dialog,
change the “Prob to enter” field to 0.00056, and click go.6
The search finds the right answer: it adds nothing! No
predictor enters the model, and we’re left with a regression
with just an intercept.
None should be in the model; the “null model” is the truth.7
The “technical trading rules” used as predictors are random
noise, totally unrelated to the response.
Added bonus
The use of the Bonferroni rule for guiding the selection
process avoids the need to reserve a validation sample in
order to test your model and avoid over-fitting.
Just set the appropriate “Prob to enter” and use all of the
data to fit the model. A larger sample allows the modeling
to identify more subtle features that would otherwise be
missed.
6 JMP rounds the value input for p-to-enter that is shown in the box in the stepwise dialog, even though the underlying code will use the value that you have entered.
7 Some of the predictors in the stepwise model claim to have p-values that pass the Bonferroni rule. Once stepwise introduces noise into the regression, it can add more and more, and these look fine. You need to use Bonferroni before adding the variables, not after.
Other Applications of the Bonferroni Rule
You can (and generally should) use the Bonferroni rule in
other situations in regression as well.
Any time that you look at a collection of p-values to judge
statistical significance, consider using a Bonferroni
adjustment to the p-values.
Testing in multiple regression
Suppose you fit a multiple regression with 5 predictors.
No selection or stepwise, just fit the model with these
predictors.
How should you judge the statistical results?
Two-stage process
(1) Check the overall F-ratio, shown in the Anova summary
of the model. This tests whether the R2 of the model is
large given the number of predictors in the fitted model
and the number of observations.
(2) If the overall F-ratio is statistically significant, then
consider the individual t-statistics for the coefficients
using a Bonferroni rule for these.
Suppose the model as a whole is significant, and you have
moved to the individual slopes. If you are looking at p-
values of a model with 5 predictors, then compare them to
0.05/5 = 0.01 before you get excited about finding a
statistically significant effect.
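Here is a minimal sketch of that two-stage check (an added illustration with made-up data; statsmodels stands in for whatever regression software you prefer): look at the overall F test first, and only then compare each coefficient’s p-value to 0.05/5 = 0.01.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k = 100, 5
X = rng.normal(size=(n, k))                       # 5 predictors (illustrative data)
y = 1.0 + X @ np.array([0.5, 0, 0, 0, 0]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
if fit.f_pvalue < 0.05:                           # stage 1: overall F-ratio
    cutoff = 0.05 / k                             # stage 2: Bonferroni, 0.05/5 = 0.01
    slope_p = fit.pvalues[1:]                     # skip the intercept
    significant = np.where(slope_p < cutoff)[0] + 1
    print("Predictors passing the Bonferroni cutoff:", significant)
else:
    print("Overall F-test not significant; do not hunt through the t-statistics.")
```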
Tukey comparisons
The use of Tukey-Kramer comparisons among several
means is an alternative way to avoid claiming artificial
statistical significance in the specific case of comparing
many averages.
Detecting Over-fitting with a Validation Sample
Bonferroni is not always possible.
Some methods do not allow this type of control on over-
fitting because they do not offer p-values.
Reserve a validation sample
It is common in time series modeling to set aside future
data to check the predictions from your model. We did it
with the stocks without giving it much thought.8
Divide the data set into two batches, one for fitting the
model and the second for evaluating the model.
The validation sample should be “locked away”, excluded
from the modeling process, and certainly not “shown” to
the search procedure.
Software issues
JMP’s “Column Shuffle” command makes this separation
into two batches easy to do. For example:
This formula defines a column that labels a random sample
of 50 cases (rows) as validation cases, with the rest labeled
as estimation cases.9
Then use the “Exclude” & “Hide” commands from the
rows menu to set aside and conceal the validation cases.
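Outside JMP, the same bookkeeping can be sketched in a few lines. The code below is my own analogue of the Column Shuffle idea, not the JMP formula itself: it randomly labels 50 rows as validation cases and leaves the rest for estimation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_rows, n_validation = 200, 50                    # sizes are illustrative

df = pd.DataFrame({"row": np.arange(n_rows)})     # stand-in for the real data table
# Shuffle the row indices and label the first 50 as validation cases.
shuffled = rng.permutation(n_rows)
df["Role"] = "estimation"
df.loc[shuffled[:n_validation], "Role"] = "validation"

train = df[df["Role"] == "estimation"]            # fit the model on these rows only
test = df[df["Role"] == "validation"]             # "locked away" until it is time to check predictions
print(train.shape[0], test.shape[0])              # 150 and 50
```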
8 At some point with time series models, you won’t be able to set aside data. If you’re trying to predict tomorrow, do you really want to use a model built on data that is a month old?
9 Only 47 cases appear in the validation sample in the next example because it so happened that 3 excluded outliers fall among the validation cases.
Questions when using a validation sample
1. How many observations should I put into the validation
sample?
2. How can I use the validation sample to identify over-
fitting?
In the blocks example introduced in Module 7, we have n =
200 runs to build a model.10 That produces the following
paradox:
If we set aside, say, half for validation, then we’ll have a
hard time finding good predictors.
On the other hand, if we only set aside, say, 10 cases for
validation, these may be insufficient to give a
valid impression of how well the model has done. A fit
might do well on these 10 by chance.
Multi-fold cross-validation
A better alternative, if we had the software needed to
automate the process, repeats the validation process over
and over.
5-fold cross-validation:
Divide the data into 5 subsets, each with 20% of the cases.
Fit your model on 4 subsets, then predict the other. Do
this 5 times, each time omitting a different subset.
Accumulate the prediction errors.
Repeat!
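A minimal sketch of the 5-fold procedure (added for illustration; the sizes and the simple least-squares model are assumptions, not the block.jmp analysis): split the rows into five folds, fit on four, predict the fifth, and accumulate the prediction errors.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 3))                       # illustrative predictors
y = 2.0 + X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), 5)     # five subsets, each ~20% of the cases
errors = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    A = np.column_stack([np.ones(train_idx.size), X[train_idx]])
    beta, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    A_test = np.column_stack([np.ones(test_idx.size), X[test_idx]])
    errors.append(y[test_idx] - A_test @ beta)    # out-of-sample prediction errors

cv_rmse = np.sqrt(np.mean(np.concatenate(errors) ** 2))
print(round(cv_rmse, 3))                          # honest estimate of prediction error
```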
10 So, why not go back to the client and say “I need more data”? Getting data is expensive unless it’s already been captured in the system. Often, as in this example, the features for each run have to be found by manually searching back through records.
Controlling Stepwise with a Validation Sample (block.jmp)
Prior version of the cost-accounting model had 15 predictors
with an R2 of 69% and RMSE of $5.80.
Using the Bonferroni rule to control the stepwise search gives
the model shown on the next page…
It is hard to count how many predictors JMP can choose
from because categorical terms get turned into several
dummy variables. We can estimate m by counting the
number of “screens” needed to show the candidate features.
With m ≈ 385 features to consider, the Bonferroni threshold
for the “Prob to enter” criterion is
0.05/385 = 0.00013
The resulting model appears on the next page. It is more
parsimonious and does not claim the precision produced by the
prior search.
The model has 4 predictors, with R2 = 0.47, RMSE = $6.80
It also avoids weird variables like the type of music!
[Figure: Actual by Predicted plot of Ave_Cost Actual versus Ave_Cost Predicted, P<.0001, RSq=0.47, RMSE=6.8343.]
Summary of Fit
RSquare 0.465
RSquare Adj 0.454
Root Mean Square Error 6.834
Mean of Response 39.694
Observations (or Sum Wgts) 197.000
Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Model      4     7800.251         1950.06       41.7500   <.0001
Error      192   8967.959         46.71
C. Total   196   16768.210
Parameter Estimates
Term Estimate Std Error t Ratio Prob>|t|
Intercept 20.22 1.84 10.97 <.0001
Labor_hrs 38.68 4.17 9.27 <.0001
(Abstemp-4.6)*(Abstemp-4.6) 0.07 0.01 6.09 <.0001
(Cost_Kg-1.8)*(Materialcost-2.3) 0.86 0.15 5.69 <.0001
(Manager{J-R&L}+0.22)*(Brkdown/units-0.00634) -372.50 89.07 -4.18 <.0001
Leverage plots suggest that the model has found some
additional highly leveraged points that were not identified
previously.
What should we do about these?
What can we learn from these?
[Figure: leverage plot of Ave_Cost leverage residuals versus Abstemp*Abstemp leverage, P<.0001.]
[Figure: leverage plot of Ave_Cost leverage residuals versus Cost_Kg*Materialcost leverage, P<.0001.]
Visualization of the model reveals some of the structure of
the model.11 These plots are more interesting if you color-
code the points for old and new plants.
Do you see the two groups of points?
11 JMP will produce a surface plot only for models produced by Fit Model.
Back to Business
Allure of fancy tools
It is easy to become so enamored of fancy tools that you
may lose sight of the problem that you’re trying to solve.
The client wants a model that predicts the cost of a
production run.
We’ve now learned enough to be able to return to the client
with questions of our own. We’re doing much better than
the naïve initial model (5 predictors, R2 = 0.30 versus the
improved model with only 4 predictors yet higher R2 =
0.47).
What questions should you ask the client in order to
understand what’s been found by the model?
What are those leveraged outliers?
What’s up with temperature controls? Do these have the
same effect in both plants? (You’ll have to do some data
analysis to answer this one.)
What do you make of the categorical factor?
In other words…
Stepwise methods leave ample opportunity to exploit what
you know about the context… You can design more
sensible features to consider by using what you “know”
about the problem.
Ideally, by simplifying the search for additional predictors,
stepwise methods (or other search technologies) allow you
to have more time to think about the modeling problem.
Here are a few substantively motivated comments:
The features 1/Units and Breakdown/Units make more
sense (and are more interpretable) as ways of tracking
fixed costs.
Similarly, why use Cost/Kg when you can figure out the
material cost as the product cost/kg × weight?
Finally, make note of the so-called nesting of managers
within the different plants. Consider the following table:
Plant By Manager
Count   JEAN   LEE   PAT   RANDY   TERRY   Total
NEW       40     0     0       0      30      70
OLD        0    44    42      41       0     127
Total     40    44    42      41      30     197
Jean and Terry work in the new plant, with the others
working in the old plant. Can you compare Jean to Lee,
for example? Or does that amount to comparing the two
plants?
These two features, Manager and Plant, are confounded
and cannot be separated by this analysis. (We can,
however, compare Jean to Terry since they do work in
the same plant.)
Appendix: Bonferroni Method
The Bonferroni Inequality
The Bonferroni inequality (a.k.a., Boole’s inequality) gives a
simple upper bound for the probability of a union of events. If
you simply ignore the double counting, then it follows that
\[ P(E_1 \text{ or } E_2 \text{ or } \cdots \text{ or } E_m) \le \sum_{j=1}^{m} P(E_j) \]
In the special case that all of the events have equal probability
p = P(E_j), we get the special case
\[ P(E_1 \text{ or } E_2 \text{ or } \cdots \text{ or } E_m) \le m\,p \]
Use in Model Selection
In model selection for stepwise regression, we start with a list
of m possible features of the data that we consider for use in
the model. Often, this list will include interactions that we
want to have considered in the model, but are not really very
sure about.
If the list of possible predictors is large, then we need to avoid
“false positives”, adding a variable to the model that is not
actually helpful. Once the modeling begins to add unneeded
predictors, it tends to “cascade” by adding more and more.
We’ll avoid this by trying to never add a predictor that’s not
helpful.
Bonferroni Rule for p-values
Let the events E_1 through E_m denote errors in the modeling:
adding the jth variable when it actually does not affect the
response. The chance of making any error when we consider
all m of these is then
\[ P(\text{some false positive}) = P(E_1 \text{ or } E_2 \text{ or } \cdots \text{ or } E_m) \le m\,p \]
If we add a feature as a predictor in the model only if its
p-value is smaller than 0.05/m, say, then the chance of
incorrectly including a predictor is less than
\[ P(\text{some false positive}) \le m \cdot \frac{0.05}{m} = 0.05 \]
There’s only a 5% chance of making any mistake.
It’s really pretty good
Some would say that using this so-called “Bonferroni rule” is
too conservative: it makes it too hard to find useful predictors.
It’s actually not so bad.
(1) For example, suppose that we have m = 1000 possible
features to sort through. Then the Bonferroni rule says to only
add a feature if its p-value is smaller than 0.05/1000, 0.00005.
That seems really small at first, but convert it to a t-ratio.
How large (in absolute size) does the t-ratio need to be in
order for the p-value to be smaller than 0.00005? The answer
is about 4.6.
In other words, once the t-ratio is larger than around 5, a
model selection procedure will add the variable. A t-ratio
of 5 does not seem so unattainable. Sure, it requires a large
effect, but with so many possibilities, we need to be
careful.
(2) Another way to see that Bonferroni is pretty good is to put
a lower bound on the probability of a false positive. If all of
the events are independent, then
\[
\begin{aligned}
P(\text{some false positive}) &= 1 - P(\text{none}) \\
&= 1 - P(E_1^c \text{ and } E_2^c \text{ and } \cdots \text{ and } E_m^c) \\
&= 1 - P(E_1^c) \times P(E_2^c) \times \cdots \times P(E_m^c) \\
&= 1 - (1-p)^m \\
&= 1 - e^{m \log(1-p)} \\
&\ge 1 - e^{-m\,p}
\end{aligned}
\]
and the last step follows because log(1+x) ≤ x.
Combined with the Bonferroni inequality, we have (for
independent tests)
\[ 1 - e^{-m\,p} \le P(\text{some false positive}) \le m\,p \]
This table summarizes the implications. It shows that as m
grows and p gets smaller, the bounds from these inequalities
are really very tight.

m     p        m p    Bounds
50    0.01     0.50   0.39 – 0.50
50    0.005    0.25   0.22 – 0.25
100   0.0001   0.01   0.0095 – 0.0100
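These bounds are easy to check numerically. The short Python sketch below is an illustration I have added (it is not part of the original notes); it reproduces the three rows of the table by computing the exact false-positive probability for independent tests alongside the two bounds.

```python
import math

# Exact probability for independent tests, squeezed between the two bounds:
#   1 - exp(-m p)  <=  1 - (1 - p)^m  <=  m p
for m, p in [(50, 0.01), (50, 0.005), (100, 0.0001)]:
    exact = 1 - (1 - p) ** m
    lower = 1 - math.exp(-m * p)
    upper = m * p
    print(f"m={m:<4d} p={p:<7} lower={lower:.4f} exact={exact:.4f} upper={upper:.4f}")
```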