Fall 2009


Statistics 622
Module 8

Avoiding Over-Confidence
  OVERVIEW .............................................................. 2
  FROM PREVIOUS CLASSES… ................................................ 3
  OVER-FITTING .......................................................... 4
  AN EXAMPLE OF OVER-FITTING (NYSE_2003.JMP) ............................ 5
  VISUALIZATION ......................................................... 9
  COMMON SENSE TEST .................................................... 10
  EXAMPLE OF PREDICTING STOCKS ......................................... 11
  WHAT ARE THOSE OTHER PREDICTORS? ..................................... 12
  PROTECTION FROM OVER-FITTING ......................................... 13
  BONFERRONI = RIGHT ANSWER + ADDED BONUS .............................. 15
  OTHER APPLICATIONS OF THE BONFERRONI RULE ............................ 16
  DETECTING OVER-FITTING WITH A VALIDATION SAMPLE ...................... 17
  CONTROLLING STEPWISE WITH A VALIDATION SAMPLE (BLOCK.JMP) ............ 19
  BACK TO BUSINESS ..................................................... 23
  APPENDIX: BONFERRONI METHOD .......................................... 25
    THE BONFERRONI INEQUALITY .......................................... 25
    USE IN MODEL SELECTION ............................................. 25
    BONFERRONI RULE FOR P-VALUES ....................................... 26
    IT'S REALLY PRETTY GOOD ............................................ 26




Copyright Robert A Stine
Revised 10/8/09
Overview
     Stepwise models
        Select most predictive features from a list that you provide
        of candidate features, incrementally improving the fit of the
        model by as much as possible at each step.
        When automated, the search continues so long as the
        feature improves the model enough as gauged by its p-
        value.
     Over-fitting1
        If the search is allowed to choose predictors too “easily”,
        stepwise selection will identify predictors that ought not be
        in the model, producing an artificially good fit when in fact
        the model has been getting worse and worse.
     Bonferroni rule
        The Bonferroni rule lets us halt the search without having
        to set aside a validation sample, allowing us to use all the
        data for finding a predictive model rather than a subset.
        Though automatic, you should still use your knowledge of
        the context to offer more informed choices of features to
        consider for the modeling.




1
    For another example of over-fitting when modeling stock returns, see BAUR pages 220-227.

Statistics 622                             8-2                                      Fall 2009
From previous classes…
   Cost of uncertainty
      An accurate estimate of mean demand improves profits.
      Suggests that we should use more predictors in models,
      including more combinations of features that capture
      synergies among the features (interactions).
   Stepwise regression
      Automates the tedious process of working through the
      various interactions and other candidate features.
   Problem: Over-confidence
      The combination of
                   Desire for more accurate predictions
                                     +
               Automated searches that maximize fitted R2
      Creates the possibility that our predictions are not so
      accurate as we think.
       Over-fitting results when the modeling process leads us to
       build a model that captures random patterns in the data that
       will not be present in predicting new cases. The fit of the
       model looks better on paper than in reality.
   Other situations with over-confidence
      Subjective confidence intervals
       Winner’s curse in auctions
   Two methods for recognizing and avoiding over-fitting
      Bonferroni p-values, which do not require the use of a
      validation sample in order to test the model
      Cross-validation, which requires setting aside data to test
      the fit of a model.

Over-fitting
                 False optimism
                      Is your model as good as it claims? Or, has your hard work
                      to improve its fit to the data exaggerated its accuracy?

                      “Optimization capitalizes on chance.”

                      When we use the same data to both fit and evaluate a
                      model, we get an “optimistic” impression of how well the
                      model predicts. This process, which leads to an exaggerated
                      sense of accuracy, is known as over-fitting.
                     When a model has been over-fit, predictors that appear
                     significant from the output do not in fact improve the
                     model’s ability to predict the new cases.
                     Perhaps many of the predictors that are in a model have
                     arrived by chance alone because we have considered so
                     many possible models.
                 Over-fitting
                     Adding features to a model that improve its fit to the
                     observed data, but that degrade the ability of a model to
                     predict new cases.
                      Iterative refinement of a model (either manually or by an
                      automated algorithm) in order to improve the usual
                      summaries (e.g., R2 and p-values) typically generates a
                      better fit to the observed data used to pick the predictors
                      than will be had when predicting new data.
                     No good deed goes unpunished!
                 It’s the process, not the model
                     Over-fitting does not happen if we pick a large group of
                     predictors and simply fit one big model, without iteratively
                     trying to improve its fit.


An Example of Over-fitting                                      (nyse_2003.jmp)
   Stock market analysis
      Over-fitting is common in domains in which there is a lot
      of pressure to obtain accurate predictions, as in the case of
      predicting the direction of the stock market.
      Data: daily returns on the NYSE composite index in
      October and November 2003.
      Objective: Build a model to predict what will happen in
      December 2003, using a battery of 12 trading rules (labeled
      X1 to X12).
      These are a few very basic technical trading rules.
   Model selection criteria
       Many numerical criteria have been proposed as alternatives
       to maximizing R2 for judging the quality of a model.
       This table lists several well-known criteria. To use one of
       these in forward stepwise, control the forward search with
       the corresponding “Prob-to-enter” value.

           Name          Prob-to-Enter   Approximate t-stat   Idea
                                         for inclusion
           Adjusted R2        .33          |t| > 1            Decrease RMSE
           AIC, Cp            .16          |t| > √2           Unbiased estimate of
                                                              prediction accuracy
           BIC           Depends on n      |t| > √(log n)     Bayesian probability
           Bonferroni         1/m          |t| > √(2 log m)   Minimize worst-case,
                                                              family-wide error rate
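The Bonferroni cutoff in the last row of the table is easy to compute directly. A quick sketch in Python (the value m = 90 is taken from the stock example later in the module):

```python
import math

def bonferroni_t_cutoff(m):
    """Approximate |t| needed to enter when m candidate features are searched."""
    return math.sqrt(2 * math.log(m))

# With the m = 90 candidate features of the stock example:
print(round(bonferroni_t_cutoff(90), 2))  # 3.0, versus |t| > 1.41 for AIC
```

Notice how much more demanding this is than the AIC rule: roughly |t| > 3 rather than |t| > 1.41.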




Search domain for the example
       Consider the 12 exogenous features along with their
       squares and pairwise interactions. The total number of
       features available to stepwise is then
                  m = 12 + 12 + 12 × 11/2 = 24 + 66 = 90
    Wide data set
      There are 42 trading days in October and November. With
      interactions, we have more features than cases to use.
                             m = 90 > n = 42
      Hence we cannot fit the saturated model with all features.2
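A tiny simulation makes the danger concrete. This sketch (Python with NumPy; the sizes mirror the example, but the data are freshly simulated noise rather than the nyse_2003.jmp data) fits n parameters to n observations of pure noise and reports a "perfect" in-sample fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 42                            # trading days in the estimation sample
y = rng.normal(size=n)            # "returns": pure noise
X = rng.normal(size=(n, n - 1))   # 41 noise features; + intercept = n parameters

# Least squares with an intercept: n parameters fit n points exactly.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
print(round(r2, 4))  # 1.0: the saturated fit "explains" everything
```

The perfect R2 says nothing about predictive ability; the fitted model is noise explaining noise.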
    AIC criterion for forward search
      Set “Prob to Enter” = 0.16 and run the search forward.




       The stepwise search never stops!
       A greedy search becomes gluttonous when offered so many
       choices relative to the number of cases that are available.


2
  You can show that often the best model is the so-called “saturated” model that has every feature
included as a predictor. But you can fit it only when you have more cases than features,
typically at least 3 per predictor (a crude rule of thumb for the ratio n/m).

To avoid the cascade, make it harder to add a predictor;
   reducing the “Prob to enter” to 0.10 gives this result:




       The search stops after adding 20 predictors.

   Optionally, following a common convention, we can “clean up”
   the fit and make it appear more impressive by stepping
   backward to remove collinear predictors that are redundant.




       The backward elimination removes 3 predictors.


Make the model and obtain the usual summary.
     This “Summary of Fit” suggests a great model.
     Any diagnostic procedure that ignores how we chose the
     features to include in this model finds no problem. All
     conclude that this is a great-fitting model, one that is highly
     statistically significant.
     Look at all of the predictors whose p-value < 0.0001.
     These easily meet the Bonferroni threshold, when applied
     after the fact.
                               Summary of Fit
                   RSquare                          0.949
                   Root Mean Square Error           0.191
                   Mean of Response                 0.177
                   Observations (or Sum Wgts)          42
                            Analysis of Variance
        Source      DF    Sum of Squares Mean Square             F Ratio
        Model       17         16.214437        0.953790       26.1361
        Error       24          0.875838        0.036493       Prob > F
        C. Total    41         17.090274                        <.0001
                           Parameter Estimates
Term                                                 Est   Std Err t Ratio Prob>|t|
Intercept                                        -0.090     0.058 -1.56      0.1317
Exogenous 6                                       0.093     0.036     2.60   0.0156
Exogenous 9                                       0.256     0.046     5.59  <.0001
Exogenous 10                                      0.326     0.058     5.62  <.0001
(Exogenous 2-0.19088)*(Exogenous 3+0.07326)       0.192     0.035     5.52  <.0001
(Exogenous 2-0.19088)*(Exogenous 5-0.11786)       0.181     0.043     4.19   0.0003
(Exogenous 3+0.07326)*(Exogenous 5-0.11786)      -0.209     0.038 -5.45     <.0001
(Exogenous 5-0.11786)*(Exogenous 6-0.07955)       0.178     0.030     5.88  <.0001
(Exogenous 8+0.13772)*(Exogenous 8+0.13772)       0.087     0.031     2.78   0.0105
(Exogenous 1+0.21142)*(Exogenous 9-0.32728)      -0.412     0.048 -8.66     <.0001
(Exogenous 2-0.19088)*(Exogenous 9-0.32728)       0.198     0.044     4.51   0.0001
(Exogenous 5-0.11786)*(Exogenous 9-0.32728)       0.384     0.062     6.18  <.0001
(Exogenous 6-0.07955)*(Exogenous 10+0.03726)      0.183     0.036     5.05  <.0001
(Exogenous 7-0.23689)*(Exogenous 10+0.03726)      0.252     0.057     4.45   0.0002
(Exogenous 10+0.03726)*(Exogenous 10+0.03726)     0.202     0.027     7.38  <.0001
(Exogenous 2-0.19088)*(Exogenous 11+0.04288)     -0.115     0.047 -2.46      0.0215
(Exogenous 6-0.07955)*(Exogenous 11+0.04288)      0.132     0.057     2.30   0.0304
(Exogenous 10+0.03726)*(Exogenous 12+0.18472)     0.263     0.046     5.69  <.0001




Visualization
    The surface contour shows that there’s a lot of curvature in the
    fit of the model, but unlike the curvature seen in several prior
    examples, the data do not seem to show visual evidence of the
    curvature.
        No pair of predictors appears particularly predictive,
        although the overall model is.




       This plot shows the curvature of the prediction formula
       using predictors 8 and 10 along the bottom.3


3
  Save the prediction formula from your regression model. Then select Graphics > Surface Plot
and fill the dialog for the variables with the prediction formula as well as the column that holds
the response data. To produce such a plot, you need a recent version of JMP.

Common Sense Test: Hold-back some data
   Question
     Is this fit an example of the ability of multiple regression to
     find “hidden effects” that simpler models miss?
      There’s no real substance to rely upon to find an
      explanation for the model. We have more explanatory
      variables than we can sensibly interpret.
   Simple idea (cross-validation)
     Reserve some data in order to test the model, such as the
     next month of returns.
     Fit model to a training/estimation sample, then predict
     cases in test/validation sample.
   Catch-22
     How much to reserve, or set aside, for checking the model?
     No clear-cut answer.
     Save a little. This choice leaves too much variation in your
           measure of how well the model has done. A model
           might look good simply by chance. If we were to only
           reserve, say, 5 cases to test the model, then it might
           “get lucky” and predict these 5 well, simply by
           chance.
     Save a lot. This choice leaves too few cases available to
           find good predictors. We end up with a good estimate
           of the performance of a poor model. When trying to
           improve a model or find complex effects, we’ll do
           better with more data to identify the effects.




Example of Predicting Stocks
     What happens in December?
        The model that looks so good on paper flops miserably
        when put to this simple test. The fitted equation predicts the
        estimation cases remarkably well, but produces large
        prediction errors when extended out-of-sample to the next
        month.
     Plot of the prediction errors.
        Left: in-sample errors, residuals from the fitted model.
        Right: out-of-sample errors in the forecast period.4
        The residuals are small during the estimation period
        (October – November), in contrast to the size of the errors
        when the model is used to predict the returns on the NYSE
        during December.
         [Figure: prediction errors plotted against Cal_Date from 20031001
         through 20040101. Vertical axis: Prediction Error, roughly -4 to 5;
         the October, November, and December periods are marked.]

      This model has been over-fit, producing poor forecasts for
      December. The usual summary statistics conceal the selection
      process that was used to identify the model.

4
    The horizontal gaps between the dots are the weekends or holidays.

What are those other predictors?
     Random noise!
       The 12 basic features X1, X2, … X12 that were called
       “technical trading rules” are in fact columns of simulated
       samples from normal distributions.5
       Any model that uses these as predictors over-fits the data.
     But the final model looks so good!
       True, but the out-of-sample predictions show how poor it
       is. A better prediction would be to use the average of the
       historical data instead.
       In this example, we know (because the “exogenous rules”
       are simulated random noise) that the true coefficients for
       these variables are all zero.
     Why doesn’t the final overall F-ratio find the problem?
       The standard test statistics work “once”, as if you
       postulated one model before you saw the data.
       Stepwise tries hundreds of variables before choosing these.
       Finding a p-value less than 0.05 is not unusual if you look
       at, say, 100 possible features. Among these, you’d expect
       to find 5 whose p-value < 0.05 by chance alone.
     Cannot let stepwise procedure add such variables
        In this example, the first step picks the worst variable: one
        that actually adds nothing but claims to do a lot.
       The effect of adding this spurious predictor is to bias the
       estimate of error variation. That is, the RMSE is now
       smaller than it should be.
       The bias inflates the t-statistics for every other feature.
5
    Thereby giving away my opinion of many technical trading rules.

Source of the cascade
     Suppose stepwise selection incorrectly picks a predictor
     that it should not have, one for which β = 0.
     The reason that it picks the wrong predictor is that, by
     chance, this predictor explains a lot of variation (has a large
     correlation with the response, here stock returns). The
     predictor is useless out-of-sample but looks good within the
     estimation sample.
      As a result, the model looks better while actually
      performing worse: we are left with a biased estimate
      of the amount of unexplained variation. RMSE gets
      smaller when in fact the model fits worse; it should be
      larger, not smaller, after adding this feature.
     The biased RMSE, being too small, makes all of the other
     features look better; t-statistics of features that are not in
     the model suddenly get larger than they should be.
      These inflated t-stats make it easier to add other useless
      features to the model, forming a cascade as more spurious
      predictors join the model. Like the EverReady bunny, it
      keeps going and going.
Protection from Over-fitting
   Many have been “burned” by using a method like stepwise
   regression and over-fitting. A frequently-heard complaint:
      “The model looked fine when we built it, but when we
      rolled it out in the field it failed completely. Statistics is
      useless. Lies, damn lies, statistics.”
   Protections from over-fitting include the following:
     (a) Avoid automatic methods
       Sure, and why not use an abacus, slide rule, and normal
       table while you’re at it? It’s not the computer per se, but
       rather the shoddy way that we have used the automatic
       search. The same concerns apply to tedious manual
       searches as well.
     (b) Arrogant: Stick to substantively-motivated predictors
      Are you so confident that you know all there is to know
      about which factors affect the response?
      Particularly troubling when it comes to interactions.
      Even so, you can use stepwise selection after picking a
      model as a diagnostic. That is, use stepwise to learn
      whether a substantively motivated model has missed
      structure.
      Start with a non-trivial substantively motivated model. It
      should include the predictors that your knowledge of the
      domain tells you belong. Then run stepwise to see whether
      it finds other things that might be relevant.
     (c) Cautious: Use a more stringent threshold
      Add a feature only when the results are convincing that the
      feature has a real effect, not a coincidence.
      We can do this by using the Bonferroni rule. If you have a
      list of m candidate features, then set “Prob to enter” =
      0.05/m.
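The counting and the cutoff can be scripted. A sketch in Python (the counting of raw features, squares, and pairwise interactions matches the m = 90 used in the stock example):

```python
def bonferroni_prob_to_enter(p, alpha=0.05):
    """Prob-to-enter for p raw features plus squares and pairwise interactions."""
    m = p + p + p * (p - 1) // 2   # raw + squares + interactions
    return m, alpha / m

m, cutoff = bonferroni_prob_to_enter(12)
print(m, round(cutoff, 5))  # 90 0.00056
```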




Bonferroni = Right Answer + Added Bonus
    What happens in the stock example?
       Set the Prob-to-enter threshold to 0.05 divided by m, the
       number of features being considered.
      In this example, the number of considered features is
       12 “raw” + 12 “squares” + 12×11/2 “interactions”= 90
            “Prob to enter” = 0.05/90 = .00056
      Remove all of the predictors from the stepwise dialog,
      change the “Prob to enter” field to 0.00056, and click go.6
      The search finds the right answer: it adds nothing! No
      predictor enters the model, and we’re left with a regression
      with just an intercept.
    None should be in the model; the “null model” is the truth.7
      The “technical trading rules” used as predictors are random
      noise, totally unrelated to the response.
    Added bonus
      The use of the Bonferroni rule for guiding the selection
      process avoids the need to reserve a validation sample in
      order to test your model and avoid over-fitting.
      Just set the appropriate “Prob to enter” and use all of the
      data to fit the model. A larger sample allows the modeling
      to identify more subtle features that would otherwise be
      missed.


6
  JMP rounds the p-to-enter value shown in the box of the stepwise dialog, even though the
underlying code will use the exact value that you entered.
7
  Some of the predictors in the stepwise model claim to have p-values that pass the Bonferroni
rule. Once stepwise introduces noise into the regression, it can add more and more and these look
fine. You need to use Bonferroni before adding the variables, not after.

Other Applications of the Bonferroni Rule
   You can (and generally should) use the Bonferroni rule in
   other situations in regression as well.
      Any time that you look at a collection of p-values to judge
      statistical significance, consider using a Bonferroni
      adjustment to the p-values.
   Testing in multiple regression
      Suppose you fit a multiple regression with 5 predictors.
      No selection or stepwise, just fit the model with these
      predictors.
      How should you judge the statistical results?
   Two-stage process
      (1) Check the overall F-ratio, shown in the Anova summary
          of the model. This tests whether the R2 of the model is
          large given the number of predictors in the fitted model
          and the number of observations.
      (2) If the overall F-ratio is statistically significant, then
          consider the individual t-statistics for the coefficients
          using a Bonferroni rule for these.
      Suppose the model as a whole is significant, and you have
      moved to the individual slopes. If you are looking at p-
      values of a model with 5 predictors, then compare them to
      0.05/5 = 0.01 before you get excited about finding a
      statistically significant effect.
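In code, the second stage amounts to comparing each slope's p-value to alpha/k. A sketch in Python (the p-values below are hypothetical, not from any model in these notes):

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return which coefficients survive the Bonferroni cutoff alpha/k."""
    cutoff = alpha / len(p_values)
    return {name: p < cutoff for name, p in p_values.items()}

# Five hypothetical slope p-values; the cutoff is 0.05/5 = 0.01.
flags = bonferroni_flags({"x1": 0.003, "x2": 0.02, "x3": 0.30,
                          "x4": 0.009, "x5": 0.04})
print([name for name, ok in flags.items() if ok])  # ['x1', 'x4']
```

Note that x2 and x5 would have looked "significant" against the naive 0.05 threshold but do not survive the adjustment.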
   Tukey comparisons
      The use of Tukey-Kramer comparisons among several
      means is an alternative way to avoid claiming artificial
      statistical significance in the specific case of comparing
      many averages.

Detecting Over-fitting with a Validation Sample
    Bonferroni is not always possible.
      Some methods do not allow this type of control on over-
      fitting because they do not offer p-values.
    Reserve a validation sample
      It is common in time series modeling to set aside future
      data to check the predictions from your model. We did it
      with the stocks without giving it much thought.8
      Divide the data set into two batches, one for fitting the
      model and the second for evaluating the model.
       The validation sample should be “locked away,” excluded
       from the modeling process, and certainly not “shown” to
       the search procedure.
    Software issues
      JMP’s “Column Shuffle” command makes this separation
      into two batches easy to do. For example:



       This formula defines a column that labels a random sample
       of 50 cases (rows) as validation cases, with the rest labeled
       as estimation cases.9
       Then use the “Exclude” & “Hide” commands from the
       rows menu to set aside and conceal the validation cases.
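Outside of JMP, the same random labeling is easy to script. A sketch in Python (the row counts are illustrative; this mimics what the Column Shuffle formula does, not JMP's actual implementation):

```python
import random

def label_rows(n_rows, n_validation, seed=0):
    """Randomly label n_validation of n_rows as the hold-out sample."""
    random.seed(seed)
    held_out = set(random.sample(range(n_rows), n_validation))
    return ["validation" if i in held_out else "estimation"
            for i in range(n_rows)]

labels = label_rows(n_rows=200, n_validation=50)
print(labels.count("validation"), labels.count("estimation"))  # 50 150
```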


8
  At some point with time series models, you won’t be able to set aside data. If you’re trying to
predict tomorrow, do you really want to use a model fit to data that is a month old?
9
  Only 47 cases appear in the validation sample in the next example because it so happened that 3
excluded outliers fall among the validation cases.

Questions when using a validation sample
        1. How many observations should I put into the validation
        sample?
        2. How can I use the validation sample to identify
        over-fitting?
       In the blocks example introduced in Module 7, we have n =
       200 runs to build a model.10 That produces the following
       paradox:
          If we set aside, say, half for validation, then we’ll have a
          hard time finding good predictors.
           On the other hand, if we only set aside, say, 10 cases for
           validation, these may be insufficient to give a
           valid impression of how well the model has done. A fit
           might do well on these 10 by chance.
     Multi-fold cross-validation
       A better alternative, if we had the software needed to
       automate the process, repeats the validation process over
       and over.
       5-fold cross-validation:
             Divide data into subsets, each with 20% of the cases.
             Fit your model on 4 subsets, then predict the other. Do
             this 5 times, each time omitting a different subset.
             Accumulate the prediction errors.
             Repeat!
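The loop above is mechanical once the folds are formed. A minimal sketch in Python (to stay self-contained, the "model" here is just the training-sample mean rather than a stepwise regression, and the data are a made-up numeric column):

```python
import random

def five_fold_errors(y, seed=0):
    """Collect one out-of-fold prediction error per case."""
    idx = list(range(len(y)))
    random.seed(seed)
    random.shuffle(idx)
    folds = [idx[i::5] for i in range(5)]          # 5 subsets, ~20% each
    errors = []
    for k in range(5):
        train = [y[i] for j, f in enumerate(folds) if j != k for i in f]
        fit = sum(train) / len(train)              # "model": the training mean
        errors.extend(y[i] - fit for i in folds[k])
    return errors

errs = five_fold_errors([float(i % 7) for i in range(200)])
print(len(errs))  # one held-out error for each of the 200 cases
```

Because every case is predicted exactly once from a model that never saw it, the accumulated errors give an honest estimate of out-of-sample accuracy.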



10
   So, why not go back to the client and say “I need more data”? Getting data is expensive unless
it’s already been captured in the system. Often, as in this example, the features for each run have
to be found by manually searching back through records.

Controlling Stepwise with a Validation Sample               (block.jmp)

   Prior version of the cost-accounting model had 15 predictors
   with an R2 of 69% and RMSE of $5.80.
   Using the Bonferroni rule to control the stepwise search gives
   the model shown on the next page…
      It is hard to count how many predictors JMP can choose
      from because categorical terms get turned into several
      dummy variables. We can estimate m by counting the
      number of “screens” needed to show the candidate features.
      With m ≈ 385 features to consider, the Bonferroni threshold
      for the “Prob to enter” criterion is
                            0.05/385 = 0.00013




  The resulting model appears on the next page. It is more
  parsimonious and does not claim the precision produced by the
  prior search.
      The model has 4 predictors, with R2 = 0.47, RMSE = $6.80
  It also avoids weird variables like the type of music!

Actual by Predicted Plot

       [Figure: Ave_Cost Actual vs. Ave_Cost Predicted, both roughly
       20 to 80; P<.0001, RSq=0.47, RMSE=6.8343]

                                                 Summary of Fit
                     RSquare                                            0.465
                     RSquare Adj                                        0.454
                     Root Mean Square Error                             6.834
                     Mean of Response                                  39.694
                     Observations (or Sum Wgts)                       197.000
                                               Analysis of Variance
        Source              DF                 Sum of Squares    Mean Square        F Ratio
        Model                4                      7800.251         1950.06      41.7500
        Error              192                      8967.959           46.71      Prob > F
        C. Total           196                     16768.210                       <.0001
                                               Parameter Estimates
Term                                                 Estimate    Std Error   t Ratio   Prob>|t|
Intercept                                               20.22         1.84    10.97     <.0001
Labor_hrs                                               38.68         4.17      9.27    <.0001
(Abstemp-4.6)*(Abstemp-4.6)                              0.07         0.01      6.09    <.0001
(Cost_Kg-1.8)*(Materialcost-2.3)                         0.86         0.15      5.69    <.0001
(Manager{J-R&L}+0.22)*(Brkdown/units-0.00634)              -372.50       89.07      -4.18    <.0001




Statistics 622                                       8-20                                Fall 2009
Leverage plots suggest that the model has found some
   additional highly leveraged points that were not identified
   previously.
      What should we do about these?
      What can we learn from these?
                 [Leverage plot: Ave_Cost leverage residuals vs.
                  Abstemp*Abstemp leverage, P<.0001]


                 [Leverage plot: Ave_Cost leverage residuals vs.
                  Cost_Kg*Materialcost leverage, P<.0001]




Visualization of the model reveals some of the structure of
           the model.11 These plots are more interesting if you color-
           code the points for old and new plants.
           Do you see the two groups of points?




11
     JMP will produce a surface plot only for models produced by Fit Model.

Back to Business
   Allure of fancy tools
      It is easy to become so enamored by fancy tools that you
      may lose sight of the problem that you’re trying to solve.
      The client wants a model that predicts the cost of a
      production run.
      We’ve now learned enough to be able to return to the client
      with questions of our own. We’re doing much better than
      the naïve initial model (5 predictors, R2 = 0.30 versus the
      improved model with only 4 predictors yet higher R2 =
      0.47).
   What questions should you ask the client in order to
   understand what’s been found by the model?
      What are those leveraged outliers?
      What’s up with temperature controls? Do these have the
      same effect in both plants? (You’ll have to do some data
      analysis to answer this one.)
      What do you make of the categorical factor?
   In other words…
      Stepwise methods leave ample opportunity to exploit what
      you know about the context… You can design more
      sensible features to consider by using what you “know”
      about the problem.
      Ideally, by simplifying the search for additional predictors,
      stepwise methods (or other search technologies) allow you
      to have more time to think about the modeling problem.
      Here are a few substantively motivated comments:



The features 1/Units and Breakdown/Units make more
           sense (and are more interpretable) as ways of tracking
           fixed costs.
           Similarly, why use Cost/Kg when you can figure out the
           material cost as the product cost/kg × weight?
           Finally, make note of the so-called nesting of managers
           within the different plants. Consider the following table:
                                Plant By Manager
                  Count    JEAN   LEE   PAT   RANDY   TERRY   Total
                  NEW        40     0     0       0      30      70
                  OLD         0    44    42      41       0     127
                  Total      40    44    42      41      30     197
           Jean and Terry work in the new plant, with the others
           working in the old plant. Can you compare Jean to Lee,
           for example? Or does that amount to comparing the two
           plants?
           These two features, Manager and Plant, are confounded
           and cannot be separated by this analysis. (We can,
           however, compare Jean to Terry since they do work in
           the same plant.)
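The nesting is easy to verify with a cross-tabulation. Here is a minimal sketch in Python (the run-level data are a hypothetical reconstruction built from the cell counts in the table above, not the original data set):

```python
# Hypothetical reconstruction of the production runs from the counts in the
# Plant-by-Manager table: each (plant, manager) pair repeated by its count.
from collections import Counter

cells = {("NEW", "JEAN"): 40, ("NEW", "TERRY"): 30,
         ("OLD", "LEE"): 44, ("OLD", "PAT"): 42, ("OLD", "RANDY"): 41}
runs = [pm for pm, n in cells.items() for _ in range(n)]

# Cross-tabulate plant by manager.
table = Counter(runs)
total = sum(table.values())
print(total)  # 197 production runs

# Confounding check: how many distinct plants does each manager appear in?
plants = {}
for plant, mgr in runs:
    plants.setdefault(mgr, set()).add(plant)
max_plants = max(len(s) for s in plants.values())
print(max_plants)  # 1 => Manager is nested within Plant; effects confounded
```

Because every manager appears in exactly one plant, no regression on these data can separate a Manager effect from a Plant effect.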




Appendix: Bonferroni Method
The Bonferroni Inequality
   The Bonferroni inequality (a.k.a., Boole’s inequality) gives a
   simple upper bound for the probability of a union of events. If
   you simply ignore the double counting, then it follows that
                     P(E1 or E2 or … or Em) ≤ Σ_{j=1}^{m} P(Ej)

   If all of the events have the same probability p = P(Ej), this
   reduces to the special case

                     P(E1 or E2 or … or Em) ≤ m·p
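As a quick sanity check (a simulation sketch, not part of the original derivation), one can generate m events and compare the observed probability of the union with the bound m·p:

```python
# Simulate m independent events, each with probability p, and compare the
# chance that at least one occurs with the Bonferroni bound m*p.
import random

random.seed(0)
m, p, trials = 20, 0.01, 200_000
hits = sum(any(random.random() < p for _ in range(m)) for _ in range(trials))
p_union = hits / trials

print(round(p_union, 3))   # close to 1 - (1-p)**m, about 0.182
print(m * p)               # the Bonferroni bound: 0.2
```

The bound holds for dependent events as well; independence is used here only to make the simulated union probability easy to predict.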
Use in Model Selection
   In model selection for stepwise regression, we start with a list
            €
   of m possible features of the data that we consider for use in
   the model. Often, this list will include interactions that we
   want to have considered in the model, but are not really very
   sure about.
   If the list of possible predictors is large, then we need to avoid
   “false positives”, adding a variable to the model that is not
   actually helpful. Once the modeling begins to add unneeded
   predictors, it tends to “cascade” by adding more and more.
   We’ll avoid this by trying to never add a predictor that’s not
   helpful.




Bonferroni Rule for p-values
   Let the events E1 through Em denote errors in the modeling,
   adding the jth variable when it actually does not affect the
   response. The chance for making any error when we consider
   all m of these is then
            P(some false positive) = P(E1 or E2 or … or Em) ≤ m·p

  If we add a feature as a predictor in the model only if its p-
  value is smaller than 0.05/m, say, then the chance of
  incorrectly including a predictor is less than

                P(some false positive) ≤ m × (0.05/m) = 0.05

  There’s only a 5% chance of making any mistake.
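To see the rule in action, here is a small simulation (a sketch: the p-values of useless predictors are modeled as uniform on [0, 1], which is how a valid test behaves under the null hypothesis):

```python
# m useless predictors per model search; count how often the search would
# admit at least one, with and without the Bonferroni threshold 0.05/m.
import random

random.seed(1)
m, trials = 100, 20_000
naive = corrected = 0
for _ in range(trials):
    smallest_p = min(random.random() for _ in range(m))
    naive += smallest_p < 0.05          # usual 5% cutoff
    corrected += smallest_p < 0.05 / m  # Bonferroni cutoff
print(round(naive / trials, 3))      # near 1 - 0.95**100, about 0.994
print(round(corrected / trials, 3))  # stays below 0.05, as promised
```

Without the correction, a search over 100 useless predictors almost always admits at least one; with it, false inclusions become rare, which is exactly what prevents the "cascade" described above.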
It’s really pretty good
   Some would say that using this so-called “Bonferroni rule” is
   too conservative: it makes it too hard to find useful predictors.
   It’s actually not so bad.
   (1) For example, suppose that we have m = 1000 possible
   features to sort through. Then the Bonferroni rule says to add a
   feature only if its p-value is smaller than 0.05/1000 = 0.00005.
   That seems really small at first, but convert it to a t-ratio.
   How large (in absolute size) does the t-ratio need to be in
   order for the p-value to be smaller than 0.00005? For a two-
   sided test, the answer is about 4.1.
       In other words, once the t-ratio is larger than around 4, a
       model selection procedure will add the variable. A t-ratio
       of 4 does not seem so unattainable. Sure, it requires a large


effect, but with so many possibilities, we need to be
       careful.
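The threshold t-ratio can be computed directly (a sketch using the standard-normal approximation to the t distribution, which is accurate at the large residual degrees of freedom typical of these regressions):

```python
# Two-sided critical value for the Bonferroni threshold 0.05/m, m = 1000.
from statistics import NormalDist

m = 1000
alpha = 0.05 / m                         # 0.00005 per test
z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 tail point
print(round(z, 2))                       # about 4.06
```

With a small residual degrees of freedom the exact t critical value would be somewhat larger, but the message is the same: the hurdle is a t-ratio in the 4-to-5 range, not something unattainable.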
    (2) Another way to see that Bonferroni is pretty good is to put
    a lower bound on the probability of a false positive. If all of
    the events are independent, then
            P(some false positive) = 1 − P(none)
                                   = 1 − P(E1^c and E2^c and … and Em^c)
                                   = 1 − P(E1^c) × P(E2^c) × … × P(Em^c)
                                   = 1 − (1 − p)^m
                                   = 1 − e^(m log(1−p))
                                   ≥ 1 − e^(−m·p)
  and the last step follows because log(1+x) ≤ x (here with
  x = −p, so m·log(1−p) ≤ −m·p).
  Combined with the Bonferroni inequality, we have (for
  independent tests)
                 1 − e^(−m·p) ≤ P(some false positive) ≤ m·p
  This table summarizes the implications. It shows that as m
  grows and p gets smaller, the bounds from these inequalities
  are really very tight.

                   m       p       mp               Bounds
                  50    0.01      0.50            0.39 – 0.50
                  50    0.005     0.25            0.22 – 0.25
                  100   0.0001    0.01         0.0095 – 0.0100
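The table is easy to reproduce (a sketch: here `exact` denotes the independent-events probability 1 − (1−p)^m, which is sandwiched by the two bounds):

```python
# Verify 1 - e^{-mp} <= P(some false positive) <= m*p for the (m, p)
# pairs in the table, assuming independent tests.
from math import exp

for m, p in [(50, 0.01), (50, 0.005), (100, 0.0001)]:
    exact = 1 - (1 - p) ** m   # probability under independence
    lower = 1 - exp(-m * p)    # 1 - e^{-mp}
    upper = m * p              # Bonferroni bound
    assert lower <= exact <= upper
    print(f"m={m:3d}  p={p:<6}  bounds {lower:.4f} - {upper:.4f}")
```

Running this reproduces the three rows above and confirms that as m·p shrinks the two bounds pinch the true probability very tightly.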




 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 

L08 Over Fitting

  • 1. Fall 2009
Statistics 622, Module 8
Avoiding Over-Confidence

Overview ... 2
From Previous Classes ... 3
Over-fitting ... 4
An Example of Over-fitting (nyse_2003.jmp) ... 5
Visualization ... 9
Common Sense Test ... 10
Example of Predicting Stocks ... 11
What Are Those Other Predictors? ... 12
Protection from Over-fitting ... 13
Bonferroni = Right Answer + Added Bonus ... 15
Other Applications of the Bonferroni Rule ... 16
Detecting Over-fitting with a Validation Sample ... 17
Controlling Stepwise with a Validation Sample (block.jmp) ... 19
Back to Business ... 23
Appendix: Bonferroni Method ... 25
The Bonferroni Inequality ... 25
Use in Model Selection ... 25
Bonferroni Rule for p-values ... 26
It's Really Pretty Good ... 26

Copyright Robert A. Stine. Revised 10/8/09.
  • 2. Overview

Stepwise models: Select the most predictive features from a list of candidates that you provide, incrementally improving the fit of the model by as much as possible at each step. When automated, the search continues so long as the next feature improves the model enough, as gauged by its p-value.

Over-fitting:1 If the search is allowed to choose predictors too "easily", stepwise selection will identify predictors that ought not be in the model, producing an artificially good fit when in fact the model has been getting worse and worse.

Bonferroni rule: The Bonferroni rule lets us halt the search without having to set aside a validation sample, allowing us to use all the data for finding a predictive model rather than a subset. Though automatic, you should still use your knowledge of the context to offer more informed choices of features to consider for the modeling.

1 For another example of over-fitting when modeling stock returns, see BAUR pages 220-227.

Statistics 622 8-2 Fall 2009
  • 3. From previous classes…

Cost of uncertainty: An accurate estimate of mean demand improves profits. This suggests that we should use more predictors in models, including combinations of features that capture synergies among the features (interactions).

Stepwise regression: Automates the tedious process of working through the various interactions and other candidate features.

Problem: over-confidence. The combination of
- a desire for more accurate predictions, plus
- automated searches that maximize the fitted R2
creates the possibility that our predictions are not as accurate as we think. Over-fitting results when the modeling process leads us to build a model that captures random patterns in the data that will not be present when predicting new cases. The fit of the model looks better on paper than in reality.

Other situations with over-confidence: subjective confidence intervals; the winner's curse in auctions.

Two methods for recognizing and avoiding over-fitting:
- Bonferroni p-values, which do not require the use of a validation sample in order to test the model.
- Cross-validation, which requires setting aside data to test the fit of a model.

Statistics 622 8-3 Fall 2009
  • 4. Over-fitting

False optimism: Is your model as good as it claims? Or has your hard work to improve its fit to the data exaggerated its accuracy?

"Optimization capitalizes on chance." When we use the same data to both fit and evaluate a model, we get an "optimistic" impression of how well the model predicts. This process that leads to an exaggerated sense of accuracy is known as over-fitting. When a model has been over-fit, predictors that appear significant in the output do not in fact improve the model's ability to predict new cases. Perhaps many of the predictors in the model have arrived by chance alone, because we have considered so many possible models.

Over-fitting: Adding features to a model that improve its fit to the observed data but degrade its ability to predict new cases. Iterative refinement of a model (either manually or by an automated algorithm) in order to improve the usual summaries (e.g., R2 and p-values) typically produces a better fit to the observed data used to pick the predictors than will be seen when predicting new data. No good deed goes unpunished!

It's the process, not the model: Over-fitting does not happen if we pick a large group of predictors and simply fit one big model, without iteratively trying to improve its fit.

Statistics 622 8-4 Fall 2009
  • 5. An Example of Over-fitting (nyse_2003.jmp)

Stock market analysis: Over-fitting is common in domains in which there is a lot of pressure to obtain accurate predictions, as in the case of predicting the direction of the stock market.

Data: daily returns on the NYSE composite index in October and November 2003.

Objective: Build a model to predict what will happen in December 2003, using a battery of 12 trading rules (labeled X1 to X12). These are a few very basic technical trading rules.

Model selection criteria: Many numerical criteria have been proposed as alternatives to maximizing R2 for judging the quality of a model. This table lists several well-known criteria. To use them in forward stepwise, control the forward search with these "Prob to enter" values.

Name         Prob-to-Enter   Approx. t-stat for inclusion   Idea
Adjusted R2  .33             |t| > 1                        Decrease RMSE
AIC, Cp      .16             |t| > √2                       Unbiased estimate of prediction accuracy
BIC          Depends on n    |t| > ½ log n                  Bayesian probability
Bonferroni   1/m             |t| > √(2 log m)               Minimize worst-case, family-wide error rate

Statistics 622 8-5 Fall 2009
  • 6. Search domain for the example

Consider interactions among the 12 exogenous features. The total number of features available to stepwise is then
m = 12 raw + 12 squares + 12 × 11/2 interactions = 24 + 66 = 90

Wide data set: There are 42 trading days in October and November. With interactions, we have more features than cases to use: m = 90 > n = 42. Hence we cannot fit the saturated model with all the features.2

AIC criterion for forward search: Set "Prob to Enter" = 0.16 and run the search forward. The stepwise search never stops! A greedy search becomes gluttonous when offered so many choices relative to the number of cases that are available.

2 You can show that often the best-fitting model is the so-called "saturated" model that has every feature included as a predictor. But you can only fit this when you have more cases than features, typically at least 3 per predictor (a crude rule of thumb for the ratio n/m).

Statistics 622 8-6 Fall 2009
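The feature count and the Bonferroni threshold used later in these notes follow from a short calculation. A quick sketch in Python (not part of the original JMP workflow):

```python
# Candidate features for the stepwise search over 12 exogenous variables:
# the raw features, their squares, and all pairwise interactions.
n_raw = 12
n_squares = n_raw                          # one squared term per feature
n_interactions = n_raw * (n_raw - 1) // 2  # 12 * 11 / 2 = 66 pairs

m = n_raw + n_squares + n_interactions
print(m)                                   # 90

# Bonferroni "Prob to enter" threshold used later in the notes:
print(round(0.05 / m, 5))                  # 0.00056
```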
  • 7. To avoid the cascade, make it harder to add a predictor; reducing the "Prob to enter" to 0.10 gives this result: the search stops after adding 20 predictors.

Optionally, following a common convention, we can "clean up" the fit and make it appear more impressive by stepping backward to remove collinear predictors that are redundant. The backward elimination removes 3 predictors.

Statistics 622 8-7 Fall 2009
  • 8. Make the model and obtain the usual summary. This "Summary of Fit" suggests a great model. Any diagnostic procedure that ignores how we chose the features to include in this model finds no problem. All conclude that this is a great-fitting model, one that is highly statistically significant. Look at all of the predictors whose p-value < 0.0001; these easily meet the Bonferroni threshold, when applied after the fact.

Summary of Fit
RSquare                      0.949
Root Mean Square Error       0.191
Mean of Response             0.177
Observations (or Sum Wgts)   42

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
Model      17   16.214437        0.953790      26.1361   <.0001
Error      24   0.875838         0.036493
C. Total   41   17.090274

Parameter Estimates
Term                                            Est      Std Err   t Ratio   Prob>|t|
Intercept                                       -0.090   0.058     -1.56     0.1317
Exogenous 6                                     0.093    0.036     2.60      0.0156
Exogenous 9                                     0.256    0.046     5.59      <.0001
Exogenous 10                                    0.326    0.058     5.62      <.0001
(Exogenous 2-0.19088)*(Exogenous 3+0.07326)     0.192    0.035     5.52      <.0001
(Exogenous 2-0.19088)*(Exogenous 5-0.11786)     0.181    0.043     4.19      0.0003
(Exogenous 3+0.07326)*(Exogenous 5-0.11786)     -0.209   0.038     -5.45     <.0001
(Exogenous 5-0.11786)*(Exogenous 6-0.07955)     0.178    0.030     5.88      <.0001
(Exogenous 8+0.13772)*(Exogenous 8+0.13772)     0.087    0.031     2.78      0.0105
(Exogenous 1+0.21142)*(Exogenous 9-0.32728)     -0.412   0.048     -8.66     <.0001
(Exogenous 2-0.19088)*(Exogenous 9-0.32728)     0.198    0.044     4.51      0.0001
(Exogenous 5-0.11786)*(Exogenous 9-0.32728)     0.384    0.062     6.18      <.0001
(Exogenous 6-0.07955)*(Exogenous 10+0.03726)    0.183    0.036     5.05      <.0001
(Exogenous 7-0.23689)*(Exogenous 10+0.03726)    0.252    0.057     4.45      0.0002
(Exogenous 10+0.03726)*(Exogenous 10+0.03726)   0.202    0.027     7.38      <.0001
(Exogenous 2-0.19088)*(Exogenous 11+0.04288)    -0.115   0.047     -2.46     0.0215
(Exogenous 6-0.07955)*(Exogenous 11+0.04288)    0.132    0.057     2.30      0.0304
(Exogenous 10+0.03726)*(Exogenous 12+0.18472)   0.263    0.046     5.69      <.0001

Statistics 622 8-8 Fall 2009
  • 9. Visualization

The surface contour shows that there's a lot of curvature in the fit of the model, but unlike the curvature seen in several prior examples, the data do not seem to show visual evidence of it. No pair of predictors appears particularly predictive, although the overall model is. This plot shows the curvature of the prediction formula using predictors 8 and 10 along the bottom.3

3 Save the prediction formula from your regression model. Then select Graphics > Surface Plot and fill the dialog for the variables with the prediction formula as well as the column that holds the response data. To produce such a plot, you need a recent version of JMP.

Statistics 622 8-9 Fall 2009
  • 10. Common Sense Test: Hold back some data

Question: Is this fit an example of the ability of multiple regression to find "hidden effects" that simpler models miss? There's no real substance to rely upon to find an explanation for the model. We have more explanatory variables than we can sensibly interpret.

Simple idea (cross-validation): Reserve some data in order to test the model, such as the next month of returns. Fit the model to a training/estimation sample, then predict the cases in a test/validation sample.

Catch-22: How much to reserve, or set aside, for checking the model? There is no clear-cut answer.
- Save a little: This choice leaves too much variation in your measure of how well the model has done. A model might look good simply by chance. If we were to reserve, say, only 5 cases to test the model, then it might "get lucky" and predict these 5 well, simply by chance.
- Save a lot: This choice leaves too few cases available to find good predictors. We end up with a good estimate of the performance of a poor model. When trying to improve a model or find complex effects, we'll do better with more data to identify the effects.

Statistics 622 8-10 Fall 2009
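For the stock example the split is chronological: fit on October and November, reserve December. A minimal sketch in Python, with hypothetical trading-day indices standing in for the actual data:

```python
# Chronological hold-out split: the model never sees the reserved month.
days = list(range(63))        # hypothetical trading-day indices
train = days[:42]             # October-November: 42 trading days for fitting
test = days[42:]              # December: reserved to evaluate predictions

# The validation cases stay "locked away" from the search procedure.
assert not set(train) & set(test)
print(len(train), len(test))  # 42 21
```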
  • 11. Example of Predicting Stocks

What happens in December? The model that looks so good on paper flops miserably when put to this simple test. The fitted equation predicts the estimation cases remarkably well, but produces large prediction errors when extended out-of-sample to the next month.

[Figure: plot of the prediction errors against Cal_Date. Left: in-sample errors, residuals from the fitted model. Right: out-of-sample errors in the forecast period.4 The residuals are small during the estimation period (October – November), in contrast to the size of the errors when the model is used to predict the returns on the NYSE during December.]

This model has been over-fit, producing poor forecasts for December. The usual summary statistics conceal the selection process that was used to identify the model.

4 The horizontal gaps between the dots are the weekends or holidays.

Statistics 622 8-11 Fall 2009
  • 12. What are those other predictors?

Random noise! The 12 basic features X1, X2, …, X12 that were called "technical trading rules" are in fact columns of simulated samples from normal distributions.5 Any model that uses these as predictors over-fits the data.

But the final model looks so good! True, but the out-of-sample predictions show how poor it is. A better prediction would be to use the average of the historical data instead. In this example, we know (because the "exogenous rules" are simulated random noise) that the true coefficients for these variables are all zero.

Why doesn't the final overall F-ratio find the problem? The standard test statistics work "once", as if you postulated one model before you saw the data. Stepwise tries hundreds of variables before choosing these. Finding a p-value less than 0.05 is not unusual if you look at, say, 100 possible features. Among these, you'd expect to find 5 whose p-value is < 0.05 by chance alone.

We cannot let the stepwise procedure add such variables. In this example, the first step picks the worst variable: one that actually adds nothing but claims to do a lot. The effect of adding this spurious predictor is to bias the estimate of error variation; that is, the RMSE is now smaller than it should be. The bias inflates the t-statistics for every other feature.

5 Thereby giving away my opinion of many technical trading rules.

Statistics 622 8-12 Fall 2009
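The "expect 5 of 100 by chance" claim is easy to check by simulation, since under a true null hypothesis a p-value is uniformly distributed on [0, 1]. A sketch (the seed and trial count are arbitrary choices, not from the course):

```python
import random

# Screen m = 100 useless features at the 0.05 level, many times over,
# and count how many clear the bar by chance alone in each screening.
random.seed(8)
m, trials = 100, 2000

counts = [sum(random.random() < 0.05 for _ in range(m)) for _ in range(trials)]
average_false_positives = sum(counts) / trials
print(average_false_positives)   # close to m * 0.05 = 5
```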
  • 13. Source of the cascade

Suppose stepwise selection incorrectly picks a predictor that it should not have, one for which β = 0. The reason it picks the wrong predictor is that, by chance, this predictor explains a lot of variation (has a large correlation with the response, here stock returns). The predictor is useless out-of-sample but looks good within the estimation sample. As a result, the model looks better while at the same time actually performing worse.

The result is a biased estimate of the amount of unexplained variation. RMSE gets smaller when in fact the model fits worse; it should be larger, not smaller, after adding this feature. The biased RMSE, being too small, makes all of the other features look better: t-statistics of features that are not in the model suddenly get larger than they should be. These inflated t-stats make it easier to add other useless features, forming a cascade as more spurious predictors join the model (the EverReady bunny).

Protection from Over-fitting

Many have been "burned" by using a method like stepwise regression and over-fitting. A frequently heard complaint: "The model looked fine when we built it, but when we rolled it out in the field it failed completely. Statistics is useless. Lies, damn lies, statistics."

Protections from over-fitting include the following:

(a) Avoid automatic methods. Sure, and why not use an abacus, slide rule, and normal table while you're at it? It's not the computer per se, but

Statistics 622 8-13 Fall 2009
  • 14. rather the shoddy way that we have used the automatic search. The same concerns apply to tedious manual searches as well.

(b) Arrogant: Stick to substantively motivated predictors. Are you so confident that you know all there is to know about which factors affect the response? This is particularly troubling when it comes to interactions. Even so, you can use stepwise selection after picking a model as a diagnostic. That is, use stepwise to learn whether a substantively motivated model has missed structure. Start with a non-trivial, substantively motivated model; it should include the predictors that your knowledge of the domain tells you belong. Then run stepwise to see whether it finds other things that might be relevant.

(c) Cautious: Use a more stringent threshold. Add a feature only when the results are convincing that the feature has a real effect, not a coincidence. We can do this by using the Bonferroni rule: if you have a list of m candidate features, then set "Prob to enter" = 0.05/m.

Statistics 622 8-14 Fall 2009
  • 15. Bonferroni = Right Answer + Added Bonus

What happens in the stock example? Set the Prob-to-enter threshold to 0.05 divided by m, the number of features being considered. In this example, the number of considered features is
12 "raw" + 12 "squares" + 12 × 11/2 "interactions" = 90
"Prob to enter" = 0.05/90 = 0.00056

Remove all of the predictors from the stepwise dialog, change the "Prob to enter" field to 0.00056, and click Go.6 The search finds the right answer: it adds nothing! No predictor enters the model, and we're left with a regression with just an intercept. None should be in the model; the "null model" is the truth.7 The "technical trading rules" used as predictors are random noise, totally unrelated to the response.

Added bonus: The use of the Bonferroni rule for guiding the selection process avoids the need to reserve a validation sample in order to test your model and avoid over-fitting. Just set the appropriate "Prob to enter" and use all of the data to fit the model. A larger sample allows the modeling to identify more subtle features that would otherwise be missed.

6 JMP rounds the value input for p-to-enter that is shown in the box in the stepwise dialog, even though the underlying code will use the value that you have entered.

7 Some of the predictors in the stepwise model claim to have p-values that pass the Bonferroni rule. Once stepwise introduces noise into the regression, it can add more and more, and these look fine. You need to use Bonferroni before adding the variables, not after.

Statistics 622 8-15 Fall 2009
  • 16. Other Applications of the Bonferroni Rule

You can (and generally should) use the Bonferroni rule in other situations in regression as well. Any time you look at a collection of p-values to judge statistical significance, consider using a Bonferroni adjustment to the p-values.

Testing in multiple regression: Suppose you fit a multiple regression with 5 predictors; no selection or stepwise, just fit the model with these predictors. How should you judge the statistical results?

Two-stage process:
(1) Check the overall F-ratio, shown in the ANOVA summary of the model. This tests whether the R2 of the model is large given the number of predictors in the fitted model and the number of observations.
(2) If the overall F-ratio is statistically significant, then consider the individual t-statistics for the coefficients using a Bonferroni rule. Suppose the model as a whole is significant, and you have moved on to the individual slopes. If you are looking at the p-values of a model with 5 predictors, then compare them to 0.05/5 = 0.01 before you get excited about finding a statistically significant effect.

Tukey comparisons: The use of Tukey-Kramer comparisons among several means is an alternative way to avoid claiming artificial statistical significance in the specific case of comparing many averages.

Statistics 622 8-16 Fall 2009
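The two-stage check can be sketched with made-up p-values (the numbers below are illustrative, not from any fitted model):

```python
# Stage 1: overall F-test p-value from the ANOVA table (hypothetical value).
overall_f_pvalue = 0.002

# Stage 2: individual slope p-values, judged against 0.05 / (number of slopes).
slope_pvalues = [0.030, 0.008, 0.400, 0.011, 0.0004]   # 5 predictors
threshold = 0.05 / len(slope_pvalues)                  # = 0.01

significant = []
if overall_f_pvalue < 0.05:                            # proceed only if stage 1 passes
    significant = [p for p in slope_pvalues if p < threshold]

print(significant)   # [0.008, 0.0004]: only these clear the Bonferroni bar
```

Note that 0.030 and 0.011 would look "significant" at the naive 0.05 level but fail the adjusted threshold.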
  • 17. Detecting Over-fitting with a Validation Sample

Bonferroni is not always possible. Some methods do not allow this type of control on over-fitting because they do not offer p-values.

Reserve a validation sample: It is common in time series modeling to set aside future data to check the predictions from your model; we did it with the stocks without giving it much thought.8 Divide the data set into two batches, one for fitting the model and the second for evaluating it. The validation sample should be "locked away": excluded from the modeling process, and certainly not "shown" to the search procedure.

Software issues: JMP's "Column Shuffle" command makes this separation into two batches easy to do. For example, a formula can define a column that labels a random sample of 50 cases (rows) as validation cases, with the rest labeled as estimation cases.9 Then use the "Exclude" and "Hide" commands from the Rows menu to set aside and conceal the validation cases.

8 At some point with time series models, you won't be able to set aside data. If you're trying to predict tomorrow, do you really want to use a model built to data that is a month older?

9 Only 47 cases appear in the validation sample in the next example because it so happened that 3 excluded outliers fall among the validation cases.

Statistics 622 8-17 Fall 2009
  • 18. Questions when using a validation sample

1. How many observations should I put into the validation sample?
2. How can I use the validation sample to identify over-fitting?

In the blocks example introduced in Module 7, we have n = 200 runs to build a model.10 That produces the following paradox: if we set aside, say, half for validation, then we'll have a hard time finding good predictors. On the other hand, if we set aside only, say, 10 cases for validation, these may be insufficient to give a valid impression of how well the model has done. A fit might do well on these 10 by chance.

Multi-fold cross-validation: A better alternative, if we had the software needed to automate the process, repeats the validation process over and over. 5-fold cross-validation: divide the data into subsets, each with 20% of the cases. Fit your model on 4 subsets, then predict the other. Do this 5 times, each time omitting a different subset. Accumulate the prediction errors. Repeat!

10 So, why not go back to the client and say "I need more data"? Getting data is expensive unless it's already been captured in the system. Often, as in this example, the features for each run have to be found by manually searching back through records.

Statistics 622 8-18 Fall 2009
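The 5-fold recipe above can be sketched over row indices; the model fitting itself is left as a placeholder (in practice the rows would be shuffled before forming folds):

```python
# 5-fold cross-validation over n = 200 rows: five disjoint subsets of 40.
n_rows, k = 200, 5
folds = [list(range(i, n_rows, k)) for i in range(k)]

all_errors = []
for held_out in range(k):
    test_rows = folds[held_out]
    train_rows = [r for j in range(k) if j != held_out for r in folds[j]]
    # Fit the model on train_rows, predict test_rows, and accumulate the
    # prediction errors; a zero placeholder stands in for the fitting step.
    all_errors.extend(0.0 for _ in test_rows)

print(len(all_errors))   # 200: every row is predicted exactly once
```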
  • 19. Controlling Stepwise with a Validation Sample (block.jmp)

The prior version of the cost-accounting model had 15 predictors with an R2 of 69% and RMSE of $5.80. Using the Bonferroni rule to control the stepwise search gives the model shown on the next page.

It is hard to count how many predictors JMP can choose from, because categorical terms get turned into several dummy variables. We can estimate m by counting the number of "screens" needed to show the candidate features. With m ≈ 385 features to consider, the Bonferroni threshold for the "Prob to enter" criterion is 0.05/385 = 0.00013.

The resulting model is more parsimonious and does not claim the precision produced by the prior search. The model has 4 predictors, with R2 = 0.47 and RMSE = $6.80. It also avoids weird variables like the type of music!

Statistics 622 8-19 Fall 2009
  • 20. [Figure: Actual by Predicted plot, Ave_Cost actual vs. predicted; P<.0001, RSq=0.47, RMSE=6.8343]

Summary of Fit
RSquare                      0.465
RSquare Adj                  0.454
Root Mean Square Error       6.834
Mean of Response             39.694
Observations (or Sum Wgts)   197

Analysis of Variance
Source     DF    Sum of Squares   Mean Square   F Ratio   Prob > F
Model      4     7800.251         1950.06       41.7500   <.0001
Error      192   8967.959         46.71
C. Total   196   16768.210

Parameter Estimates
Term                                            Estimate   Std Error   t Ratio   Prob>|t|
Intercept                                       20.22      1.84        10.97     <.0001
Labor_hrs                                       38.68      4.17        9.27      <.0001
(Abstemp-4.6)*(Abstemp-4.6)                     0.07       0.01        6.09      <.0001
(Cost_Kg-1.8)*(Materialcost-2.3)                0.86       0.15        5.69      <.0001
(Manager{J-R&L}+0.22)*(Brkdown/units-0.00634)   -372.50    89.07       -4.18     <.0001

Statistics 622 8-20 Fall 2009
  • 21. Leverage plots suggest that the model has found some additional highly leveraged points that were not identified previously. What should we do about these? What can we learn from them?

[Figure: Ave_Cost leverage residuals vs. Abstemp*Abstemp leverage, P<.0001]

[Figure: Ave_Cost leverage residuals vs. Cost_Kg*Materialcost leverage, P<.0001]

Statistics 622 8-21 Fall 2009
  • 22. Visualization of the model reveals some of its structure.11 These plots are more interesting if you color-code the points for old and new plants. Do you see the two groups of points?

11 JMP will produce a surface plot only for models produced by Fit Model.

Statistics 622 8-22 Fall 2009
  • 23. Back to Business

Allure of fancy tools: It is easy to become so enamored of fancy tools that you lose sight of the problem you're trying to solve. The client wants a model that predicts the cost of a production run. We've now learned enough to be able to return to the client with questions of our own. We're doing much better than the naïve initial model (5 predictors, R2 = 0.30, versus the improved model with only 4 predictors yet a higher R2 = 0.47).

What questions should you ask the client in order to understand what's been found by the model?
- What are those leveraged outliers?
- What's up with the temperature controls? Do these have the same effect in both plants? (You'll have to do some data analysis to answer this one.)
- What do you make of the categorical factor?

In other words, stepwise methods leave ample opportunity to exploit what you know about the context. You can design more sensible features to consider by using what you "know" about the problem. Ideally, by simplifying the search for additional predictors, stepwise methods (or other search technologies) allow you more time to think about the modeling problem. Here are a few substantively motivated comments:

Statistics 622 8-23 Fall 2009
  • 24. The features 1/Units and Breakdown/Units make more sense (and are more interpretable) as ways of tracking fixed costs. Similarly, why use Cost/Kg when you can figure the material cost as the product cost/kg × weight?

Finally, note the so-called nesting of managers within the different plants. Consider the following table of counts, Plant by Manager:

Plant   JEAN   LEE   PAT   RANDY   TERRY   Total
NEW     40     0     0     0       30      70
OLD     0      44    42    41      0       127
Total   40     44    42    41      30      197

Jean and Terry work in the new plant, with the others working in the old plant. Can you compare Jean to Lee, for example? Or does that amount to comparing the two plants? These two features, Manager and Plant, are confounded and cannot be separated by this analysis. (We can, however, compare Jean to Terry, since they do work in the same plant.)

Statistics 622 8-24 Fall 2009
  • 25. Appendix: Bonferroni Method

The Bonferroni Inequality

The Bonferroni inequality (a.k.a. Boole's inequality) gives a simple upper bound for the probability of a union of events. If you simply ignore the double counting, then it follows that

P(E1 or E2 or … or Em) ≤ Σ_{j=1}^{m} P(Ej)

In the special case that all of the events have equal probability p = P(Ej), we get

P(E1 or E2 or … or Em) ≤ m p

Use in Model Selection

In model selection for stepwise regression, we start with a list of m possible features of the data that we consider for use in the model. Often, this list will include interactions that we want to have considered in the model but are not really very sure about. If the list of possible predictors is large, then we need to avoid "false positives": adding a variable to the model that is not actually helpful. Once the modeling begins to add unneeded predictors, it tends to "cascade" by adding more and more. We'll avoid this by trying never to add a predictor that's not helpful.

Statistics 622 8-25 Fall 2009
  • 26. Bonferroni Rule for p-values

Let the events E1 through Em denote errors in the modeling: Ej is the error of adding the jth variable when it actually does not affect the response. The chance of making any error when we consider all m of these is then

P(some false positive) = P(E1 or E2 or … or Em) ≤ m p

If we add a feature as a predictor in the model only if its p-value is smaller than 0.05/m, say, then the chance of incorrectly including a predictor is less than 0.05:

P(some false positive) ≤ m × (0.05/m) = 0.05

There's only a 5% chance of making any mistake.

It's really pretty good

Some would say that using this so-called "Bonferroni rule" is too conservative: it makes it too hard to find useful predictors. It's actually not so bad.

(1) For example, suppose that we have m = 1000 possible features to sort through. Then the Bonferroni rule says to add a feature only if its p-value is smaller than 0.05/1000 = 0.00005. That seems really small at first, but convert it to a t-ratio. How large (in absolute size) does the t-ratio need to be in order for the p-value to be smaller than 0.00005? The answer is about 4.6. In other words, once the t-ratio is larger than around 5, a model selection procedure will add the variable. A t-ratio of 5 does not seem so unattainable. Sure, it requires a large

Statistics 622 8-26 Fall 2009
  • 27. effect, but with so many possibilities, we need to be careful.

(2) Another way to see that Bonferroni is pretty good is to put a lower bound on the probability of a false positive. If all of the events are independent, then

P(some false positive) = 1 − P(none)
  = 1 − P(E1^c and E2^c and … and Em^c)
  = 1 − P(E1^c) × P(E2^c) × … × P(Em^c)
  = 1 − (1 − p)^m
  = 1 − e^(m log(1−p))
  ≥ 1 − e^(−m p)

where the last step follows because log(1+x) ≤ x. Combined with the Bonferroni inequality, we have (for independent tests)

1 − e^(−m p) ≤ P(some false positive) ≤ m p

This table summarizes the implications. It shows that as m grows and p gets smaller, the bounds from these inequalities become very tight.

m     p        m p    Bounds
50    0.01     0.50   0.39 – 0.50
50    0.005    0.25   0.22 – 0.25
100   0.0001   0.01   0.0095 – 0.0100

Statistics 622 8-27 Fall 2009
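The rows of the bounds table can be reproduced directly. A quick check in Python (not part of the original notes):

```python
import math

# For independent tests, P(some false positive) lies between
# 1 - exp(-m*p) (lower bound) and m*p (Bonferroni upper bound).
rows = [(50, 0.01), (50, 0.005), (100, 0.0001)]
bounds = []
for m, p in rows:
    lower = 1 - math.exp(-m * p)
    upper = m * p
    bounds.append((round(lower, 5), round(upper, 5)))

print(bounds)   # [(0.39347, 0.5), (0.2212, 0.25), (0.00995, 0.01)]
```

These match the table above: roughly 0.39 to 0.50, 0.22 to 0.25, and 0.0095 to 0.0100.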