This document provides an overview of resampling methods, including the jackknife, bootstrap, permutation tests, and cross-validation. Resampling methods are used to approximate sampling distributions and to assess the reliability of parameter estimates when the true sampling distribution is difficult to derive. The document describes each resampling method, its applications, and its sampling procedure, and provides examples illustrating how permutation tests are conducted through permutation resampling.
3. Why do we need resampling?
• The purpose of statistics is to estimate parameters and assess their reliability. Since estimators are functions of the sample points, they are random variables. If we could find the distribution of this random variable (the sample statistic), then we could assess the reliability of the estimators.
• If we had the sampling distribution of a sample statistic, we could estimate the variance of the estimator, construct confidence intervals, and even test hypotheses.
4. Why do we need resampling?
• Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive.
• What is the sampling distribution of:
• The time since most recent common ancestor of all humans?
• The adjusted R-squared?
• The AIC?
• The beta coefficient when independence is violated?
• The number of connections in a neural net?
• The eigenvalues of PCA?
• A bifurcation point in a phylogenetic tree?
5. Why do we need resampling?
• Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive. There are several techniques to approximate these distributions, e.g., the Laplace approximation, which gives an analytical form for the approximate distribution. With the advent of computers, more computationally intensive methods have emerged, and in many cases they work satisfactorily.
6. Why do we need resampling?
• The t-distribution and chi-squared distribution are good
approximations for sufficiently large and/or normally-distributed
samples.
• However, when the data come from an unknown distribution or the sample size is small, resampling tests are recommended.
8. Resampling

Method           | Application                                                        | Sampling procedure used
-----------------|--------------------------------------------------------------------|------------------------
Bootstrap        | Standard deviation, confidence interval, hypothesis testing, bias  | Samples drawn at random, with replacement
Jackknife        | Standard deviation, confidence interval, bias                      | Samples consist of the full data set with one observation left out
Permutation      | Hypothesis testing                                                 | Samples drawn at random, without replacement
Cross-validation | Model validation                                                   | Data is randomly divided into two or more subsets, with results validated across sub-samples
10. Jackknife
The jackknife is used for bias removal. The mean-square error of an estimator equals the square of its bias plus its variance, so if the bias is much larger than the variance, then under some circumstances the jackknife can be used.
Description of the jackknife: Assume we have a sample of size n. We estimate some sample statistic using all the data, giving tn. Then, removing one point at a time, we estimate tn-1,i, where the subscript indicates the size of the sample and the index of the removed sample point. The new estimator is derived as:

  t'n = n tn − (n−1) t̄n−1, where t̄n−1 = (1/n) Σi=1..n tn−1,i

If the order of the bias of the statistic tn is O(n−1), then after the jackknife the order of the bias becomes O(n−2).
The variance is estimated using:

  V̂J = ((n−1)/n) Σi=1..n (tn−1,i − t̄n−1)²

This procedure can be applied iteratively, i.e. the jackknife can be applied again to the new estimator. The first application of the jackknife can reduce bias without changing the variance of the estimator, but second and higher-order applications can in general increase the variance of the estimator.
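These formulas can be sketched in a few lines. As a hypothetical example we take the plug-in (divide-by-n) variance as the biased statistic tn; for this particular statistic the jackknife correction happens to recover the unbiased sample variance exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(2.0, 50)  # hypothetical sample
n = len(x)

# Biased statistic t_n: the plug-in variance, which divides by n (bias O(1/n))
t_n = np.var(x)

# Leave-one-out estimates t_{n-1,i}
t_loo = np.array([np.var(np.delete(x, i)) for i in range(n)])
t_bar = t_loo.mean()

# Bias-corrected estimator: t'_n = n*t_n - (n-1)*t_bar
t_jack = n * t_n - (n - 1) * t_bar

# Jackknife variance estimate: V_J = ((n-1)/n) * sum_i (t_{n-1,i} - t_bar)^2
v_jack = (n - 1) / n * np.sum((t_loo - t_bar) ** 2)
```

Here `t_jack` equals `np.var(x, ddof=1)` to machine precision, illustrating the O(n−1) bias being removed.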
12. The bootstrap
• 1969 Simon publishes the bootstrap as an example in Basic Research
Methods in Social Science (the earlier pigfood example)
• 1979 Efron names and publishes the first paper on the bootstrap
13. Bootstrap (Nonparametric)
We have a random sample x = (x1, x2, …, xn) from an unknown PDF, F.
We want to estimate θ = t(F) based on x.
We calculate the estimate θ̂ = s(x) based on x.
We want to know how accurate θ̂ is.
14. Bootstrap (Nonparametric)
Notation:
Random sample: x = (x1, x2, …, xn)
Empirical distribution: F̂, which places mass 1/n at each observed data value.
Bootstrap sample: a random sample of size n drawn from F̂, denoted x* = (x1*, x2*, …, xn*)
Bootstrap replicate of θ̂: θ̂* = s(x*)
15. Bootstrap (Nonparametric)
Bootstrap steps:
1. Select a bootstrap sample x* = (x1*, x2*, …, xn*) consisting of n data values drawn with replacement from the original data set.
2. Evaluate θ̂* = s(x*) for the bootstrap sample.
3. Repeat steps 1 and 2 B times.
4. Estimate the standard error seF(θ̂) by the sample standard deviation of the B replications:

  SE_B = [ Σi=1..B (θ̂*i − θ̂*·)² / (B−1) ]^(1/2), where θ̂*· = (1/B) Σi=1..B θ̂*i
16. The Bootstrap
• A new pigfood ration is tested on twelve pigs,
with six-week weight gains as follows:
• 496 544 464 416 512 560 608 544 480 466 512 496
• Mean: 508 ounces (we wish to establish a confidence interval for the true mean)
17. The Classic Bootstrap
Draw simulated samples from a hypothetical universe that embodies all we know about the universe this sample came from: our sample, replicated an infinite number of times.
18. 1. Put the observed weight gains in a hat
2. Sample 12 with replacement
3. Record the mean
4. Repeat steps 2-3, say, 1000 times
5. Record the 5th and 95th percentiles (for a
90% confidence interval)
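The five steps above can be sketched as follows (the number of resamples and the random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
gains = np.array([496, 544, 464, 416, 512, 560, 608, 544, 480, 466, 512, 496])

# Steps 2-4: resample 12 gains with replacement and record the mean, 1000 times
boot_means = np.array([rng.choice(gains, size=12, replace=True).mean()
                       for _ in range(1000)])

# Step 5: the 5th and 95th percentiles give a 90% percentile interval
lo, hi = np.percentile(boot_means, [5, 95])
se = boot_means.std(ddof=1)  # bootstrap estimate of the SE of the mean
```

With this data the interval comes out roughly 24 ounces either side of the sample mean of 508.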
20. Parametric Bootstrap
Resampling makes no assumptions about the population distribution; the bootstrap covered thus far is a nonparametric bootstrap. If we have information about the population distribution, it can be used in resampling: instead of drawing from the sample, we draw from the fitted population distribution. For example, if we know that the population distribution is normal, we estimate its parameters using the sample mean and variance, approximate the population distribution with this fitted distribution, and use it to draw new samples.
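A minimal sketch of the parametric bootstrap, assuming a normal population and a hypothetical sample; here the target is the standard error of the sample median:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10.0, 3.0, 40)  # hypothetical sample

# Assume the population is normal; estimate its parameters from the sample,
# then draw bootstrap samples from the fitted normal instead of from the data
mu_hat, sigma_hat = x.mean(), x.std(ddof=1)
boot_medians = np.array([np.median(rng.normal(mu_hat, sigma_hat, x.size))
                         for _ in range(1000)])
se_median = boot_medians.std(ddof=1)  # parametric-bootstrap SE of the median
```

The only change from the nonparametric version is the sampling line: the fitted distribution replaces resampling with replacement from the data.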
21. Parametric Bootstrap
As expected, if the assumption about population
distribution is correct then the parametric
bootstrap will perform better than the
nonparametric bootstrap. If not correct, then the
nonparametric bootstrap will perform better.
22. Example of Bootstrap (Nonparametric)
Have test scores (out of 100) for two consecutive years
for each of 60 subjects. Want to obtain the
correlation between the test scores and the variance
of the correlation estimate.
Can use bootstrap to obtain the variance estimate.
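A sketch of this bootstrap; the actual 60 subjects' scores are not given in the slides, so the data below are a simulated stand-in. The key point is that whole (year 1, year 2) pairs are resampled together:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 60 subjects' scores in two consecutive years
year1 = rng.uniform(40, 100, 60)
year2 = 0.8 * year1 + rng.normal(0, 8, 60)

corrs = np.empty(1000)
for b in range(1000):
    idx = rng.integers(0, 60, 60)  # resample whole (year1, year2) pairs
    corrs[b] = np.corrcoef(year1[idx], year2[idx])[0, 1]

var_corr = corrs.var(ddof=1)  # bootstrap variance of the correlation estimate
```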
23. How many Bootstrap Replications, B?
A fairly small number, B=25, is sufficient to
be “informative” (Efron)
B=50 is typically sufficient to provide a crude
estimate of the SE, but B>200 is generally
used.
CIs require larger values of B, B no less than
500, with B=1000 recommended.
25. Permutation Tests
In classical hypothesis testing, we start with
assumptions about the underlying distribution and
then derive the sampling distribution of the test
statistic under H0.
In Permutation testing, the initial assumptions are
not needed (except exchangeability), and the
sampling distribution of the test statistic under H0
is computed by using permutations of the data.
26. Permutation Tests (example)
• The Permutation test is a technique that bases
inference on “experiments” within the observed
dataset.
• Consider the following example:
• In a medical experiment, rats are randomly
assigned to a treatment (Tx) or control (C) group.
• The outcome Xi is measured in the ith rat.
27. • Under H0, the outcome does not depend on whether
a rat carries the label Tx or C.
• Under H1, the outcome tends to be different, say larger, for rats labeled Tx.
• A test statistic T measures the difference in
observed outcomes for the two groups. T may be
the difference in the two group means (or medians),
denoted as t for the observed data.
Permutation Tests (example)
28. • Under H0, the individual labels of Tx and C are
unimportant, since they have no impact on the
outcome. Since they are unimportant, the label can
be randomly shuffled among the rats without
changing the joint null distribution of the data.
• Shuffling the data creates a “new” dataset. It has
the same rats, but with the group labels changed
so as to appear as if there were different group
assignments.
Permutation Tests (example)
29. • Let t be the value of the test statistic from the
original dataset.
• Let t1 be the value of the test statistic computed
from one dataset with permuted labels.
• Consider all M possible permutations of the labels,
obtaining the test statistics,
t1, …, tM.
• Under H0, t1, …, tM are all generated from the same
underlying distribution that generated t.
Permutation Tests (example)
30. Thus, t can be compared to the permuted data
test statistics, t1, …, tM , to test the hypothesis
and obtain a p-value or to construct confidence
limits for the statistic.
Permutation Tests (example)
32. Calculate the difference between the means of the
two observed samples – it’s 30.6 days in favor of the
treated mice.
Consider the two samples combined (16
observations) as the relevant universe to resample
from.
Permutation Tests (example)
33. Draw 7 hypothetical observations and designate
them "Treatment"; draw 9 hypothetical
observations and designate them "Control".
Compute and record the difference between the
means of the two samples.
Permutation Tests (example)
34. Repeat steps 3 and 4 perhaps 1000 times.
Determine how often the resampled difference exceeds
the observed difference of 30.6
Permutation Tests (example)
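The resampling steps above can be sketched directly. The survival times are not printed in these slides; the values below are the classic mouse data of Efron and Tibshirani, which reproduce the quoted 30.6-day difference, but should be treated as an assumption here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Mouse survival times (days): 7 treated, 9 control (assumed values, see above)
treated = np.array([94, 197, 16, 38, 99, 141, 23])
control = np.array([52, 104, 146, 10, 50, 31, 40, 27, 46])

observed = treated.mean() - control.mean()  # about 30.6 days

pooled = np.concatenate([treated, control])  # the combined "universe" of 16
n_iter = 10000
count = 0
for _ in range(n_iter):
    perm = rng.permutation(pooled)
    # Relabel: first 7 as "Treatment", remaining 9 as "Control"
    if perm[:7].mean() - perm[7:].mean() >= observed:
        count += 1
p_value = count / n_iter  # how often the resampled difference >= 30.6
```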
36. If the group means are truly equal, then shuffling the group labels will not have a big impact on the sum of the two groups (or the mean, with equal sample sizes). Some group sums will be larger than in the original data set and some will be smaller.
Permutation Tests (example)
37. Permutation Test Example 1
• 16!/(16−7)! = 57,657,600
• When the dataset is too large to enumerate all permutations, a large number of random permutations is selected.
• When all permutations are enumerated, this is an exact permutation test.
38. 1.1 What Are Permutation Tests?
• Permutation tests are significance tests based on permutation resamples drawn at random from the original data. Permutation resamples are drawn without replacement.
• Also called randomization tests, re-randomization tests, or exact tests.
• Introduced by R.A. Fisher and E.J.G. Pitman in the 1930s.
39. When Can We Use Permutation Tests?
• Only when we can see how to resample in a way that is consistent
with the study design and with the null hypothesis.
• If we cannot do a permutation test, we can often calculate a
bootstrap confidence interval instead.
40. Advantages
• Permutation tests exist for any test statistic, regardless of whether or not its distribution is known.
• We are free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses.
• Can be used for:
  - Analyzing unbalanced designs;
  - Combining dependent tests on mixtures of categorical, ordinal, and metric data.
41. Limitations
• An important assumption: the observations are exchangeable under the null hypothesis.
• Consequence: tests of difference in location (like a permutation t-test) require equal variance.
• In this respect, the permutation t-test shares the same weakness as the classical Student's t-test.
42. Procedure of Permutation Tests
I. Analyze the problem.
   - What are the hypothesis and the alternative?
   - What distribution is the data drawn from?
   - What losses are associated with bad decisions?
II. Choose a test statistic which will distinguish the hypothesis from the alternative.
III. Compute the test statistic for the original observations.
43. Procedure of Permutation Tests
IV. Rearrange the observations: compute the test statistic for all possible permutations (rearrangements) of the observations.
V. Make a decision: reject the hypothesis if the value of the test statistic for the original data is an extreme value in the permutation distribution of the statistic. Otherwise, accept the hypothesis and reject the alternative.
44. Permutation Resampling Process
1. Collect data from the Control and Treatment groups.
2. Merge the samples to form a pseudo-population (in the diagram: 5 7 8 4 1 6 8 9 7 5 9).
3. Sample without replacement from the pseudo-population to simulate new Control and Treatment groups (in the diagram: 5 7 1 5 9 and 8 4 6 8 9 7).
4. Compute the target statistic for each resample (here, the median of each group).
5. Compute the "difference statistic", save the result in a table, and repeat the resampling process for 1000+ iterations.
45. Example: "I Lost the Labels"
A physiology experiment to investigate the relationship between Vitamin E and "life extension" in human cell cultures.
46. Example: "I Lost the Labels"
• 6 petri dishes:
  • 3 containing standard medium
  • 3 containing standard medium + Vitamin E
• Without the labels, we have no way of knowing which cell cultures have been treated with Vitamin E and which have not.
• There are six results, "121, 118, 110, 34, 12, 22", one for each petri dish.
• Which numbers belong to which dishes?
49. Permutation Resampling Process
1. Collect data from the Control and Treatment groups.
2. Merge the samples to form a pseudo-population: 121 118 110 34 12 22.
3. Sample without replacement from the pseudo-population to simulate new Control and Treatment groups (in the diagram: {121, 118, 34} and {110, 12, 22}).
4. Compute the target statistic for each resample (here the group means, 91 and 48).
5. Compute the "difference statistic", save the result in a table, and repeat the resampling process for 1000+ iterations.
53. Conclusion
• Test decision: only 2 of the 20 permutations yield an absolute test statistic |t| ≥ 13.0875, the value obtained for the original labeling.
• We obtain the exact p-value p = 2/20 = 0.1.
• Note: if both groups have equal size, only half of the permutations is really needed (by symmetry).
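This exact test can be verified in a few lines, assuming (as the quoted t value suggests) that the original labeling treats the three largest results {121, 118, 110} as the Vitamin E dishes and that the statistic is a two-sample t with unpooled variances:

```python
from itertools import combinations
from statistics import mean, variance
import math

data = [121, 118, 110, 34, 12, 22]  # the six observed results

def t_stat(g1, g2):
    # Two-sample t statistic with unpooled (per-group) variances
    se = math.sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
    return (mean(g1) - mean(g2)) / se

t_obs = t_stat(data[:3], data[3:])  # assumed labeling: first three treated

extreme = 0
splits = list(combinations(range(6), 3))  # all 20 ways to choose the treated dishes
for chosen in splits:
    g1 = [data[i] for i in chosen]
    g2 = [data[i] for i in range(6) if i not in chosen]
    if abs(t_stat(g1, g2)) >= abs(t_obs) - 1e-9:
        extreme += 1

p_value = extreme / len(splits)  # exact p-value: 2/20 = 0.1
```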
54. What are resampling methods?
• Tools that involve repeatedly drawing samples from a training set
and refitting a model of interest on each sample in order to obtain
more information about the fitted model
• Model Assessment: estimate test error rates
• Model Selection: select the appropriate level of model flexibility
• They are computationally expensive! But these days we have
powerful computers
• Two resampling methods:
• Cross Validation
• Bootstrapping
IOM 530: Intro. to Statistical Learning 54
57. The Validation Set Approach
• Suppose that we would like to find a set of variables that give the
lowest test (not training) error rate
• If we have a large data set, we can achieve this goal by randomly
splitting the data into training and validation(testing) parts
• We would then use the training part to build each possible model (i.e.
the different combinations of variables) and choose the model that
gave the lowest error rate when applied to the validation data
58. Example: Auto Data
• Suppose that we want to predict mpg from horsepower
• Two models:
• mpg ~ horsepower
• mpg ~ horsepower + horsepower²
• Which model gives a better fit?
• Randomly split Auto data set into training (196 obs.) and validation data (196
obs.)
• Fit both models using the training data set
• Then, evaluate both models using the validation data set
• The model with the lowest validation (testing) MSE is the winner!
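The comparison above can be sketched as follows. The real Auto data set is not included here, so the data are a synthetic stand-in in which mpg really is a quadratic function of horsepower:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the Auto data (assumed, not the real data set):
# mpg generated as a noisy quadratic function of horsepower
hp = rng.uniform(50, 200, 392)
mpg = 40 - 0.25 * hp + 0.001 * hp ** 2 + rng.normal(0, 2, 392)

# Randomly split into training (196 obs.) and validation (196 obs.) parts
idx = rng.permutation(392)
train, val = idx[:196], idx[196:]

def val_mse(degree):
    # Fit mpg ~ polynomial(horsepower) on the training part,
    # then compute the MSE on the validation part
    coef = np.polyfit(hp[train], mpg[train], degree)
    return np.mean((mpg[val] - np.polyval(coef, hp[val])) ** 2)

mse_linear, mse_quadratic = val_mse(1), val_mse(2)
```

With this generating process the quadratic model achieves the lower validation MSE and is declared the winner.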
59. Results: Auto Data
• Left: Validation error rate for a single split
• Right: Validation method repeated 10 times, each time the split is
done randomly!
• There is a lot of variability among the MSE’s… Not good! We need
more stable methods!
60. Leave-One-Out Cross Validation (LOOCV)
• This method is similar to the Validation
Set Approach, but it tries to address the
latter’s disadvantages
• For each suggested model, do:
• Split the data set of size n into
• Training data set (blue) size: n -1
• Validation data set (beige) size: 1
• Fit the model using the training data
• Validate model using the validation data, and
compute the corresponding MSE
• Repeat this process n times
• The MSE for the model is computed as the average of the n held-out errors: CV(n) = (1/n) Σi=1..n MSEi
FIGURE 5.3. A schematic display of LOOCV. A set of n data points is repeatedly split into a training set (shown in blue) containing all but one observation, and a validation set that contains only that observation (shown in beige). The test error is then estimated by averaging the n resulting MSE's. The first training set contains all but observation 1, the second training set contains all but observation 2, and so forth.
observations, and a prediction ŷ1 is made for the excluded observation, using its value x1. Since (x1, y1) was not used in the fitting process, MSE1 = (y1 − ŷ1)² provides an approximately unbiased estimate for the test error. But even though MSE1 is unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation (x1, y1).
We can repeat the procedure by selecting (x2, y2) for the validation data, training the statistical learning procedure on the n − 1 observations {(x1, y1), (x3, y3), …, (xn, yn)}, and computing MSE2 = (y2 − ŷ2)². Repeating this approach n times produces n squared errors, MSE1, …, MSEn. The LOOCV estimate for the test MSE is the average of these n test error estimates:

  CV(n) = (1/n) Σi=1..n MSEi.   (5.1)

A schematic of the LOOCV approach is illustrated in Figure 5.3.
LOOCV has a couple of major advantages over the validation set approach. First, it has far less bias. In LOOCV, we repeatedly fit the statistical learning method using n − 1 observations.
61. LOOCV vs. the Validation Set Approach
• LOOCV has less bias
• We repeatedly fit the statistical learning method using training data that contains n-1
obs., i.e. almost all the data set is used
• LOOCV produces a less variable MSE
• The validation approach produces different MSE when applied repeatedly due to
randomness in the splitting process, while performing LOOCV multiple times will
always yield the same results, because we split based on 1 obs. each time
• LOOCV is computationally intensive (disadvantage)
• We fit each model n times!
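A minimal LOOCV sketch for a linear fit, using hypothetical data; each of the n fits holds out exactly one observation, and CV(n) is the average of the n squared errors:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)               # hypothetical data
y = 2.0 * x + 1.0 + rng.normal(0, 1, 30)
n = len(x)

sq_errors = []
for i in range(n):
    mask = np.arange(n) != i             # train on all points except i
    coef = np.polyfit(x[mask], y[mask], 1)
    pred = np.polyval(coef, x[i])        # validate on the held-out point
    sq_errors.append((y[i] - pred) ** 2)

cv_n = np.mean(sq_errors)  # CV(n): the average of the n MSE_i values
```

Note that no seed-dependent splitting occurs: rerunning LOOCV on the same data always yields the same estimate.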
62. k-fold Cross Validation
• LOOCV is computationally intensive, so we can run k-fold Cross Validation instead
• With k-fold Cross Validation, we divide the data set into K different parts (e.g. K = 5, or K
= 10, etc.)
• We then remove the first part, fit the model on the remaining K-1 parts, and see how
good the predictions are on the left out part (i.e. compute the MSE on the first part)
• We then repeat this K different times taking out a different part each time
• By averaging the K different MSE’s we get an estimated validation (test) error rate for
new observations
FIGURE 5.5. A schematic display of 5-fold CV. A set of n observations is
randomly split into five non-overlapping groups. Each of these fifths acts as a
validation set (shown in beige), and the remainder as a training set (shown in
blue). The test error is estimated by averaging the five resulting MSE estimates.
The magic formula (5.2) does not hold in general, in which case the model has to be refit n times.
5.1.3 k-Fold Cross-Validation
An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error, MSE1, MSE2, …, MSEk. The k-fold CV estimate is computed by averaging these values,

  CV(k) = (1/k) Σi=1..k MSEi.   (5.3)

Figure 5.5 illustrates the k-fold CV approach.
It is not hard to see that LOOCV is a special case of k-fold CV in which k = n.
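The k-fold procedure differs from the LOOCV sketch only in how the folds are formed; here is a 5-fold version on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)               # hypothetical data
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)
k = 5

# Randomly assign each observation to one of k roughly equal folds
folds = rng.permutation(np.arange(100) % k)

fold_mse = []
for j in range(k):
    train, val = folds != j, folds == j   # fold j is the validation set
    coef = np.polyfit(x[train], y[train], 1)
    fold_mse.append(np.mean((y[val] - np.polyval(coef, x[val])) ** 2))

cv_k = np.mean(fold_mse)  # CV(k): the average of the k MSE_i values
```

Setting k = n reduces this to LOOCV, at the cost of n model fits instead of k.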
64. Auto Data: LOOCV vs. K-fold CV
• Left: LOOCV error curve
• Right: 10-fold CV was run many times, and the figure shows the slightly different CV error
rates
• LOOCV is a special case of k-fold, where k = n
• They are both stable, but LOOCV is more computationally intensive!
65. Auto Data: Validation Set Approach vs. K-fold
CV Approach
• Left: Validation Set Approach
• Right: 10-fold Cross Validation Approach
• Indeed, 10-fold CV is more stable!
66. Bias-Variance Trade-off for k-fold CV
• Putting aside that LOOCV is more computationally intensive than k-fold CV: which is better, LOOCV or k-fold CV?
• LOOCV is less biased than k-fold CV (when k < n)
• But LOOCV has higher variance than k-fold CV (when k < n)
• Thus, there is a trade-off between the two
• Conclusion:
  • We tend to use k-fold CV with K = 5 or K = 10
  • These are the magical K's
  • It has been empirically shown that they yield test error rate estimates that suffer neither from excessively high bias nor from very high variance
67. Cross Validation on Classification Problems
• Cross validation can be used in a classification situation in a similar
manner
• Divide data into K parts
• Hold out one part, fit using the remaining data and compute the error rate on
the hold out data
• Repeat K times
• CV error rate is the average over the K errors we have computed
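The classification version swaps the MSE for a misclassification rate. A sketch on hypothetical two-class data, using a deliberately simple midpoint-threshold classifier (any classifier could be substituted):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical two-class data: one feature, class means at -1 and +1
x = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
k = 5
folds = rng.permutation(np.arange(200) % k)

error_rates = []
for j in range(k):
    train, val = folds != j, folds == j
    # Toy classifier: threshold at the midpoint of the two class means
    thresh = (x[train][y[train] == 0].mean() + x[train][y[train] == 1].mean()) / 2
    pred = (x[val] > thresh).astype(float)
    error_rates.append(np.mean(pred != y[val]))  # error rate on the held-out fold

cv_error = np.mean(error_rates)  # CV estimate of the classification error rate
```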