BOOTSTRAPPING IN
STATISTICS
PRESENTATION BY – SUVAM DAS
Contents
o What is Bootstrapping?
o Why Bootstrapping Statistics?
o When Bootstrapping Statistics?
o What is Bootstrapping Statistics?
o Replacement and Sampling
o How Bootstrapping Statistics Works?
o Bootstrapping Example
o Confidence Interval
o Example of Using Bootstrapping
o How well does Bootstrapping work?
o For which sample statistics?
o Differences between Bootstrapping and
Traditional Hypothesis Testing
o When Bootstrapping isn’t useful?
o Applications of Bootstrapping Method
o Advantages of Bootstrapping Statistics
o Limitations of Bootstrapping Statistics
o Common Misconceptions
o Conclusion
o References
05-01-2024 | Bootstrapping in Statistics - Suvam Das
What is Bootstrapping?
 The term "bootstrapping" came from the phrase "to
pull oneself up by one's bootstraps". The phrase was
used in the 18th and 19th centuries to refer to an
impossible task. For example, "He left school at 15 and
pulled himself up by his bootstraps to ultimately head up
a company".
 The phrase is believed to trace back to the German author Rudolf Erich Raspe, whose character Baron Munchausen pulled himself out of a swamp by his own hair.
Why Bootstrapping Statistics?
 Statistical inference generally relies on the sampling distribution and
the standard error of the feature of interest. The traditional approach, or large
sample approach, draws one sample of size n from the population, and that
sample is used to calculate population estimates to then make inferences on. In
reality, only one sample has been observed.
 The theory states that, under certain conditions such as large sample sizes, the sampling distribution will be approximately normal, and the standard deviation of this distribution will equal the standard error. But what happens if the sample size is not sufficiently large? Then it can’t necessarily be assumed that the theoretical sampling distribution is normal, which makes it difficult to determine the standard error of the estimate and harder to draw reasonable conclusions from the data.
When Bootstrapping Statistics?
 The sample size is small
 The population distribution is unknown
 The statistic of interest is complex or non-standard
 There is no analytical form or asymptotic theory to help estimate
the distribution of the statistics of interest
 The data distribution is messy or irregular (e.g., heavy tails or outliers)
 In the context of statistics and data science, bootstrapping means something more specific, and entirely possible. Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of that population, using replacement during the sampling process. This relates back to the original phrase because the sample relies only on smaller samples of itself to make calculations and draw conclusions for the larger population.
 “Bootstrapping is a statistical procedure that resamples a single data set to
create many simulated samples. This process allows for the calculation of
standard errors, confidence intervals, and hypothesis testing,” according to
a post on bootstrapping statistics from statistician Jim Frost.
What is Bootstrapping Statistics?
Replacement and Sampling
 The process of sampling with replacement is used in statistics and probability,
as well as in various algorithms and simulations. When one samples with replacement, the probability of selecting a particular item remains the same for each draw, as each draw is independent of the others.
 Sampling with replacement is a method of selecting items from a dataset in
which each selection is replaced before the next one is made. In other words, an
item can be selected more than once in the sampling process. This is in contrast
to sampling without replacement, where each item can be selected only once.
 Sampling with replacement is commonly used in bootstrap resampling methods; it is also used in Monte Carlo simulations and various other statistical and computational techniques.
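The distinction can be seen in a few lines of Python (a minimal sketch; the colored-ball data is illustrative, not from the slides):

```python
import random

random.seed(0)  # fixed seed so the draws are reproducible

box = ["red", "blue", "green", "yellow", "purple"]

# With replacement: each draw is independent, so the same ball can repeat.
with_replacement = random.choices(box, k=5)

# Without replacement: every ball can be selected at most once.
without_replacement = random.sample(box, k=5)

print(with_replacement)
print(without_replacement)
```

`random.choices` draws with replacement and `random.sample` without; drawing 5 of 5 without replacement simply returns a permutation of the box.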
Replacement and Sampling
Imagine we have 5 different colored balls in a box. We first randomly pick a ball, replicate it, and move the replica to another box. Then we randomly pick another ball and do the same as before. We continue until the other box has the same number of balls as the original one. This is called sampling with replacement.
How Bootstrapping Statistics Works?
In the bootstrapping approach, a sample of size n is drawn from the population. Let’s call this sample S. Then, rather than using theory to determine all possible estimates, the sampling distribution is created by resampling observations with replacement from S m times, with each resampled set having n observations. Now, if sampled appropriately, S should be representative of the population. Therefore, by resampling S m times with replacement, it is as if m samples were drawn from the original population, and the estimates derived would be representative of the theoretical distribution under the traditional approach.
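A minimal Python sketch of this procedure (the sample S below is illustrative, not from the slides):

```python
import random
import statistics

random.seed(42)

S = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.8, 5.0, 4.6]  # observed sample of size n
n = len(S)
m = 1000  # number of bootstrap resamples

# Resample S with replacement m times; each resample has n observations.
boot_means = [statistics.mean(random.choices(S, k=n)) for _ in range(m)]

# The m estimates approximate the sampling distribution of the mean.
print(round(statistics.mean(boot_means), 2))   # centers near mean(S)
print(round(statistics.stdev(boot_means), 3))  # bootstrap standard error
```

Here the statistic is the mean, but any other estimator could be substituted for `statistics.mean` inside the loop.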
How Bootstrapping Statistics Works?
From the main population, draw a sample of size n (here n < 30):
Actual sample: x_1, x_2, x_3, …, x_n
Re-sampling it with replacement m times, each re-sample also of size n:
Re-sample 1: x*(1)_1, x*(1)_2, x*(1)_3, …, x*(1)_n → 𝜃*_1
Re-sample 2: x*(2)_1, x*(2)_2, x*(2)_3, …, x*(2)_n → 𝜃*_2
Re-sample 3: x*(3)_1, x*(3)_2, x*(3)_3, …, x*(3)_n → 𝜃*_3
…
Re-sample m: x*(m)_1, x*(m)_2, x*(m)_3, …, x*(m)_n → 𝜃*_m
Here 𝜃* represents the estimate of the model parameters computed from each re-sample.
How Bootstrapping Statistics Works?
 Increasing the number of resamples, m, will not increase the amount of
information in the data. That is, resampling the original set 100,000 times is not
more useful than resampling it 1,000 times. The amount of information within
the set is dependent on the sample size, n, which will remain constant
throughout each resample.
 The benefit of more resamples, then, is to derive a better estimate of the
sampling distribution.
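This point can be checked numerically: two bootstrap runs with very different m estimate essentially the same standard error (a sketch with simulated data; `boot_se` is an illustrative helper, not from the slides):

```python
import random
import statistics

def boot_se(sample, m, seed):
    """Standard error of the mean, estimated from m bootstrap resamples."""
    rng = random.Random(seed)
    n = len(sample)
    means = [statistics.mean(rng.choices(sample, k=n)) for _ in range(m)]
    return statistics.stdev(means)

random.seed(1)
sample = [random.gauss(10, 2) for _ in range(30)]  # one observed sample, n = 30

se_small = boot_se(sample, m=200, seed=2)
se_large = boot_se(sample, m=20000, seed=3)

# Both target the same quantity; the larger m merely gives a less noisy estimate.
print(round(se_small, 3), round(se_large, 3))
```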
Bootstrapping Example
Say someone wanted to know how many shoes he has. He would just count them. But what if he wanted to know the average number of shoes owned by everyone in his office? He works for a big company with 1,000 people in the building, so asking each person how many shoes they own would be impractical and time-consuming. Instead, he can use bootstrapping.
One day, he surveys 50 people (without replacement) and records how many shoes each person has. Instead of repeating this survey every day, he takes those 50 data points and creates many bootstrapped samples from them. Each bootstrapped re-sample consists of 50 observations chosen at random, with replacement. For each re-sample, he computes the mean number of shoes owned.
After doing this 100 times, he has 100 estimates of the average number of shoes his co-workers own. From these he can calculate a confidence interval and conclude something like: with 95% confidence, the average number of shoes owned by people in this company is between 6 and 10.
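The shoe survey can be sketched in a few lines (the survey data here is simulated, since the slides give no raw numbers):

```python
import random
import statistics

random.seed(7)

# Hypothetical survey: shoes owned by 50 of the 1,000 employees
survey = [random.randint(2, 14) for _ in range(50)]

# 100 bootstrap re-samples of 50, each drawn with replacement
boot_means = sorted(
    statistics.mean(random.choices(survey, k=50)) for _ in range(100)
)

# Chop off roughly the lowest and highest 2.5% of the 100 means
lower, upper = boot_means[2], boot_means[97]
print(f"95% CI for the average number of shoes: ({lower:.1f}, {upper:.1f})")
```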
Confidence Interval
If the bootstrap distribution is approximately symmetric, we can construct a confidence
interval by finding the percentiles in the bootstrap distribution.
The P% confidence interval would be ( (100 − P)/2 th percentile , (100 + P)/2 th percentile ).
For a 95% confidence interval, we need to identify
the middle 95% of the distribution. To do that, we
use the 97.5th percentile and the 2.5th percentile
(97.5 – 2.5 = 95). In other words, if we order all
sample means from low to high, and then chop off
the lowest 2.5% and the highest 2.5% of the means,
the middle 95% of the means remain. That range is
our bootstrapped confidence interval.
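The percentile rule above translates directly into code (a sketch; `percentile_ci` and the data are illustrative):

```python
import random
import statistics

def percentile_ci(boot_stats, p=95):
    """P% percentile interval: the (100-P)/2 th and (100+P)/2 th percentiles."""
    s = sorted(boot_stats)
    m = len(s)
    lo_q = (100 - p) / 200  # e.g. 0.025 when P = 95
    hi_q = (100 + p) / 200  # e.g. 0.975 when P = 95
    return s[int(lo_q * m)], s[min(int(hi_q * m), m - 1)]

random.seed(3)
data = [random.gauss(50, 8) for _ in range(40)]
boot = [statistics.mean(random.choices(data, k=len(data))) for _ in range(2000)]
print(percentile_ci(boot, p=95))
```

Index-based percentiles are a simple approximation; libraries such as NumPy offer interpolated percentile variants.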
Confidence Interval
Bootstrap distributions are usually symmetric and bell-shaped. If the bootstrap
distribution is not symmetric, constructing a confidence interval solely based on
percentiles may not accurately represent the uncertainty in your estimate,
especially if there is substantial skewness. In such cases, using methods that
account for the asymmetry of the distribution would be more appropriate.
One widely used method in these situations is the Bias-Corrected and
Accelerated (BCa) interval. The BCa interval adjusts for both bias and skewness
in the bootstrap distribution, providing a more accurate confidence interval
when the distribution is not symmetric. Using this method, the interval is shifted to the right when the bootstrap distribution is positively skewed and to the left when it is negatively skewed.
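A stdlib-only sketch of the BCa recipe (the function name, data, and defaults are illustrative; in practice one would typically reach for `scipy.stats.bootstrap` with `method='BCa'`):

```python
import random
import statistics
from statistics import NormalDist

def bca_ci(sample, stat=statistics.mean, B=2000, level=0.95, seed=0):
    """Bias-corrected and accelerated (BCa) bootstrap interval (sketch)."""
    rng = random.Random(seed)
    nd = NormalDist()
    n = len(sample)
    theta_hat = stat(sample)

    # Bootstrap replicates of the statistic
    boots = sorted(stat(rng.choices(sample, k=n)) for _ in range(B))

    # Bias correction z0: how far theta_hat sits from the bootstrap median
    prop = sum(b < theta_hat for b in boots) / B
    z0 = nd.inv_cdf(min(max(prop, 1 / B), 1 - 1 / B))

    # Acceleration a from jackknife (leave-one-out) estimates
    jack = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    jbar = statistics.mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    # Skew-adjusted percentile levels replace the plain 2.5% / 97.5%
    def endpoint(alpha):
        z = nd.inv_cdf(alpha)
        q = nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))
        return boots[min(int(q * B), B - 1)]

    alpha = (1 - level) / 2
    return endpoint(alpha), endpoint(1 - alpha)

random.seed(5)
data = [random.expovariate(1.0) for _ in range(40)]  # positively skewed data
print(bca_ci(data))
```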
How well does Bootstrapping work?
The bootstrap method has been around since 1979, and its usage has steadily increased. Various studies over the intervening decades have shown that bootstrap sampling distributions approximate the correct sampling distributions, under the central assumption that the original sample accurately represents the actual population.
The resampling process creates many possible samples that a study could have
drawn. The various combinations of values in the simulated samples collectively
provide an estimate of the variability between random samples drawn from the
same population. The range of these potential samples allows the procedure to
construct confidence intervals and perform hypothesis testing. Importantly, as
the sample size increases, bootstrapping converges on the correct sampling
distribution under most conditions.
For which sample statistics?
While we have focused so far on the sample mean, the bootstrap method can analyze a broad range of sample statistics and properties. These include the mean, median, mode, standard deviation, analysis of variance, correlations, regression coefficients, proportions, odds ratios, variance in binary data, and multivariate statistics, among others.
• Mean: Bootstrapping can be used to estimate the sampling distribution of the mean.
This is particularly useful when the sample size is small or when the underlying
population distribution is not normal.
• Variance and Standard Deviation: Bootstrapping can be applied to estimate the
variance and standard deviation of a sample. It is especially useful when the assumption
of normality is in question.
For which sample statistics?
• Median: Bootstrapping can be used to estimate the sampling distribution of the
median. This is valuable when dealing with non-normally distributed data or small
sample sizes.
• Correlation Coefficient: Bootstrapping is applicable to estimate the sampling
distribution of correlation coefficients. This can be useful when dealing with bivariate
data.
• Outliers: Bootstrapping can be applied to assess the impact of outliers on statistical
measures and to identify influential observations.
• Percentiles and Confidence Intervals: Bootstrapping is often used to estimate
confidence intervals for percentiles, which can be particularly relevant for skewed data.
For which sample statistics?
• Skewness and Kurtosis: Bootstrapping can be used to estimate the sampling
distribution of skewness and kurtosis, which are measures of the shape of a
distribution.
• Regression Coefficients: Bootstrapping can be employed to estimate the uncertainty
associated with regression coefficients. This is beneficial when assumptions of
normality or homoscedasticity are violated.
• Classification Metrics: In machine learning, bootstrapping can be used to estimate
confidence intervals for classification metrics such as accuracy, precision, recall, and F1
score.
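As an example of the last point, a bootstrap CI for accuracy can be built by resampling the evaluation examples (the classifier output below is simulated; names are illustrative):

```python
import random

random.seed(11)

# Hypothetical binary classifier output on 200 held-out examples
y_true = [random.randint(0, 1) for _ in range(200)]
y_pred = [y if random.random() < 0.8 else 1 - y for y in y_true]  # ~80% accurate

def accuracy(idx):
    return sum(y_true[i] == y_pred[i] for i in idx) / len(idx)

# Bootstrap the *examples* (paired resampling keeps y_true/y_pred aligned)
B = 1000
indices = range(len(y_true))
boot_acc = sorted(
    accuracy(random.choices(indices, k=len(y_true))) for _ in range(B)
)

ci = (boot_acc[int(0.025 * B)], boot_acc[int(0.975 * B) - 1])
print(f"accuracy = {accuracy(list(indices)):.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Resampling whole examples, rather than y_true and y_pred separately, is what keeps each resample a valid classifier evaluation.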
Differences between Bootstrapping and
Traditional Hypothesis Testing
1) Sampling Method:
• Traditional Hypothesis Testing: Relies on theoretical distributions and
assumptions. It often assumes that the sample is a random representation of the
population, and the analysis is based on predefined statistical distributions (e.g.,
normal distribution).
• Bootstrapping: Involves resampling with replacement from the observed data.
Instead of assuming the distribution, it uses the sample to estimate the sampling
distribution.
2) Parameter Estimation:
• Traditional Hypothesis Testing: Involves estimating parameters of the population
based on the sample and using them to make inferences.
• Bootstrapping: Estimates parameters by repeatedly resampling from the observed data, creating a distribution of the parameter estimate without assuming a specific distribution.
Differences between Bootstrapping and
Traditional Hypothesis Testing
3) Sample Size Requirements:
• Traditional Hypothesis Testing: May require a sufficiently large sample size to
meet the assumptions of the chosen statistical test.
• Bootstrapping: Can be more robust with smaller sample sizes, as it generates
its own "virtual samples" through resampling.
4) Statistical Inference:
• Traditional Hypothesis Testing: Involves comparing a test statistic (calculated
from the sample) to a critical value from a theoretical distribution to make
inferences about the population parameter.
• Bootstrapping: Constructs confidence intervals and makes inferences based
on the distribution of the parameter obtained from resampling.
Differences between Bootstrapping and
Traditional Hypothesis Testing
5) Computational Intensity:
• Traditional Hypothesis Testing: Often involves simpler calculations based on
theoretical distributions.
• Bootstrapping: Can be computationally intensive, especially with a large number of resamples.
In short, bootstrapping is a more flexible and distribution-free method that is
particularly useful when assumptions about the population distribution cannot be
met or when dealing with small sample sizes. Traditional hypothesis testing, on the
other hand, relies on specific assumptions about population distributions and may
be more appropriate in situations where these assumptions are reasonable and can
be met.
When Bootstrapping isn’t useful?
• Small sample sizes: A bootstrap sample contains no more information about the population than the original sample does.
• Biased data: Bootstrapping can give a false sense of accuracy if the data are skewed or biased.
• Unrepresentative samples: If the sample is not representative of the whole population, the bootstrap will not be accurate.
• Infinite population variance: Bootstrapping is not appropriate when the population variance is infinite, as with very heavy-tailed distributions.
• Discontinuous population values: Bootstrapping the median is not appropriate when the population distribution is discontinuous at the median.
• Extreme order statistics: Bootstrapping is unreliable for statistics such as the sample minimum or maximum.
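The last caveat is easy to demonstrate: the bootstrap distribution of the sample maximum is degenerate, piling up on the observed maximum rather than spreading around the population maximum (a small simulation, with illustrative uniform data):

```python
import random

random.seed(9)

n, B = 50, 1000
sample = [random.random() for _ in range(n)]
obs_max = max(sample)

# Fraction of bootstrap resamples whose maximum equals the observed maximum
hits = sum(max(random.choices(sample, k=n)) == obs_max for _ in range(B)) / B
print(round(hits, 2))  # close to 1 - (1 - 1/n)**n, about 0.64
```

Nearly two-thirds of the resamples reproduce the same maximum, so a percentile interval for the maximum collapses onto a single point.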
Applications of Bootstrapping Method
The bootstrapping method is versatile and can be applied in various disciplines
where statistical inference is necessary, especially when the assumptions of
traditional methods are in question or when dealing with small sample sizes.
 Testing hypotheses: Traditional statistical methods often make generalizations about a population based on a single sample. By gleaning insights from thousands of simulated samples, the bootstrapping method makes more accurate calculations possible.
 Creating confidence intervals: When calculating a statistic of interest,
bootstrapping can generate thousands of simulated samples that each feature their
own statistic of interest. Teams can then develop a confidence interval that is more
precise since it relies on a larger collection of samples as opposed to just one
sample or a few samples.
Applications of Bootstrapping Method
 Calculating standard error: Bootstrapping is better equipped than traditional methods for calculating standard error since it generates many simulated samples at random. This makes it easier to determine the means of different samples and to estimate a sampling distribution that is more reflective of the larger data set and can be used to find the standard error.
 Training machine learning algorithms: Machine learning algorithms can be trained with an initial sample. Bootstrapping adds another dimension to this process by resampling that initial sample to produce simulated samples to which the algorithms are exposed after training. This gives a clearer picture of how the algorithms perform outside of training.
Applications of Bootstrapping Method
 Portfolio Management: In finance, bootstrapping is often used for estimating
the distribution of financial returns, helping to assess portfolio risk and optimize
asset allocation strategies.
 Biostatistics: Bootstrapping is used in clinical trials for estimating the
distribution of treatment effects, confidence intervals for efficacy measures, and
assessing the robustness of results.
 Environmental Studies: Bootstrapping is employed in ecological studies to
estimate species richness, diversity indices, and other ecological parameters.
Advantages of Bootstrapping Statistics
 Simplicity: Bootstrapping is a straightforward way to derive estimates of
standard errors and confidence intervals.
 Robust statistical inference: Bootstrapping allows for robust statistical
inference without relying on strong assumptions about the underlying data
distribution.
 No assumptions about data: Bootstrapping doesn't need you to make any
assumptions about the data, such as normality.
 Convenience: Bootstrapping avoids the cost of repeating the experiment to
get other groups of sampled data.
Advantages of Bootstrapping Statistics
 Easy to implement and understand: Bootstrapping is easy to implement and
understand without requiring complex formulas or calculations.
 Confidence Intervals: The method provides a straightforward way to estimate
the variability of a statistic and calculate confidence intervals without making
strong distributional assumptions.
 Wide variety of problems: Bootstrapping can be applied to a wide variety of
problems, including nonlinear regression, classification, confidence interval
estimation, bias estimation, adjustment of p-values, and time series analysis.
Limitations of Bootstrapping Statistics
 Sample size: Accuracy is ultimately bounded by the original sample. As the sample size increases, the estimated parameter becomes more accurate, but if the sample is not representative of the population, the bootstrap will be very inaccurate.
 Time-consuming: Thousands of simulated samples are needed for
bootstrapping to be accurate.
 No new information: A bootstrap sample can only tell us things about the
original sample, and won't give any new information about the real population.
 Limited added value: Bootstrapping offers little benefit over traditional methods when the sample size is large, the population distribution is known, or the statistic of interest is simple and standard.
Limitations of Bootstrapping Statistics
 Margin of error: The results it provides cannot be understood to be correct
with 100% certainty. There will be a margin of error.
 Computationally taxing: Because bootstrapping requires thousands of
samples and takes longer to complete, it also demands higher levels of
computational power.
 Incompatible at times: Bootstrapping isn’t always the best fit for a situation,
especially when dealing with spatial data or a time series.
 Prone to bias: Bootstrapping doesn’t always take into account the variability of
distributions, leading to errors and biases when making calculations.
Common Misconceptions
• Bootstrapping Always Yields Accurate Results: Bootstrapping provides
estimates based on the observed data. However, its accuracy depends on the
quality and representativeness of the original sample. Inadequate or biased
samples can lead to unreliable results.
• Bootstrapping Can Fix a Poorly Collected Dataset: Bootstrapping cannot
compensate for fundamental issues in data collection. It is not a cure for biased
sampling methods, measurement errors, or other data quality issues. Careful
data collection remains crucial.
• Bootstrapping is Computationally Intensive for Large Datasets: While
bootstrapping involves resampling, modern computing power has significantly
reduced the computational burden. Efficient algorithms and parallel processing
can make bootstrapping feasible even with large datasets.
Common Misconceptions
• Bootstrapping is Only for Small Datasets: Bootstrapping is versatile and
applicable to datasets of various sizes. It can be particularly beneficial for small
sample sizes, but its usefulness extends to larger datasets, especially in
scenarios with complex relationships.
• Bootstrapping Cannot Handle Categorical Data: Bootstrapping can be adapted
for categorical data through methods like the stratified bootstrap. It allows for
resampling within strata defined by categorical variables, making it applicable to
a broader range of data types.
• Bootstrapping Eliminates the Need for Traditional Statistics: Bootstrapping is a
valuable complement to traditional methods but not a replacement. It is
essential to consider the specific characteristics of the data and the question
when choosing between bootstrapping and traditional statistical approaches.
Conclusion
In conclusion, Bootstrapping emerges as a powerful and flexible tool in the
statistician's arsenal. Through its resampling technique, bootstrapping allows us
to make robust inferences, particularly in scenarios where traditional methods
falter. We've explored its fundamental concepts, steps, advantages, and
limitations.
As we've seen, bootstrapping offers simplicity without sacrificing accuracy. Its
wide range of applications, from estimating confidence intervals to hypothesis
testing and model validation, makes it a versatile approach in various fields.
The comparisons with traditional methods underscore its resilience and
effectiveness. By understanding and addressing challenges, researchers can
harness the full potential of bootstrapping in their analyses.
Conclusion
Through case studies, we've witnessed real-world applications that demonstrate
how bootstrapping has contributed to sound statistical decision-making. It's not
without its challenges, but with careful consideration and implementation of best
practices, these challenges can be navigated.
In essence, bootstrapping is not just a statistical technique; it's a paradigm shift
in how we approach uncertainty. As we continue to explore and push the
boundaries of statistical methodologies, bootstrapping stands as a testament to
the importance of adaptability and innovation in the field of statistics.
References
• https://www.mastersindatascience.org/learning/machine-learning-algorithms/bootstrapping/
• https://www.shopify.com/blog/what-is-bootstrapping
• https://dictionary.cambridge.org/dictionary/english/pull-haul-up-by-the-your-own-bootstraps
• https://zapier.com/blog/you-cant-pull-yourself-up-by-your-bootstraps/
• https://builtin.com/data-science/bootstrapping-statistics
• https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
• https://stats.stackexchange.com/questions/280725/pros-and-cons-of-bootstrapping
• https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307
• https://www.analyticsvidhya.com/blog/2020/02/what-is-bootstrap-sampling-in-statistics-and-machine-learning/
• https://www.lancaster.ac.uk/stor-i-student-sites/katie-howgate/2021/02/05/bootstrapping/
• https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
Thank You
05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 37

BOOTSTRAPPING IN STATISTICS

  • 1.
  • 2.
    Contents o What isBootstrapping? o Why Bootstrapping Statistics? o When Bootstrapping Statistics? o What is Bootstrapping Statistics? o Replacement and Sampling o How Bootstrapping Statistics Works? o Bootstrapping Example o Confidence Interval o Example of Using Bootstrapping o How well does Bootstrapping work? o For which sample statistics? o Differences between Bootstrapping and Traditional Hypothesis Testing o When Bootstrapping isn’t useful? o Applications of Bootstrapping Method o Advantages of Bootstrapping Statistics o Limitations of Bootstrapping Statistics o Common Misconceptions o Conclusion o References 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 2
  • 3.
    This Photo islicensed under CC BY-SA-NC What is Bootstrapping?  The term "bootstrapping" came from the phrase "to pull oneself up by one's bootstraps". The phrase was used in the 18th and 19th centuries to refer to an impossible task. For example, "He left school at 15 and pulled himself up by his bootstraps to ultimately head up a company".  The phrase is believed to have come from the German author Rudolf Erich Raspe, who wrote about a character who pulled himself out of a swamp by pulling his own hair. 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 3
  • 4.
    Why Bootstrapping Statistics? Statistical inference generally relies on the sampling distribution and the standard error of the feature of interest. The traditional approach, or large sample approach, draws one sample of size n from the population, and that sample is used to calculate population estimates to then make inferences on. In reality, only one sample has been observed.  The theory states that, under certain conditions such as large sample sizes, the sampling distribution will be approximately normal, and the standard deviation of the distribution will be equal to the standard error. But what happens if the sample size is not sufficiently large enough? Then, it can’t necessarily be assumed that the theoretical sampling distribution is normal. This makes it difficult to determine the standard error of the estimate and harder to draw reasonable conclusions from the data. 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 4
  • 5.
    When Bootstrapping Statistics? The sample size is small  The population distribution is unknown  The statistic of interest is complex or non-standard  There is no analytical form or asymptotic theory to help estimate the distribution of the statistics of interest  The distribution is not clean 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 5
  • 6.
     In thecontext of statistics and data science, bootstrapping means something more specific and possible. Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of that population, using replacement during the sampling process. This relates back to the original phrase because it belies the notion that the sample is only relying on smaller samples of itself to make calculations on, in order to draw conclusions for the larger population.  “Bootstrapping is a statistical procedure that resamples a single data set to create many simulated samples. This process allows for the calculation of standard errors, confidence intervals, and hypothesis testing,” according to a post on bootstrapping statistics from statistician Jim Frost. What is Bootstrapping Statistics? 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 6
  • 7.
    Replacement and Sampling The process of sampling with replacement is used in statistics and probability, as well as in various algorithms and simulations. When one sample with replacement, the probability of selecting a particular item remains the same for each draw, as each draw is independent of the others.  Sampling with replacement is a method of selecting items from a dataset in which each selection is replaced before the next one is made. In other words, an item can be selected more than once in the sampling process. This is in contrast to sampling without replacement, where each item can be selected only once.  Sampling with replacement is commonly used in bootstrap resampling methods, it is also used in Monte Carlo simulations and various other statistical and computational techniques. 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 7
  • 8.
    Replacement and Sampling Wecontinue until the other box has same number of balls as the original one. We first randomly pick a ball and replicate it, then move the replicate ball to another box. Then, randomly pick another ball and do the same like before. Imagine we have 5 different colored balls in a box. This is called sampling with replacement. Done ✅ 05-01-2024 B O O T S T R A P P I N G I N S T A T I S T I C S - S U V A M D A S 8
How Bootstrapping Statistics Works?
 In the bootstrapping approach, a sample of size n is drawn from the population; call this sample S. Then, rather than using theory to determine all possible estimates, the sampling distribution is created by resampling observations with replacement from S m times, with each resampled set containing n observations. Now, if sampled appropriately, S should be representative of the population. Therefore, by resampling S m times with replacement, it is as if m samples were drawn from the original population, and the estimates derived are representative of the theoretical distribution under the traditional approach.
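The procedure just described — resample S with replacement m times, computing the statistic of interest on each re-sample — can be sketched as follows (the data values here are made up for illustration):

```python
import random
import statistics

def bootstrap(sample, stat, m=1000, rng=None):
    """Resample `sample` with replacement m times and return the
    statistic computed on each re-sample (each of size n = len(sample))."""
    rng = rng or random.Random(0)
    n = len(sample)
    return [stat(rng.choices(sample, k=n)) for _ in range(m)]

# S: one observed sample of size n (illustrative numbers)
S = [4.1, 5.6, 3.8, 6.2, 4.9, 5.1, 4.4, 5.8, 6.0, 4.7]

boot_means = bootstrap(S, statistics.mean, m=2000)

# The spread of the bootstrap means estimates the standard error of the mean
print(statistics.mean(boot_means))   # close to mean(S)
print(statistics.stdev(boot_means))  # bootstrap standard error
```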
How Bootstrapping Statistics Works?
 [Diagram: from the main population, an actual sample of size n (here n < 30) is drawn: x1, x2, x3, …, xn. Resampling it with replacement yields Re-sample 1 (x*1_1, …, x*1_n) with estimate θ*_1, Re-sample 2 with estimate θ*_2, Re-sample 3 with estimate θ*_3, and so on up to Re-sample m with estimate θ*_m. Here θ* represents the estimate of the model parameters from each re-sample.]
How Bootstrapping Statistics Works?
  Increasing the number of resamples, m, will not increase the amount of information in the data. That is, resampling the original set 100,000 times is not more useful than resampling it 1,000 times. The amount of information in the set depends on the sample size, n, which remains constant across resamples.
  The benefit of more resamples, then, is a better estimate of the sampling distribution itself.
Bootstrapping Example
 Say someone wanted to know how many shoes he has. How would he find out? He would just count them. But what if he wanted to know the average number of shoes that everyone in his office owns? He works for a big company with 1,000 people in the building, so asking each person how many shoes they have would be impractical and time-consuming. Instead, he can use the bootstrap. One day he surveys 50 people (without replacement) and records how many shoes each person owns. Rather than repeating this procedure every day, he takes those 50 data points and creates many bootstrapped samples from them. Each bootstrapped re-sample consists of 50 points chosen at random, with replacement. For each re-sample, he computes the mean number of shoes owned. After doing this 100 times, he has 100 estimates of the average number of shoes his co-workers own. From these he can calculate confidence intervals and conclude something like, "It is 95% likely that the average number of shoes owned by people in this company is between 6 and 10."
Confidence Interval
 If the bootstrap distribution is approximately symmetric, we can construct a confidence interval by finding percentiles of the bootstrap distribution. The P% confidence interval is
 ( ((100 − P)/2)-th percentile, ((100 + P)/2)-th percentile )
 For a 95% confidence interval, we need the middle 95% of the distribution. To get it, we use the 2.5th and 97.5th percentiles (97.5 − 2.5 = 95). In other words, if we order all bootstrap sample means from low to high and chop off the lowest 2.5% and the highest 2.5% of the means, the middle 95% of the means remain. That range is our bootstrapped confidence interval.
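Applied to the shoe survey above, the percentile interval amounts to sorting the bootstrap means and trimming the tails (a sketch with made-up survey numbers):

```python
import random
import statistics

random.seed(1)

# Hypothetical shoe counts from a survey of 50 people (made-up data)
shoes = [random.randint(2, 14) for _ in range(50)]

# 1,000 bootstrap re-samples of the mean, sorted low to high
boot_means = sorted(
    statistics.mean(random.choices(shoes, k=len(shoes)))
    for _ in range(1000)
)

def percentile_ci(sorted_stats, level=95.0):
    """Chop (100 - level)/2 percent off each tail of the sorted
    bootstrap statistics; the middle `level` percent remains."""
    m = len(sorted_stats)
    lo_idx = int(m * (100.0 - level) / 200.0)   # e.g. 2.5% from below
    hi_idx = m - 1 - lo_idx                     # e.g. 2.5% from above
    return sorted_stats[lo_idx], sorted_stats[hi_idx]

low, high = percentile_ci(boot_means)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```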
Confidence Interval
 Bootstrap distributions are usually symmetric and bell-shaped. If the bootstrap distribution is not symmetric, constructing a confidence interval solely from percentiles may not accurately represent the uncertainty in the estimate, especially under substantial skewness. In such cases, methods that account for the asymmetry of the distribution are more appropriate. One widely used method is the Bias-Corrected and Accelerated (BCa) interval. The BCa interval adjusts for both bias and skewness in the bootstrap distribution, providing a more accurate confidence interval when the distribution is not symmetric. Under this method, the CI is shifted to the right when the bootstrap distribution is positively skewed and to the left when it is negatively skewed.
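A minimal BCa sketch follows, under the usual construction: a bias correction z0 from the share of bootstrap statistics below the original estimate, and an acceleration term a from jackknife (leave-one-out) estimates. The data and simple index arithmetic here are illustrative, not production-grade:

```python
import random
import statistics
from statistics import NormalDist

def bca_interval(sample, stat, m=2000, level=0.95, rng=None):
    """Bias-Corrected and Accelerated (BCa) bootstrap interval:
    a percentile interval whose endpoints are adjusted for bias (z0)
    and skewness (acceleration a, estimated via the jackknife)."""
    rng = rng or random.Random(0)
    n = len(sample)
    theta_hat = stat(sample)
    boots = sorted(stat(rng.choices(sample, k=n)) for _ in range(m))

    nd = NormalDist()
    # Bias correction: how far the bootstrap distribution sits
    # from the original estimate (clamped away from 0 and 1).
    prop = sum(b < theta_hat for b in boots) / m
    z0 = nd.inv_cdf(min(max(prop, 1.0 / m), 1.0 - 1.0 / m))

    # Acceleration from jackknife estimates.
    jack = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    jmean = statistics.mean(jack)
    num = sum((jmean - j) ** 3 for j in jack)
    den = 6.0 * sum((jmean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    alpha = (1.0 - level) / 2.0
    def adjusted(z_alpha):
        z = z0 + (z0 + z_alpha) / (1.0 - a * (z0 + z_alpha))
        return nd.cdf(z)
    lo = boots[min(m - 1, max(0, int(adjusted(nd.inv_cdf(alpha)) * m)))]
    hi = boots[min(m - 1, max(0, int(adjusted(nd.inv_cdf(1 - alpha)) * m)))]
    return lo, hi

# Right-skewed illustrative data
data = [1.1, 1.3, 1.2, 1.8, 2.4, 1.5, 3.9, 1.4, 6.2, 1.7, 2.1, 1.6]
lo, hi = bca_interval(data, statistics.mean)
print(f"95% BCa CI for the mean: ({lo:.2f}, {hi:.2f})")
```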
How well does Bootstrapping work?
 The bootstrap method has been around since 1979, and its usage has increased. Various studies over the intervening decades have found that bootstrap sampling distributions approximate the correct sampling distributions, under the central assumption that the original sample accurately represents the actual population. The resampling process creates many possible samples that a study could have drawn. The various combinations of values in the simulated samples collectively provide an estimate of the variability between random samples drawn from the same population. The range of these potential samples allows the procedure to construct confidence intervals and perform hypothesis testing. Importantly, as the sample size increases, bootstrapping converges on the correct sampling distribution under most conditions.
For which sample statistics?
 While the discussion so far has focused on the sample mean, the bootstrap method can analyze a broad range of sample statistics and properties. These include the mean, median, mode, standard deviation, analysis of variance, correlations, regression coefficients, proportions, odds ratios, variance in binary data, and multivariate statistics, among others.
 • Mean: Bootstrapping can be used to estimate the sampling distribution of the mean. This is particularly useful when the sample size is small or when the underlying population distribution is not normal.
 • Variance and Standard Deviation: Bootstrapping can be applied to estimate the variance and standard deviation of a sample. It is especially useful when the assumption of normality is in question.
For which sample statistics?
 • Median: Bootstrapping can be used to estimate the sampling distribution of the median. This is valuable when dealing with non-normally distributed data or small sample sizes.
 • Correlation Coefficient: Bootstrapping can estimate the sampling distribution of correlation coefficients, which is useful when dealing with bivariate data.
 • Outliers: Bootstrapping can be applied to assess the impact of outliers on statistical measures and to identify influential observations.
 • Percentiles and Confidence Intervals: Bootstrapping is often used to estimate confidence intervals for percentiles, which is particularly relevant for skewed data.
For which sample statistics?
 • Skewness and Kurtosis: Bootstrapping can be used to estimate the sampling distribution of skewness and kurtosis, the measures of the shape of a distribution.
 • Regression Coefficients: Bootstrapping can be employed to estimate the uncertainty associated with regression coefficients. This is beneficial when assumptions of normality or homoscedasticity are violated.
 • Classification Metrics: In machine learning, bootstrapping can be used to estimate confidence intervals for classification metrics such as accuracy, precision, recall, and F1 score.
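As one concrete case from the list above, regression coefficients can be bootstrapped by resampling whole (x, y) pairs with replacement, the so-called pairs bootstrap (a sketch on synthetic data with true slope near 2):

```python
import random
import statistics

def ols_slope(points):
    """Closed-form least-squares slope for a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

rng = random.Random(7)
# Illustrative (x, y) pairs: y = 2x plus Gaussian noise
points = [(x, 2 * x + rng.gauss(0, 1)) for x in range(20)]

# Pairs bootstrap: resample entire (x, y) pairs with replacement,
# refit the slope on each re-sample.
boot_slopes = sorted(
    ols_slope(rng.choices(points, k=len(points))) for _ in range(1000)
)
ci = (boot_slopes[25], boot_slopes[974])  # middle 95% of 1,000 slopes
print(f"95% CI for the slope: ({ci[0]:.2f}, {ci[1]:.2f})")
```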
Differences between Bootstrapping and Traditional Hypothesis Testing
 1) Sampling Method:
 • Traditional Hypothesis Testing: Relies on theoretical distributions and assumptions. It often assumes that the sample is a random representation of the population, and the analysis is based on predefined statistical distributions (e.g., normal distribution).
 • Bootstrapping: Involves resampling with replacement from the observed data. Instead of assuming the distribution, it uses the sample to estimate the sampling distribution.
 2) Parameter Estimation:
 • Traditional Hypothesis Testing: Involves estimating parameters of the population based on the sample and using them to make inferences.
 • Bootstrapping: Estimates parameters by repeatedly resampling from the observed data, creating a distribution of the parameter without assuming a specific distribution.
Differences between Bootstrapping and Traditional Hypothesis Testing
 3) Sample Size Requirements:
 • Traditional Hypothesis Testing: May require a sufficiently large sample size to meet the assumptions of the chosen statistical test.
 • Bootstrapping: Can be more robust with smaller sample sizes, as it generates its own "virtual samples" through resampling.
 4) Statistical Inference:
 • Traditional Hypothesis Testing: Involves comparing a test statistic (calculated from the sample) to a critical value from a theoretical distribution to make inferences about the population parameter.
 • Bootstrapping: Constructs confidence intervals and makes inferences based on the distribution of the parameter obtained from resampling.
Differences between Bootstrapping and Traditional Hypothesis Testing
 5) Computational Intensity:
 • Traditional Hypothesis Testing: Often involves simpler calculations based on theoretical distributions.
 • Bootstrapping: Can be computationally intensive, especially with a large number of resamples.
 In short, bootstrapping is a more flexible, distribution-free method that is particularly useful when assumptions about the population distribution cannot be met or when dealing with small sample sizes. Traditional hypothesis testing, on the other hand, relies on specific assumptions about population distributions and may be more appropriate in situations where those assumptions are reasonable and can be met.
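A bootstrap hypothesis test can be sketched as follows: to test equality of two group means, shift each group to the pooled mean so that resampling happens under the null, then count how often the resampled difference is at least as extreme as the observed one (illustrative data, not from the slides):

```python
import random
import statistics

def boot_two_sample_p(a, b, m=2000, rng=None):
    """Bootstrap test of H0: equal means. Each group is recentered on
    the pooled mean so resampling happens under the null; the p-value
    is the share of resampled differences at least as extreme as the
    observed difference."""
    rng = rng or random.Random(0)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = statistics.mean(a + b)
    a0 = [x - statistics.mean(a) + pooled for x in a]
    b0 = [x - statistics.mean(b) + pooled for x in b]
    extreme = 0
    for _ in range(m):
        d = (statistics.mean(rng.choices(a0, k=len(a0)))
             - statistics.mean(rng.choices(b0, k=len(b0))))
        if abs(d) >= abs(observed):
            extreme += 1
    return extreme / m

# Illustrative groups with clearly different means
a = [5.1, 5.4, 4.9, 5.6, 5.2, 5.0, 5.3]
b = [3.1, 3.4, 2.8, 3.6, 3.0, 3.2, 2.9]
p = boot_two_sample_p(a, b)
print(p)  # small p-value: reject H0 of equal means
```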
When Bootstrapping isn't useful?
 • Small sample sizes: Bootstrapping does not contain more information about the population than the original sample, so a very small sample stays uninformative.
 • Biased data: Bootstrapping can give a false sense of accuracy if the data are skewed or biased.
 • Unrepresentative samples: If the sample is not representative of the whole population, the bootstrap will not be very accurate.
 • Infinite population variance: Bootstrapping is not appropriate when the population variance is infinite.
 • Discontinuous population values: Bootstrapping is not appropriate when estimating the median and the population distribution is discontinuous at the median.
 • Extreme order statistics: Bootstrapping is unreliable for statistics such as the sample maximum or minimum, which depend on the extreme tails of the underlying distribution.
Applications of Bootstrapping Method
 The bootstrapping method is versatile and can be applied in various disciplines where statistical inference is necessary, especially when the assumptions of traditional methods are in question or when dealing with small sample sizes.
  Testing hypotheses: Traditional statistical methods often generalize about a data set based on a single sample. By gleaning insights from thousands of simulated samples, the bootstrapping method makes more accurate calculations possible.
  Creating confidence intervals: When calculating a statistic of interest, bootstrapping can generate thousands of simulated samples, each with its own statistic of interest. Teams can then develop a confidence interval that is more precise, since it relies on a large collection of samples rather than just one or a few.
Applications of Bootstrapping Method
  Calculating standard error: Bootstrapping is better equipped than traditional methods for calculating standard error, since it generates many simulated samples at random. This makes it easier to determine the means of different samples and to estimate a sampling distribution that is more reflective of the larger data set, from which the standard error can be found.
  Training machine learning algorithms: Machine learning algorithms can be trained with an initial sample. Bootstrapping adds another dimension to this process by resampling the initial sample to produce simulated samples, to which algorithms are exposed after training. This provides a clearer picture of how machine learning algorithms perform outside of training.
Applications of Bootstrapping Method
  Portfolio Management: In finance, bootstrapping is often used for estimating the distribution of financial returns, helping to assess portfolio risk and optimize asset allocation strategies.
  Biostatistics: Bootstrapping is used in clinical trials for estimating the distribution of treatment effects, constructing confidence intervals for efficacy measures, and assessing the robustness of results.
  Environmental Studies: Bootstrapping is employed in ecological studies to estimate species richness, diversity indices, and other ecological parameters.
Advantages of Bootstrapping Statistics
  Simplicity: Bootstrapping is a straightforward way to derive estimates of standard errors and confidence intervals.
  Robust statistical inference: Bootstrapping allows for robust statistical inference without relying on strong assumptions about the underlying data distribution.
  No assumptions about data: Bootstrapping doesn't require assumptions about the data, such as normality.
  Convenience: Bootstrapping avoids the cost of repeating the experiment to obtain other groups of sampled data.
Advantages of Bootstrapping Statistics
  Easy to implement and understand: Bootstrapping is easy to implement and understand, without requiring complex formulas or calculations.
  Confidence Intervals: The method provides a straightforward way to estimate the variability of a statistic and calculate confidence intervals without strong distributional assumptions.
  Wide variety of problems: Bootstrapping can be applied to a wide variety of problems, including nonlinear regression, classification, confidence interval estimation, bias estimation, adjustment of p-values, and time series analysis.
Limitations of Bootstrapping Statistics
  Sample size: The original sample size is the real limitation. As the sample size increases, the estimated parameter becomes more accurate; if the sample is not representative of the population, the bootstrap will be very inaccurate.
  Time-consuming: Thousands of simulated samples are needed for bootstrapping to be accurate.
  No new information: A bootstrap sample can only tell us things about the original sample; it gives no new information about the real population.
  Limited benefit in standard settings: When the sample size is large, the population distribution is known, or the statistic of interest is simple, traditional methods are usually just as accurate and far cheaper.
Limitations of Bootstrapping Statistics
  Margin of error: The results it provides cannot be taken as correct with 100% certainty; there will be a margin of error.
  Computationally taxing: Because bootstrapping requires thousands of samples and takes longer to complete, it demands more computational power.
  Incompatible at times: Bootstrapping isn't always the best fit for a situation, especially when dealing with spatial data or a time series.
  Prone to bias: Bootstrapping doesn't always account for the variability of distributions, which can lead to errors and biases in the calculations.
Common Misconceptions
 • Bootstrapping Always Yields Accurate Results: Bootstrapping provides estimates based on the observed data, but its accuracy depends on the quality and representativeness of the original sample. Inadequate or biased samples can lead to unreliable results.
 • Bootstrapping Can Fix a Poorly Collected Dataset: Bootstrapping cannot compensate for fundamental issues in data collection. It is not a cure for biased sampling methods, measurement errors, or other data quality issues. Careful data collection remains crucial.
 • Bootstrapping is Computationally Intensive for Large Datasets: While bootstrapping involves resampling, modern computing power has significantly reduced the computational burden. Efficient algorithms and parallel processing can make bootstrapping feasible even with large datasets.
Common Misconceptions
 • Bootstrapping is Only for Small Datasets: Bootstrapping is versatile and applicable to datasets of various sizes. It can be particularly beneficial for small sample sizes, but its usefulness extends to larger datasets, especially in scenarios with complex relationships.
 • Bootstrapping Cannot Handle Categorical Data: Bootstrapping can be adapted for categorical data through methods like the stratified bootstrap, which resamples within strata defined by categorical variables, making it applicable to a broader range of data types.
 • Bootstrapping Eliminates the Need for Traditional Statistics: Bootstrapping is a valuable complement to traditional methods, not a replacement. It is essential to consider the specific characteristics of the data and the question when choosing between bootstrapping and traditional statistical approaches.
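The stratified bootstrap mentioned above resamples with replacement within each stratum, preserving the strata sizes; a minimal sketch with made-up group data:

```python
import random
import statistics

def stratified_bootstrap(strata, rng=None):
    """One stratified bootstrap re-sample: resample with replacement
    *within* each stratum, keeping every stratum at its original size."""
    rng = rng or random.Random(0)
    return {name: rng.choices(vals, k=len(vals))
            for name, vals in strata.items()}

# Illustrative data grouped by a categorical variable
strata = {
    "treatment": [2.4, 3.1, 2.8, 3.5, 2.9],
    "control":   [1.9, 2.2, 2.0, 2.5],
}

rng = random.Random(3)
diffs = []
for _ in range(1000):
    rs = stratified_bootstrap(strata, rng)
    diffs.append(statistics.mean(rs["treatment"])
                 - statistics.mean(rs["control"]))

print(statistics.mean(diffs))  # bootstrap estimate of the group difference
```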
Conclusion
 In conclusion, bootstrapping emerges as a powerful and flexible tool in the statistician's arsenal. Through its resampling technique, bootstrapping allows us to make robust inferences, particularly in scenarios where traditional methods falter. We've explored its fundamental concepts, steps, advantages, and limitations. As we've seen, bootstrapping offers simplicity without sacrificing accuracy. Its wide range of applications, from estimating confidence intervals to hypothesis testing and model validation, makes it a versatile approach in various fields. The comparisons with traditional methods underscore its resilience and effectiveness. By understanding and addressing its challenges, researchers can harness the full potential of bootstrapping in their analyses.
Conclusion
 Through case studies, we've witnessed real-world applications that demonstrate how bootstrapping has contributed to sound statistical decision-making. It is not without its challenges, but with careful consideration and implementation of best practices, those challenges can be navigated. In essence, bootstrapping is not just a statistical technique; it's a paradigm shift in how we approach uncertainty. As we continue to explore and push the boundaries of statistical methodologies, bootstrapping stands as a testament to the importance of adaptability and innovation in the field of statistics.
References
 • https://www.mastersindatascience.org/learning/machine-learning-algorithms/bootstrapping/
 • https://www.shopify.com/blog/what-is-bootstrapping
 • https://dictionary.cambridge.org/dictionary/english/pull-haul-up-by-the-your-own-bootstraps
 • https://zapier.com/blog/you-cant-pull-yourself-up-by-your-bootstraps/
 • https://builtin.com/data-science/bootstrapping-statistics
 • https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
 • https://stats.stackexchange.com/questions/280725/pros-and-cons-of-bootstrapping
 • https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307
 • https://www.analyticsvidhya.com/blog/2020/02/what-is-bootstrap-sampling-in-statistics-and-machine-learning/
 • https://www.linkedin.com/advice/1/what-advantages-disadvantages-bootstrapping-data
 • https://www.lancaster.ac.uk/stor-i-student-sites/katie-howgate/2021/02/05/bootstrapping/
 • https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
Thank You