Week_3_Lecture.pdf

Fundamentals of
Quantitative Research
Methods and Data Analytics
BUS159
Week 3

1. Bootstrapping
2. Introduction to Regression
3. Simple Linear Regression
4. Summary and Regression Analysis in R
4.1. Formula and Basics
4.2. Examples of Data and Problem
4.3. Visualisation
4.4. Computation
4.5. Interpretation
4.6. Regression Line
4.7. Model Assessment
Content

Mid-term Coursework Assignment: 30% of the overall mark
▪ List of five exercises to be performed remotely within a 24-hour
period.
▪ Deadline: 18/03/2022 at 10:00am
Final Coursework Assignment: 70% of the overall mark
▪ Report showing a competent application of quantitative methods
and data analysis concepts learned in our module, exploring a topic
of your own interest.
▪ Word limit: 2000 words.
▪ Deadline: 22/04/2022 at 10:00am
Assessment Profile

1. Instructions and Guidance
• In this Report (2000 words), please proceed as follows:
• Select a topic which you are really interested in exploring.
If you would like me to select your topic, that is completely fine; please just inform
me and I will provide you with a topic to be explored.
• Decide which research question you are going to address.
• Collect data related to your topic and research question.
• Decide which quantitative research method(s) you are going to adopt.
• Perform data analyses applying quantitative research method(s) learnt in this
module on your data using R/ R Studio.
• Detail the method(s) adopted and discuss your findings in your individual Report.
Final Coursework Assignment 70%

The structure of this Report should consist of the following brief sections:
• Section 1. Introduction: Briefly mention your topic, question, input data, and
analyses performed;
• Section 2. Data: Detail your dataset, including data source, temporal coverage,
sample size;
• Section 3. Results: Describe the quantitative research methods adopted and data
analyses performed, reporting your results using a complementary chart and
table, discussing your findings;
• Section 4. Conclusion: Summarise your Report, briefly describing the main
quantitative research method adopted as well as your most relevant/ interesting
finding.
• Appendix. Attach an image/ figure (e.g. code print screen) evidencing that you
performed your data analyses using R/ R Studio.
Final Coursework Assignment 70% (Cont.)

2. Assessment Rubric with Weighted Criteria
• Following the structure of the Report, five rubrics are assessed, each
contributing its respective weight to the overall mark of this
coursework assignment (totalling 100 points), as follows:
• Section 1. Introduction – weight: 15% of the coursework assignment overall
mark;
• Section 2. Data – weight: 20% of the coursework assignment overall mark;
• Section 3. Results – weight: 40% of the coursework assignment overall mark;
• Section 4. Conclusion – weight: 15% of the coursework assignment overall mark;
• Appendix – weight: 10% of the coursework assignment overall mark.
Final Coursework Assignment 70% (Cont.)

Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a single recorded value
of a variable.
An example can be the GDP of Germany in 2020.
Bootstrapping

• Bootstrapping is a statistical procedure that resamples a single dataset
to create many simulated samples.
• This process makes it possible to calculate standard errors, build
confidence intervals, and perform hypothesis testing.
• Both bootstrapping and traditional methods use samples to draw
inferences about populations.
• To accomplish this goal, these procedures treat the single sample that a
study obtains as only one of many random samples that the study could
have collected.
• From a single sample, one can calculate a variety of sample statistics,
such as the mean, median, standard deviation.
Source: https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
Bootstrapping
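The resampling idea described above can be sketched in a few lines of base R. This is only a minimal illustration: the sample values in `x` and the number of resamples `n_boot` are hypothetical, and the percentile interval shown is just one of several ways to build a bootstrap confidence interval.

```r
# Minimal bootstrap sketch in base R (illustrative values, not course data)
set.seed(123)                                    # make the resampling reproducible

x <- c(12, 15, 9, 20, 17, 14, 11, 18, 16, 13)    # hypothetical sample of 10 observations
n_boot <- 2000                                   # assumed number of simulated samples

# Resample the data with replacement and keep the mean of each resample
boot_means <- replicate(n_boot, mean(sample(x, size = length(x), replace = TRUE)))

boot_se <- sd(boot_means)                          # bootstrap standard error of the mean
boot_ci <- quantile(boot_means, c(0.025, 0.975))   # 95% percentile confidence interval

mean(x)
boot_se
boot_ci
```

The same pattern works for other sample statistics: replace `mean` inside `replicate()` with `median` or `sd`.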

Source: https://towardsdatascience.com/introduction-to-bootstrapping-in-data-science-part-1-6e3483636f67
Bootstrapping (Cont.)

Introduction to
Regression

Introduction to Regression
Variations of regression analysis
• Simple: One dependent variable (y), the variable to be
predicted, and one independent variable (x)
• Multiple: Two or more independent variables
• Linear: a linear (“straight-line”) connection between
variables
• Nonlinear: More complex connections (and related formulas)
between variables
Regression analysis aims to identify a mathematical
function that relates two or more variables, so that the
value of one variable may be predicted from given
values of the other(s)

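A Simple Linear Relationship
[Figure: the straight line y = a + bx, with y on the vertical axis and x on the horizontal axis. The intercept a is where the line crosses the y axis; the slope b is the change in y for a one-unit increase in x.]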
Introduction to Regression (Cont)

Simple Linear
Regression

Basic Concept
• Simple linear regression uses one independent (x) and
one dependent variable (y) and produces a straight line.
• Indicates to what extent the variables are associated;
it does not show cause-and-effect.
Scatter Diagram
• Data plot with the y variable on the vertical axis and the
x variable on the horizontal axis.
Fitting a Line to the Data
• Generally, the line will not fit the data perfectly.
• Need to find the “best-fitting” line.
Simple Linear Regression

Objective:
$\min \sum d_i^2 = \min \sum (y_i - \hat{y}_i)^2$
where
$y_i$ = observed value of the dependent variable
$\hat{y}_i$ = estimated value of the dependent variable
The least squares criterion identifies the best fitting
line as the line that minimizes the sum of the
squared vertical distances of points from the line
Simple Linear Regression (Cont)

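[Figure: scatter plot of y against x with a fitted line; d1, d2, d3 and d4 mark the vertical distances of individual points from the line.]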

▪ The slope of the least squares line is calculated as follows:
$b = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$
▪ The intercept of the least squares line is calculated as follows:
$a = \bar{y} - b\bar{x}$
where
x = values of the independent variable
y = values of the dependent variable
$\bar{x}$ = mean of the x values
$\bar{y}$ = mean of the y values
n = the number of points (observations)
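The slope and intercept formulas for the least squares line can be checked numerically. The sketch below uses R's built-in `cars` data (speed and stopping distance) purely as a stand-in example; it is not the data set used in this lecture. It computes b as (n Σxy − Σx Σy) / (n Σx² − (Σx)²) and a as ȳ − b x̄, then compares them with the coefficients returned by `lm()`.

```r
# Built-in 'cars' data, used here only as a convenient stand-in example
x <- cars$speed   # independent variable
y <- cars$dist    # dependent variable
n <- length(x)    # number of observations

# Slope: b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)

# Intercept: a = mean(y) - b * mean(x)
a <- mean(y) - b * mean(x)

c(intercept = a, slope = b)

# Cross-check against R's own least squares fit
coef(lm(dist ~ speed, data = cars))
```

Both approaches should agree up to floating-point rounding, since `lm()` fits the same least squares line.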

The estimated regression equation is defined as follows:
$\hat{y} = a + bx$
where
$\hat{y}$ = estimated value of y for a given value of x
a = intercept
b = slope

Summary and
Regression
Analysis in R

• The simple linear regression is used to predict a
quantitative outcome y on the basis of one single predictor
variable x.
• The objective is to formulate a model that defines y as a
function of the x variable.
• Once we have built a statistically significant model, it is then
possible to use it for predicting future outcomes on the
basis of new x values.
• Suppose that we want to evaluate the impact of
advertising budgets for three media (YouTube, Facebook
and newspaper) on future sales.
• This kind of problem can be modelled with linear
regression in R.
Simple Linear Regression in R
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem

• The mathematical formula of the linear regression can be
written as y = b0 + b1*x + e, where:
▪ b0 and b1 are known as the regression beta coefficients
or parameters, as follows:
▪ b0 is the intercept of the regression line, that is,
the predicted value when x = 0.
▪ b1 is the slope of the regression line.
▪ e is the error term (also known as the residual error),
which refers to the part of y that cannot be explained by
the regression model.
Formula and Basics

• The figure below illustrates the linear regression model,
where:
▪The best-fit regression line is in blue
▪The intercept b0 and the slope b1 are shown in green
▪The error terms (e) are represented by vertical red lines
Formula and Basics (Cont.)

• From the scatter plot, it can be seen that not all the
data points fall exactly on the fitted regression line.
• Some of the points are above the blue curve and
some are below it.
• Overall, the residual errors (e) have approximately
mean zero.
• The sum of the squares of the residual errors is
called the Residual Sum of Squares or RSS.
• The average variation of points around the fitted
regression line is called the Residual Standard
Error (RSE).
• This is one of the metrics used to evaluate the overall
quality of the fitted regression model.
• The lower the RSE, the better it is.

• Since the mean error term is zero, the outcome variable y can
be approximately estimated as follows:
y ~ b0 + b1*x
• Mathematically, the beta coefficients (b0 and b1) are
determined so that the RSS is as minimal as possible.
• This method of determining the beta coefficients is called
least squares regression or ordinary least squares (OLS)
regression.
• Once the beta coefficients are calculated, a t-test is then
performed to check whether or not these coefficients are
statistically significantly different from zero.
• A beta coefficient that is significantly different from zero means
that there is a statistically significant relationship between the
predictor (x) and the outcome variable (y).

Load the following required packages:
• tidyverse: For data manipulation and visualisation
• ggpubr: Easily creates publication-ready plots
Loading Required R Packages
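The two packages listed above could be loaded as in the sketch below. It assumes both are already installed (the commented `install.packages()` line is a reminder, not part of the original slide), and the `theme_set()` call is an optional choice that applies the ggpubr theme to later plots.

```r
# install.packages(c("tidyverse", "ggpubr"))  # run once if the packages are missing

library(tidyverse)   # data manipulation and visualisation (includes ggplot2)
library(ggpubr)      # helpers for publication-ready plots

theme_set(theme_pubr())   # optional: use the ggpubr theme for all later plots
```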

• We’ll use the marketing data set [datarium package]. It
contains the impact of three advertising media (YouTube,
Facebook and newspaper) on sales.
• Data are the advertising budget in thousands of pounds along
with the sales.
• The advertising experiment has been repeated 200 times
with different budgets and the observed sales have been
recorded.
• Firstly, install the datarium package
using devtools::install_github("kassambara/datarium")
Examples of Data and Problem

• Then load and inspect the marketing data as follows:
• We want to predict future sales on the basis of advertising budget
spent on YouTube.
Examples of Data and Problem (Cont.)
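Loading and inspecting the marketing data might look like the sketch below. It assumes the datarium package is installed; the choice of `head()` and `str()` for the inspection step is illustrative.

```r
# Load the marketing data shipped with the datarium package
data("marketing", package = "datarium")

head(marketing, 4)   # first four rows: youtube, facebook, newspaper, sales
str(marketing)       # variable names, types and number of observations
```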

• Let’s create a scatter plot displaying the sales units versus YouTube advertising
budget.
• In addition, let’s add a smoothed line, using the following code:
• This graph suggests a linearly
increasing relationship between the sales
and the YouTube variables.
• This is good because one
important assumption of the linear
regression is that the relationship between
the outcome and predictor variables is
linear and additive.
Visualisation
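A scatter plot with a smoothed line, as described on this slide, can be sketched with ggplot2 as follows. The code assumes ggplot2 and datarium are installed; the object name `p` is only for illustration.

```r
library(ggplot2)
data("marketing", package = "datarium")

# Scatter plot of sales against YouTube budget with a smoothed trend line
p <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth()   # default smoother; ggplot2 picks loess for small data sets

p   # print the plot
```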

• Let’s also compute the correlation coefficient between the two variables using the R
function cor()
• The correlation coefficient measures the level of the association between two
variables x and y, ranging between -1 (perfect negative correlation: when x
increases, y decreases) and +1 (perfect positive correlation: when x increases, y
increases).
• A value closer to 0 suggests a weak relationship between the variables.
• A low correlation (-0.2 < r < 0.2, where r is the correlation coefficient) probably
suggests that much of the variation of the outcome variable (y) is not explained by
the predictor (x).
In such a case, we should probably look for better predictor variables.
• In our example, the correlation coefficient is large enough, so we can continue by
building a linear model of y as a function of x.
Examples of Data and Problem (Cont.)
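The correlation between sales and YouTube budget can be computed with `cor()` as in this short sketch, which assumes the datarium package is installed. The variable name `r` is illustrative.

```r
data("marketing", package = "datarium")

# Pearson correlation coefficient (the default method of cor())
r <- cor(marketing$sales, marketing$youtube)
r
```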

• The simple linear regression tries to find the best line to predict sales on the
basis of YouTube advertising budget.
• The linear model equation can be written as follows:
sales = b0 + b1 * youtube
• The R function lm() can be used to determine the beta coefficients of the linear
model, as follows:
• The results show the intercept and the beta coefficient for the YouTube variable.
Computation
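Fitting the model with `lm()` might look like the sketch below, assuming the datarium package is installed. The object name `model` is an illustrative choice.

```r
data("marketing", package = "datarium")

# Fit sales = b0 + b1 * youtube by ordinary least squares
model <- lm(sales ~ youtube, data = marketing)

model         # prints the call and the estimated coefficients
coef(model)   # named vector: (Intercept) = b0, youtube = b1
```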

From the output on the previous slide we have the following:
• The estimated regression line equation can be written as follows:
sales = 8.44 + 0.048*youtube
• The intercept b0 is 8.44. It can be interpreted as the predicted sales units for a zero
YouTube advertising budget.
• Recall that we are operating in units of thousand pounds. This means that, for a YouTube
advertising budget equal to zero, we can expect a sale of
8.44 * 1,000 = 8,440 pounds
• The regression beta coefficient for the variable YouTube b1, also known as the slope, is
0.048.
This means that each one-unit increase in the youtube variable is associated with an expected
increase of 0.048 units in sales. For a youtube value of 1,000 (a budget of 1,000 thousand
pounds, since budgets are recorded in thousands), the expected increase is 48 units
(0.048*1,000). That is:
sales = 8.44 + 0.048*1000 = 56.44 units.
• As we are operating in units of thousand pounds, this represents a sale of 56,440
pounds.
Interpretation
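The interpretation of the intercept and slope can be reproduced with `predict()`. In this sketch the budgets in `new_budgets` are hypothetical values chosen for illustration, expressed in the same units as the youtube variable, and the datarium package is assumed to be installed.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

# Hypothetical YouTube budgets, in the same units as the data
new_budgets <- data.frame(youtube = c(0, 100, 1000))

# Predicted sales from the fitted line b0 + b1 * youtube
predicted <- predict(model, newdata = new_budgets)
cbind(new_budgets, predicted_sales = predicted)
```

Predictions for budgets far from the observed values are extrapolations; comparing the new values with `range(marketing$youtube)` is a sensible check before relying on them.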

• To add the regression line onto the scatter
plot, you can use the
function stat_smooth() [ggplot2].
• By default, the fitted line is presented with
confidence interval around it. The
confidence bands reflect the uncertainty
about the line.
• If you don’t want to display it, specify the
option se = FALSE in the
function stat_smooth().
Regression Line
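Adding the regression line with `stat_smooth()` can be sketched as follows, assuming ggplot2 and datarium are installed. The object names `p_band` and `p_no_band` are illustrative; `method = "lm"` requests the least squares line rather than the default smoother.

```r
library(ggplot2)
data("marketing", package = "datarium")

# Regression line with the default confidence band
p_band <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth(method = "lm")

# Same line without the confidence band
p_no_band <- ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

p_band
p_no_band
```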

• In the previous slides, we built a linear model of sales as a
function of YouTube advertising budget:
sales = 8.44 + 0.048*youtube
• Before using this formula to predict future sales, you should make
sure that this model is statistically significant, that is:
▪There is a statistically significant relationship between the
predictor and the outcome variables
▪The model that we built fits the data at hand very well.
• Therefore, in the next slides we explain how to check the quality
of a linear regression model.
Model Assessment

• We start by displaying the statistical summary of the model using
the R function summary()
Model Summary
The R summary output shows six components,
including:
Call. Shows the function call used to compute the
regression model.
Residuals. Provide a quick view of the distribution
of the residuals, which by definition have a mean
zero. Therefore, the median should not be far from
zero, and the minimum and maximum should be
roughly equal in absolute value.
Coefficients. Shows the regression beta
coefficients and their statistical significance.
Predictor variables that are significantly associated
with the outcome variable are marked by stars.
Residual standard error (RSE), R-squared (R2)
and the F-statistic are metrics that are used to
check how well the model fits our data.
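The model summary and its individual components can be obtained as in this sketch, which assumes the datarium package is installed. The object names are illustrative.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

model_summary <- summary(model)
model_summary                  # full printed summary

# Individual pieces of the summary
model_summary$call             # the function call
model_summary$coefficients     # estimates, standard errors, t values, p-values
model_summary$sigma            # residual standard error (RSE)
model_summary$r.squared        # R-squared
model_summary$adj.r.squared    # adjusted R-squared
model_summary$fstatistic       # F-statistic with its degrees of freedom
```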

• The coefficients table, in the model statistical summary, shows:
▪The estimates of the beta coefficients.
▪The standard errors (SE), which define the accuracy of the beta
coefficients. For a given beta coefficient, the SE reflects how the
coefficient varies under repeated sampling. It can be used to
compute the confidence intervals and the t-statistic.
▪The t-statistic and the associated p-value, which define the
statistical significance of the beta coefficients.
Coefficients Significance

• For a given predictor, the t-statistic (and its associated p-value) tests
whether or not there is a statistically significant relationship between a
given predictor and the outcome variable.
• The statistical hypotheses are as follows:
• Null hypothesis (H0): The coefficients are equal to zero (i.e. no relationship
between x and y)
• Alternative Hypothesis (Ha): The coefficients are not equal to zero (i.e. there is
some relationship between x and y)
• Mathematically, for a given beta coefficient (b), the t-test is computed as
t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b.
• The t-statistic measures the number of standard errors that b is
away from 0. Therefore, a large t-statistic produces a small p-value.
t-statistic and p-values
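The formula t = (b - 0)/SE(b) can be verified against the coefficient table. This sketch assumes the datarium package is installed; the two-sided p-value is taken from the t distribution with the model's residual degrees of freedom, and the variable names are illustrative.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)
coefs <- summary(model)$coefficients

b  <- coefs["youtube", "Estimate"]     # estimated slope
se <- coefs["youtube", "Std. Error"]   # its standard error

t_stat  <- (b - 0) / se                # t = (b - 0) / SE(b)
p_value <- 2 * pt(abs(t_stat), df = df.residual(model), lower.tail = FALSE)

c(t = t_stat, p = p_value)
coefs["youtube", c("t value", "Pr(>|t|)")]   # values reported by summary()
```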

• The larger the t-statistic (and, consequently, the lower the
p-value), the more significant the predictor.
• The symbols to the right visually specify the level of significance.
The line below the table shows the definition of these symbols.
For example, one star means 0.01 < p < 0.05. The more stars
beside the variable’s p-value, the more significant the variable.
• A statistically significant coefficient indicates that there is a
statistically significant association between the predictor (x) and
the outcome (y) variable.
t-statistic and p-values (Cont.)

• In our example, the p-values for both the intercept and the
predictor variable are highly significant.
• Thus, we can reject the null hypothesis and accept the alternative
hypothesis, which means that there is a significant association
between the predictor and the outcome variables.
• The t-statistic is a very useful guide for whether or not to include
a predictor in a model. High t-statistics (i.e. low p-values near 0)
indicate that a predictor should be retained in a model, while very
low t-statistics indicate a predictor variable could be dropped.
t-statistic and p-values (Cont.)

• The standard error measures the variability/accuracy of the beta
coefficients.
• It can be used to compute the confidence intervals of the
coefficients.
• For example, the 95% confidence interval for the coefficient b1 is
defined as b1 +/- 2*SE(b1), where:
▪ The lower limit of b1 = b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042
▪ The upper limit of b1 = b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052
• That is, there is approximately a 95% chance that the interval
[0.042, 0.052] will contain the true value of b1.
Standard Errors and Confidence Intervals
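The b1 +/- 2*SE(b1) rule can be compared with the interval that R computes. This sketch assumes the datarium package is installed; `confint()` uses the exact t quantile instead of the rounded multiplier 2, so the two intervals should be close but not identical.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)
coefs <- summary(model)$coefficients

b  <- coefs["youtube", "Estimate"]
se <- coefs["youtube", "Std. Error"]

# Rule-of-thumb interval from the slide: b1 +/- 2 * SE(b1)
approx_ci <- c(lower = b - 2 * se, upper = b + 2 * se)

# Interval based on the exact t quantile
exact_ci <- confint(model, "youtube", level = 0.95)

approx_ci
exact_ci
```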

• Once you have identified that at least one predictor variable is
significantly associated with the outcome, you should continue the
diagnostic by checking how well the model fits the data.
• This process is also referred to as assessing the goodness-of-fit.
• The overall quality of the linear regression fit can be assessed
using the following three quantities, displayed in the model
summary:
1. Residual Standard Error (RSE).
2. R-squared (R2)
3. F-statistic
Model Accuracy

• The RSE (also known as the model sigma) is the residual variation,
representing the average variation of the observation points
around the fitted regression line.
• This is the standard deviation of residual errors.
• RSE provides an absolute measure of patterns in the data that
cannot be explained by the model.
• When comparing two models, a smaller RSE is a good indication
that the model fits the data better.
• Dividing the RSE by the average value of the outcome variable
results in the prediction error rate, which should be as small as
possible.
Model Accuracy 1: Residual Standard Error
(RSE)

• In our example, RSE = 3.91, meaning that the observed sales
values deviate from the true regression line by approximately 3.9
units on average.
• Whether or not an RSE of 3.9 units is an acceptable prediction
error is subjective and depends on the problem context.
• However, we can calculate the percentage error. In our data set,
the mean value of sales is 16.827, and so the percentage error is
3.9/16.827 = 23%.
Model Accuracy 1: Residual Standard Error
(RSE) (Cont.)
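The RSE and the percentage error can be computed directly from the fitted model. This sketch assumes the datarium package is installed; the manual version uses the residual sum of squares divided by the residual degrees of freedom, and the variable names are illustrative.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)

rse <- sigma(model)                           # residual standard error from R

# Manual version: sqrt(RSS / residual degrees of freedom)
rss        <- sum(residuals(model)^2)
rse_manual <- sqrt(rss / df.residual(model))

# Prediction error rate: RSE divided by the mean of the outcome
error_rate <- rse / mean(marketing$sales)

c(rse = rse, rse_manual = rse_manual, error_rate = error_rate)
```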

• The R-squared (R2) ranges from 0 to 1 and represents the proportion of
information (i.e. variation) in the data that can be explained by the
model.
• The adjusted R-squared adjusts for the degrees of freedom.
• The R2 measures how well the model fits the data.
• For a simple linear regression, R2 is the square of the Pearson
correlation coefficient.
• A large value of R2 is a good indication. However, as the value of R2
tends to increase when more predictors are added in the model, such as
in multiple linear regression model, you should mainly consider the
adjusted R-squared, which is a penalized R2 for a higher number of
predictors.
▪ An (adjusted) R2 that is close to 1 indicates that a large proportion of the
variability in the outcome has been explained by the regression model.
▪ A number near 0 indicates that the regression model did not explain much of the
variability in the outcome.
Model Accuracy 2: R-squared and Adjusted R-
squared
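The link between R2 and the Pearson correlation in simple linear regression can be checked as in this sketch, which assumes the datarium package is installed. The variable names are illustrative.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)
s <- summary(model)

r2     <- s$r.squared        # R-squared
adj_r2 <- s$adj.r.squared    # adjusted R-squared
r      <- cor(marketing$sales, marketing$youtube)

# In simple linear regression, R-squared equals the squared correlation
c(r_squared = r2, cor_squared = r^2, adjusted = adj_r2)
```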

• The F-statistic gives the overall significance of the model. It assesses
whether at least one predictor variable has a non-zero coefficient.
• In a simple linear regression, this test is not really interesting
since it just duplicates the information given by the t-test,
available in the coefficient table.
• In fact, the F-statistic equals the square of the t-statistic: 312.1 ≈
(17.67)^2. This is true in any model with 1 degree of freedom.
• The F-statistic becomes more important once we start using
multiple predictors as in multiple linear regression.
• A large F-statistic corresponds to a statistically significant p-value
(p < 0.05). In our example, the F-statistic equals 312.14, producing
a p-value of 1.46e-42 (or
0.00000000000000000000000000000000000000000146),
which is highly significant.
Model Accuracy 3: F-statistic
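The relationship between the F-statistic and the t-statistic can be verified numerically. This sketch assumes the datarium package is installed; the p-value is computed from the F distribution with the degrees of freedom stored in the summary object, and the variable names are illustrative.

```r
data("marketing", package = "datarium")
model <- lm(sales ~ youtube, data = marketing)
s <- summary(model)

f_stat <- s$fstatistic[["value"]]                  # F-statistic
t_stat <- s$coefficients["youtube", "t value"]     # t-statistic of the slope

# p-value of the F-test
f_p <- pf(f_stat,
          df1 = s$fstatistic[["numdf"]],
          df2 = s$fstatistic[["dendf"]],
          lower.tail = FALSE)

c(F = f_stat, t_squared = t_stat^2, p_value = f_p)
```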

• After computing a regression model, a first step is to check
whether at least one predictor is significantly associated with
the outcome variable.
• If one or more predictors are significant, the second step is to
assess how well the model fits the data by inspecting the
Residual Standard Error (RSE), the R2 value and the F-statistic.
• These metrics give the overall quality of the model.
Summary
Residual Standard Error (RSE): the closer to zero, the better
R-squared: the larger, the better
F-statistic: the larger, the better

• Bootstrapping is a popular statistical procedure that resamples a single dataset to create many
simulated samples in order to calculate standard errors, build confidence intervals, and
perform hypothesis testing.
• Simple linear regression is used to predict a quantitative outcome y on the basis of one single
predictor variable x.
• The objective is to formulate a model that defines y as a function of the x variable.
• Once we have built a statistically significant model, it is then possible to use it for predicting future
outcomes on the basis of new x values.
• The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where
b0 and b1 are the regression parameters and e is the error term that refers to the part of y that
cannot be explained by the regression model.
• The larger the t-statistic – and, consequently, the lower the p-value, the more significant the
predictor variable x.
• The overall quality of the linear regression fit can be assessed using the following three
quantities: Residual Standard Error RSE (the closer to zero the better), R-squared (the larger the
better), and F-statistic (the larger the better).
Takeaways

• Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge University Press.
• Bruce, P., & Bruce, A. (2017). Practical Statistics for Data Scientists. O’Reilly
Media.
• Evans, J. R., & Olson, D. L. (2007). Statistics, Data Analysis, and Decision
Modeling. New Jersey: Pearson/Prentice Hall.
• Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business Statistics. Wiley Global
Education.
• http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction
to Statistical Learning: With Applications in R. Springer Publishing Company,
Incorporated.
• Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative Analysis for
Management, 13e. Prentice Hall.
References

Any Questions?

Thank You!
