Fundamentals of
Quantitative Research
Methods and Data Analytics
BUS159
Week 3
1. Bootstrapping
2. Introduction to Regression
3. Simple Linear Regression
4. Summary and Regression Analysis in R
4.1. Formula and Basics
4.2. Examples of Data and Problem
4.3. Visualisation
4.4. Computation
4.5. Interpretation
4.6. Regression Line
4.7. Model Assessment
Content
Mid-term Coursework Assignment: 30% of the overall mark
▪ List of five exercises to be performed remotely within a 24-hour
period.
▪ Deadline: 18/03/2022 at 10:00am
Final Coursework Assignment: 70% of the overall mark
▪ Report showing a competent application of quantitative methods
and data analysis concepts learned in our module, exploring a topic
of your own interest.
▪ Word limit: 2000 words.
▪ Deadline: 22/04/2022 at 10:00am
Assessment Profile
1. Instructions and Guidance
• In this Report (2000 words), please proceed as follows:
• Select a topic which you are really interested in exploring.
If you would like me to select your topic, that is completely fine; please just inform
me and I will provide you with a topic to explore.
• Decide which research question you are going to address.
• Collect data related to your topic and research question.
• Decide which quantitative research method(s) you are going to adopt.
• Perform data analyses applying quantitative research method(s) learnt in this
module on your data using R/ R Studio.
• Detail the method(s) adopted and discuss your findings in your individual Report.
Final Coursework Assignment 70%
The structure of this Report should consist of the following brief sections:
• Section 1. Introduction: Briefly mention your topic, question, input data, and
analyses performed;
• Section 2. Data: Detail your dataset, including data source, temporal coverage,
sample size;
• Section 3. Results: Describe the quantitative research methods adopted and data
analyses performed, reporting your results using a complementary chart and
table, discussing your findings;
• Section 4. Conclusion: Summarise your Report, briefly describing the main
quantitative research method adopted as well as your most relevant/ interesting
finding.
• Appendix. Attach an image/ figure (e.g. code print screen) evidencing that you
performed your data analyses using R/ R Studio.
Final Coursework Assignment 70% (Cont.)
2. Assessment Rubric with Weighted Criteria
• Following the structure of the Report, five items are assessed, each
contributing its respective weight to the overall mark of this coursework
assignment (totalling 100 points), as follows:
• Section 1. Introduction – weight: 15% of the coursework assignment overall
mark;
• Section 2. Data – weight: 20% of the coursework assignment overall mark;
• Section 3. Results – weight: 40% of the coursework assignment overall mark;
• Section 4. Conclusion – weight: 15% of the coursework assignment overall mark;
• Appendix – weight: 10% of the coursework assignment overall mark.
Final Coursework Assignment 70% (Cont.)
Basics of Data
• Statistics is the science of data.
• Data consist of the facts or figures that are the subject of
summarisation, analysis, modelling, and presentation.
• A dataset is a collection of data with some common connection.
For instance, the GDP of European countries from 2010 to 2020.
• A variable is a particular characteristic of interest within a group of
observations.
For instance, the GDP of Germany.
• An observation (observational unit or case) is a particular value
comprising a variable.
An example can be the GDP of Germany in 2020.
Bootstrapping
• Bootstrapping is a statistical procedure that resamples a single dataset
to create many simulated samples.
• This process allows you to calculate standard errors, build confidence
intervals, and perform hypothesis testing.
• Both bootstrapping and traditional methods use samples to draw
inferences about populations.
• To accomplish this goal, these procedures treat the single sample that a
study obtains as only one of many random samples that the study could
have collected.
• From a single sample, one can calculate a variety of sample statistics,
such as the mean, median, standard deviation.
Source: https://statisticsbyjim.com/hypothesis-testing/bootstrapping/
Bootstrapping
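The resampling idea described on this slide can be sketched in a few lines. The lecture works in R; the example below uses plain Python only to make the procedure explicit. The sample values, the number of resamples (2,000) and the helper name `bootstrap_means` are illustrative assumptions, not course material.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=1):
    """Resample `sample` with replacement and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical sample (made-up numbers, not course data)
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 8.7, 10.9, 11.8, 9.5, 12.4]

means = sorted(bootstrap_means(sample))

# Bootstrap standard error: the standard deviation of the resampled means
boot_se = statistics.stdev(means)

# 95% percentile confidence interval: cut off 2.5% of the resampled means at each end
ci_low = means[int(0.025 * len(means))]
ci_high = means[int(0.975 * len(means)) - 1]

print(f"sample mean  = {statistics.mean(sample):.2f}")
print(f"bootstrap SE = {boot_se:.3f}")
print(f"95% CI       = [{ci_low:.2f}, {ci_high:.2f}]")
```

The percentile interval is only one of several ways to turn the resampled statistics into a confidence interval; it is used here because it is the simplest to read.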
Source: https://towardsdatascience.com/introduction-to-bootstrapping-in-data-science-part-1-6e3483636f67
Bootstrapping (Cont.)
Introduction to
Regression
Introduction to Regression
Variations of regression analysis
• Simple: One dependent variable (y), the variable to be
predicted, and one independent variable (x)
• Multiple: Two or more independent variables
• Linear: A linear (“straight-line”) connection between
variables
• Nonlinear: More complex connections (and related formulas)
between variables
Regression analysis aims to identify a mathematical
function that relates two or more variables, so that the
value of one variable may be predicted from given
values of the other(s)
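A Simple Linear Relationship
[Figure: the straight line y = a + bx plotted against x, with intercept a (the value of y at x = 0) and slope b (the change in y for a one-unit increase in x)]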
Introduction to Regression (Cont)
Simple Linear
Regression
Basic Concept
• Simple linear regression uses one independent (x) and
one dependent variable (y) and produces a straight line.
• Indicates to what extent the variables are associated;
it does not show cause-and-effect.
Scatter Diagram
• Data plot with the y variable on the vertical axis and the
x variable on the horizontal axis.
Fitting a Line to the Data
• Generally, the line will not fit the data perfectly.
• Need to find the “best-fitting” line.
Simple Linear Regression
Objective:
$$\min \sum d_i^2 = \min \sum (y_i - \hat{y}_i)^2$$
where
$y_i$ = observed value of the dependent variable
$\hat{y}_i$ = estimated value of the dependent variable
The least squares criterion identifies the best fitting
line as the line that minimizes the sum of the
squared vertical distances of points from the line
Simple Linear Regression (Cont)
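To see the least squares criterion at work, the sketch below compares the sum of squared vertical distances for three candidate lines. It is written in plain Python purely as an illustration (the lecture itself uses R), and the five data points and the two alternative lines are made up for the example.

```python
# Hypothetical data points (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sum_squared_distances(a, b):
    """Sum of squared vertical distances d_i = y_i - (a + b*x_i)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# The least squares line for these five points is y = 2.2 + 0.6x
best = sum_squared_distances(2.2, 0.6)

# Two other candidate lines through the same cloud of points
flatter = sum_squared_distances(2.8, 0.4)
steeper = sum_squared_distances(1.6, 0.8)

print(f"least squares line: {best:.2f}")
print(f"flatter line:       {flatter:.2f}")
print(f"steeper line:       {steeper:.2f}")
```

Both alternative lines give a larger sum of squares than the least squares line, which is what "best-fitting" means under this criterion.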
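[Figure: scatter of points around a fitted line on x-y axes, with d1, d2, d3 and d4 marking the vertical distances of four points from the line]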
Simple Linear Regression (Cont)
▪ The slope of the least squares line is calculated as follows:
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$
▪ The intercept of the least squares line is calculated as follows:
$$a = \bar{y} - b\bar{x}$$
where
x = values of the independent variable
y = values of the dependent variable
$\bar{x}$ = mean of the x values
$\bar{y}$ = mean of the y values
n = the number of points (observations)
Simple Linear Regression (Cont)
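As a worked example of the slope and intercept formulas, the following sketch applies b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and a = ȳ − b·x̄ to five made-up points. Plain Python is used only to spell out the arithmetic; the data are an assumption for illustration and are not from the course.

```python
# Hypothetical data points (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# The four sums that appear in the slope formula
sum_x = sum(x)                                  # 15
sum_y = sum(y)                                  # 20
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 66
sum_x2 = sum(xi ** 2 for xi in x)               # 55

# Slope: (5*66 - 15*20) / (5*55 - 15^2) = 30 / 50
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: mean of y minus slope times mean of x
a = sum_y / n - b * (sum_x / n)

print(f"b = {b:.2f}, a = {a:.2f}")  # b = 0.60, a = 2.20
```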
The estimated regression equation is defined as follows:
ŷ = a + bx
where
ŷ = estimated value of y for a given value of x
a = intercept
b = slope
Simple Linear Regression (Cont)
Summary and
Regression
Analysis in R
• The simple linear regression is used to predict a
quantitative outcome y on the basis of one single predictor
variable x.
• The objective is to formulate a model that defines y as a
function of the x variable.
• Once we have built a statistically significant model, it is then
possible to use it for predicting future outcomes on the
basis of new x values.
• Suppose that we want to evaluate the impact of the
advertising budgets of three media (YouTube, Facebook
and newspaper) on future sales.
• This kind of problem can be modelled with linear
regression in R.
Simple Linear Regression in R
Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
• The mathematical formula of the linear regression can be
written as y = b0 + b1*x + e, where:
▪ b0 and b1 are known as the regression beta coefficients
or parameters, as follows:
▪ b0 is the intercept of the regression line, that is,
the predicted value when x = 0.
▪ b1 is the slope of the regression line.
▪ e is the error term (also known as the residual error),
which refers to the part of y that cannot be explained by
the regression model.
Formula and Basics
• The figure below illustrates the linear regression model,
where:
▪The best-fit regression line is in blue
▪The intercept b0 and the slope b1 are shown in green
▪The error terms (e) are represented by vertical red lines
Formula and Basics (Cont.)
• From the scatter plot, it can be seen that not all the
data points fall exactly on the fitted regression line.
• Some of the points are above the blue curve and
some are below it.
• Overall, the residual errors (e) have approximately
mean zero.
• The sum of the squares of the residual errors is
called the Residual Sum of Squares or RSS.
• The average variation of points around the fitted
regression line is called the Residual Standard
Error (RSE).
• This is one of the metrics used to evaluate the overall
quality of the fitted regression model.
• The lower the RSE, the better it is.
Formula and Basics (Cont.)
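The quantities on this slide can be computed by hand for a small example. The sketch below uses plain Python rather than the lecture's R, with five made-up points whose least squares line is y = 2.2 + 0.6x. It assumes the usual simple-regression definition RSE = sqrt(RSS / (n − 2)), where n − 2 is the residual degrees of freedom.

```python
import math

# Hypothetical data points (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # least squares estimates for these five points

# Residual errors: observed value minus fitted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

mean_residual = sum(residuals) / len(residuals)   # approximately zero
rss = sum(e ** 2 for e in residuals)              # Residual Sum of Squares
rse = math.sqrt(rss / (len(x) - 2))               # Residual Standard Error

print(f"mean residual = {mean_residual:.4f}")
print(f"RSS = {rss:.2f}")
print(f"RSE = {rse:.3f}")
```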
• Since the mean error term is zero, the outcome variable y can
be approximately estimated as follows:
y ~ b0 + b1*x
• Mathematically, the beta coefficients (b0 and b1) are
determined so that the RSS is as minimal as possible.
• This method of determining the beta coefficients is called
least squares regression or ordinary least squares (OLS)
regression.
• Once the beta coefficients are calculated, a t-test is then
performed to check whether or not these coefficients are
statistically significantly different from zero.
• A slope coefficient that is significantly different from zero means that there is a statistically
significant relationship between the predictor (x) and the
outcome variable (y).
Formula and Basics (Cont.)
Load the following required packages:
• tidyverse: For data manipulation and visualisation
• ggpubr: For easily creating publication-ready plots
Loading Required R Packages
• We’ll use the marketing data set [datarium package]. It
contains the impact of three advertising media (YouTube,
Facebook and newspaper) on sales.
• Data are the advertising budget in thousands of pounds along
with the sales.
• The advertising experiment has been repeated 200 times
with different budgets and the observed sales have been
recorded.
• Firstly, install the datarium package
using devtools::install_github("kassambara/datarium")
Examples of Data and Problem
• Then load and inspect the marketing data as follows:
• We want to predict future sales on the basis of advertising budget
spent on YouTube.
Examples of Data and Problem (Cont.)
• Let’s create a scatter plot displaying the sales units versus YouTube advertising
budget.
• In addition, let’s add a smoothed line, using the following code:
• This graph suggests a linearly
increasing relationship between the sales
and the YouTube variables.
• This is good because one
important assumption of the linear
regression is that the relationship between
the outcome and predictor variables is
linear and additive.
Visualisation
• Let’s also compute the correlation coefficient between the two variables using the R
function cor()
• The correlation coefficient measures the level of the association between two
variables x and y, ranging between -1 (perfect negative correlation: when x
increases, y decreases) and +1 (perfect positive correlation: when x increases, y
increases).
• A value closer to 0 suggests a weak relationship between the variables.
• A low correlation (-0.2 < r < 0.2, where r is the correlation coefficient) probably suggests that much of the variation of the
outcome variable (y) is not explained by the predictor (x).
In such a case, we should probably look for better predictor variables.
• In our example, the correlation coefficient is large enough, so we can continue by
building a linear model of y as a function of x.
Examples of Data and Problem (Cont.)
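The range of the correlation coefficient is easy to check on toy data. The lecture computes it with the R function cor(); the sketch below is a plain-Python equivalent of the Pearson formula, where the helper name `pearson_r` and all data values are assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (made up for illustration)
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases exactly in step with x
falling = [10, 8, 6, 4, 2]   # y decreases exactly in step with x
noisy = [2, 4, 5, 4, 5]      # y tends to increase with x, with scatter

print(f"rising:  r = {pearson_r(x, rising):.2f}")
print(f"falling: r = {pearson_r(x, falling):.2f}")
print(f"noisy:   r = {pearson_r(x, noisy):.2f}")
```

The first two series hit the limits of +1 and -1, while the noisy series lands in between.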
• The simple linear regression tries to find the best line to predict sales on the
basis of YouTube advertising budget.
• The linear model equation can be written as follows:
sales = b0 + b1 * youtube
• The R function lm() can be used to determine the beta coefficients of the linear
model, as follows:
• The results show the intercept and the beta coefficient for the YouTube variable.
Computation
From the output on the previous slide we have the following:
• The estimated regression line equation can be written as follows:
sales = 8.44 + 0.048*youtube
• The intercept b0 is 8.44. It can be interpreted as the predicted sales unit for a zero
YouTube advertising budget.
• Recall that we are operating in units of thousand pounds. This means that, for a YouTube
advertising budget equal to zero, we can expect sales of
8.44 *1,000 = 8,440 pounds
• The regression beta coefficient for the variable YouTube b1, also known as the slope, is
0.048.
This means that each additional unit of youtube (1,000 pounds of budget) is associated with an
increase of 0.048 units in predicted sales. For youtube = 1,000 (a budget of 1,000 thousand
pounds, since the variable is recorded in thousands), we can expect
an increase of 48 units (0.048*1,000) in sales. That is:
sales = 8.44 + 0.048*1000 = 56.44 units.
• As we are operating in units of thousand pounds, this represents sales of 56,440
pounds.
Interpretation
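The arithmetic on this slide can be reproduced with a one-line prediction function. This is a plain-Python sketch (the lecture fits the model in R); the coefficients 8.44 and 0.048 are the rounded values quoted on the slide, and the function name `predict_sales` is hypothetical.

```python
B0 = 8.44    # intercept quoted on the slide
B1 = 0.048   # slope quoted on the slide

def predict_sales(youtube):
    """Predicted sales for a given value of the youtube variable."""
    return B0 + B1 * youtube

# youtube = 0 gives the intercept; each extra unit adds the slope
for youtube in (0, 1, 100, 1000):
    print(f"youtube = {youtube:>4}: predicted sales = {predict_sales(youtube):.3f}")
```

Whatever the starting budget, moving one unit to the right on the x axis raises the prediction by exactly the slope, 0.048.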
• To add the regression line onto the scatter
plot, you can use the
function stat_smooth() [ggplot2].
• By default, the fitted line is presented with
a confidence interval around it. The
confidence bands reflect the uncertainty
about the line.
• If you don’t want to display it, specify the
option se = FALSE in the
function stat_smooth().
Regression Line
• In the previous slides, we built a linear model of sales as a
function of YouTube advertising budget:
sales = 8.44 + 0.048*youtube
• Before using this formula to predict future sales, you should make
sure that this model is statistically significant, that is:
▪There is a statistically significant relationship between the
predictor and the outcome variables
▪The model that we built fits the data at hand very well.
• Therefore, in the next slides we explain how to check the quality
of a linear regression model.
Model Assessment
• We start by displaying the statistical summary of the model using
the R function summary()
Model Summary
The R summary output shows six components, including:
Call. Shows the function call used to compute the
regression model.
Residuals. Provide a quick view of the distribution
of the residuals, which by definition have a mean of
zero. Therefore, the median should not be far from
zero, and the minimum and maximum should be
roughly equal in absolute value.
Coefficients. Shows the regression beta
coefficients and their statistical significance.
Predictor variables that are significantly associated
with the outcome variable are marked by stars.
Residual standard error (RSE), R-squared (R2)
and the F-statistic are metrics that are used to
check how well the model fits our data.
• The coefficients table, in the model statistical summary, shows:
▪The estimates of the beta coefficients.
▪The standard errors (SE), which define the accuracy of the beta
coefficients. For a given beta coefficient, the SE reflects how the
coefficient varies under repeated sampling. It can be used to
compute the confidence intervals and the t-statistic.
▪The t-statistic and the associated p-value, which define the
statistical significance of the beta coefficients.
Coefficients Significance
• For a given predictor, the t-statistic (and its associated p-value) tests
whether or not there is a statistically significant relationship between a
given predictor and the outcome variable.
• The statistical hypotheses are as follows:
• Null hypothesis (H0): The coefficients are equal to zero (i.e. no relationship
between x and y)
• Alternative Hypothesis (Ha): The coefficients are not equal to zero (i.e. there is
some relationship between x and y)
• Mathematically, for a given beta coefficient (b), the t-test is computed as
t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b.
• The t-statistic measures the number of standard errors that b is
away from 0. Therefore, a t-statistic that is large in absolute value produces a small p-value.
t-statistic and p-values
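The formula t = (b - 0)/SE(b) can be traced step by step on a small example. The sketch below is plain Python rather than the lecture's R, uses five made-up points, and assumes the standard simple-regression formula for the standard error of the slope, SE(b1) = RSE / sqrt(Σ(x − x̄)²). Turning t into a p-value needs the t distribution, which R's summary() handles; that step is left out here.

```python
import math

# Hypothetical data points (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least squares estimates
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual standard error with n - 2 degrees of freedom
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
rse = math.sqrt(rss / (n - 2))

# Standard error of the slope, then the t-statistic against H0: b1 = 0
se_b1 = rse / math.sqrt(sxx)
t_stat = (b1 - 0) / se_b1

print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}, t = {t_stat:.3f}")
```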
• The larger the t-statistic in absolute value and, consequently, the lower the p-value,
the more significant the predictor.
• The symbols to the right visually specify the level of significance.
The line below the table shows the definition of these symbols.
For example, one star means 0.01 < p < 0.05. The more the stars
beside the variable’s p-value, the more significant the variable.
• A statistically significant coefficient indicates that there is a
statistically significant association between the predictor (x) and
the outcome (y) variable.
t-statistic and p-values (Cont.)
• In our example, both the p-values for the intercept and the
predictor variable are highly significant.
• Thus, we can reject the null hypothesis and accept the alternative
hypothesis, which means that there is a significant association
between the predictor and the outcome variables.
• The t-statistic is a very useful guide for whether or not to include
a predictor in a model. High t-statistics (i.e. low p-values near 0)
indicate that a predictor should be retained in a model, while very
low t-statistics indicate a predictor variable could be dropped.
t-statistic and p-values (Cont.)
• The standard error measures the variability/accuracy of the beta
coefficients.
• It can be used to compute the confidence intervals of the
coefficients.
• For example, the 95% confidence interval for the coefficient b1 is
defined as b1 +/- 2*SE(b1), where:
▪ The lower limit of b1 = b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042
▪ The upper limit of b1 = b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052
• That is, there is approximately a 95% chance that the interval
[0.042, 0.052] will contain the true value of b1.
Standard Errors and Confidence Intervals
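The interval on this slide is a two-line calculation, reproduced here in plain Python as a check on the arithmetic. The inputs 0.047 and 0.00269 are the rounded values quoted on the slide; the multiplier 2 is a rounded stand-in for the exact critical value, which is about 1.96 for large samples.

```python
b1 = 0.047       # slope estimate quoted on the slide
se_b1 = 0.00269  # its standard error, as quoted on the slide

# Approximate 95% confidence interval: estimate plus or minus two standard errors
lower = b1 - 2 * se_b1
upper = b1 + 2 * se_b1

print(f"95% CI for b1: [{lower:.3f}, {upper:.3f}]")  # [0.042, 0.052]
```

Because the interval does not contain zero, it tells the same story as the t-test: the slope is significantly different from zero.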
• Once you have identified that at least one predictor variable is
significantly associated with the outcome, you should continue the
diagnostic by checking how well the model fits the data.
• This process is also referred to as assessing the goodness-of-fit.
• The overall quality of the linear regression fit can be assessed
using the following three quantities, displayed in the model
summary:
1. Residual Standard Error (RSE).
2. R-squared (R2)
3. F-statistic
Model Accuracy
• The RSE (also known as the model sigma) is the residual variation,
representing the average variation of the observations points
around the fitted regression line.
• This is the standard deviation of residual errors.
• RSE provides an absolute measure of patterns in the data that
cannot be explained by the model.
• When comparing two models, a smaller RSE is a
good indication that the model fits the data better.
• Dividing the RSE by the average value of the outcome variable
results in the prediction error rate, which should be as small as
possible.
Model Accuracy 1: Residual Standard Error
(RSE)
• In our example, RSE = 3.91, meaning that the observed sales
values deviate from the true regression line by approximately 3.9
units on average.
• Whether or not an RSE of 3.9 units is an acceptable prediction
error is subjective and depends on the problem context.
• However, we can calculate the percentage error. In our data set,
the mean value of sales is 16.827, and so the percentage error is
3.9/16.827 = 23%.
Model Accuracy 1: Residual Standard Error
(RSE) (Cont.)
• The R-squared (R2) ranges from 0 to 1 and represents the proportion of
information (i.e. variation) in the data that can be explained by the
model.
• The adjusted R-squared adjusts for the degrees of freedom.
• The R2 measures how well the model fits the data.
• For a simple linear regression, R2 is the square of the Pearson
correlation coefficient.
• A large value of R2 is a good indication. However, as the value of R2
tends to increase when more predictors are added in the model, such as
in multiple linear regression model, you should mainly consider the
adjusted R-squared, which is a penalized R2 for a higher number of
predictors.
▪ An (adjusted) R2 that is close to 1 indicates that a large proportion of the
variability in the outcome has been explained by the regression model.
▪ A number near 0 indicates that the regression model did not explain much of the
variability in the outcome.
Model Accuracy 2: R-squared and Adjusted R-
squared
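Two claims on this slide can be checked numerically: that R2 is the share of variation explained, and that in simple linear regression it equals the squared Pearson correlation. The sketch below is plain Python on five made-up points (the lecture reads these values from R's summary() output). It assumes the usual definitions R2 = 1 − RSS/TSS and adjusted R2 = 1 − (1 − R2)(n − 1)/(n − p − 1), with p the number of predictors.

```python
# Hypothetical data points (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, p = len(x), 1  # p = number of predictors

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)  # total sum of squares (TSS)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least squares fit and its residual sum of squares
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - rss / syy                 # proportion of variation explained
r = sxy / (sxx * syy) ** 0.5              # Pearson correlation coefficient
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(f"R2 = {r_squared:.3f}, r^2 = {r ** 2:.3f}, adjusted R2 = {adj_r_squared:.3f}")
```

The adjusted value is lower than the plain R2 because it pays a penalty for the predictor used, which matters most in small samples like this one.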
• The F-statistic gives the overall significance of the model. It assesses
whether at least one predictor variable has a non-zero coefficient.
• In a simple linear regression, this test is not really interesting
since it just duplicates the information given by the t-test,
available in the coefficient table.
• In fact, the F-test is identical to the square of the t-test: 312.1 =
(17.67)^2. This is true in any model with 1 degree of freedom.
• The F-statistic becomes more important once we start using
multiple predictors as in multiple linear regression.
• A large F-statistic corresponds to a statistically significant p-value
(p < 0.05). In our example, the F-statistic equals 312.14, producing
a p-value of 1.46e-42 (or
0.00000000000000000000000000000000000000000146),
which is highly significant.
Model Accuracy 3: F-statistic
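The link between the F-test and the t-test in simple regression can be verified directly. This plain-Python sketch (the lecture uses R) fits five made-up points and assumes the standard definition F = ((TSS − RSS)/p) / (RSS/(n − p − 1)) with p = 1 predictor. It also squares the t value quoted on the slide.

```python
# Hypothetical data points (made up for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, p = len(x), 1  # p = number of predictors

mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
tss = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# F-statistic: explained variation per predictor over residual variation per df
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

# t-statistic of the slope, for comparison
se_b1 = (rss / (n - 2) / sxx) ** 0.5
t_stat = b1 / se_b1

print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}")

# The slide's own numbers: 17.67 squared is about 312.2, in line with the
# reported F of about 312.1 once rounding of t is taken into account
print(f"17.67^2 = {17.67 ** 2:.1f}")
```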
• After computing a regression model, a first step is to check
whether at least one predictor is significantly associated with
the outcome variable.
• If one or more predictors are significant, the second step is to
assess how well the model fits the data by inspecting the
Residual Standard Error (RSE), the R2 value and the F-statistic.
• These metrics give the overall quality of the model.
Summary
Residual Standard Error (RSE): the closer to zero, the better
R-squared: the larger, the better
F-statistic: the larger, the better
• Bootstrapping is a popular statistical procedure that resamples a single dataset to create many
simulated samples in order to calculate standard errors, build confidence intervals, and
perform hypothesis testing.
• Simple linear regression is used to predict a quantitative outcome y on the basis of one single
predictor variable x.
• The objective is to formulate a model that defines y as a function of the x variable.
• Once we have built a statistically significant model, it is then possible to use it for predicting future
outcomes on the basis of new x values.
• The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where
b0 and b1 are the regression parameters and e is the error term that refers to the part of y that
cannot be explained by the regression model.
• The larger the t-statistic – and, consequently, the lower the p-value, the more significant the
predictor variable x.
• The overall quality of the linear regression fit can be assessed using the following three
quantities: Residual Standard Error RSE (the closer to zero the better), R-squared (the larger the
better), and F-statistic (the larger the better).
Takeaways
• Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge University Press.
• Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly
Media.
• Evans, J. R., Olson, D. L., & Olson, D. L. (2007). Statistics, Data Analysis, and Decision
Modeling. New Jersey: Pearson/Prentice Hall.
• Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business Statistics. Wiley Global
Education.
• http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
• James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction
to Statistical Learning: With Applications in R. Springer Publishing Company,
Incorporated.
• Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative Analysis for
Management, 13e. Prentice Hall.
References
Any Questions?
Thank You!

Course Project AJ DAVIS DEPARTMENT STORESIntroduction.docxCourse Project AJ DAVIS DEPARTMENT STORESIntroduction.docx
Course Project AJ DAVIS DEPARTMENT STORESIntroduction.docx
 
Data analysis plan in medicine and nurse.pptx
Data analysis plan in medicine and nurse.pptxData analysis plan in medicine and nurse.pptx
Data analysis plan in medicine and nurse.pptx
 
An introduction to R
An introduction to RAn introduction to R
An introduction to R
 
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep LearningAnomaly detection: Core Techniques and Advances in Big Data and Deep Learning
Anomaly detection: Core Techniques and Advances in Big Data and Deep Learning
 
Unit 3 – AIML.pptx
Unit 3 – AIML.pptxUnit 3 – AIML.pptx
Unit 3 – AIML.pptx
 
ECO 510 Final Project Guidelines and Rubric Overview The final.docx
ECO 510 Final Project Guidelines and Rubric Overview The final.docxECO 510 Final Project Guidelines and Rubric Overview The final.docx
ECO 510 Final Project Guidelines and Rubric Overview The final.docx
 
Statistics online lecture 01.pptx
Statistics online lecture  01.pptxStatistics online lecture  01.pptx
Statistics online lecture 01.pptx
 
Session 1 and 2.pptx
Session 1 and 2.pptxSession 1 and 2.pptx
Session 1 and 2.pptx
 
STAT200 Assignment #3 - Inferential Statistics Analysis an.docx
STAT200 Assignment #3 - Inferential Statistics Analysis  an.docxSTAT200 Assignment #3 - Inferential Statistics Analysis  an.docx
STAT200 Assignment #3 - Inferential Statistics Analysis an.docx
 
cost estimation techniques
cost estimation techniquescost estimation techniques
cost estimation techniques
 
Anomaly detection Meetup Slides
Anomaly detection Meetup SlidesAnomaly detection Meetup Slides
Anomaly detection Meetup Slides
 

Recently uploaded

Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutionsmonugehlot87
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 

Recently uploaded (20)

Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
buds n tech IT solutions
buds n  tech IT                solutionsbuds n  tech IT                solutions
buds n tech IT solutions
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 

Week_3_Lecture.pdf

  • 1. Fundamentals of Quantitative Research Methods and Data Analytics BUS159 Week 3
  • 2. 1. Bootstrapping 2. Introduction to Regression 3. Simple Linear Regression 4. Summary and Regression Analysis in R 4.1. Formula and Basics 4.2. Examples of Data and Problem 4.3. Visualisation 4.4. Computation 4.5. Interpretation 4.6. Regression Line 4.7. Model Assessment Content
  • 3. Mid-term Coursework Assignment: 30% of the overall mark ▪ List of five exercises to be performed remotely within a 24-hour period. ▪ Deadline: 18/03/2022 at 10:00am Final Coursework Assignment: 70% of the overall mark ▪ Report showing a competent application of quantitative methods and data analysis concepts learned in our module, exploring a topic of your own interest. ▪ Word limit: 2000 words. ▪ Deadline: 22/04/2022 at 10:00am Assessment Profile
  • 4. 1. Instructions and Guidance • In this Report (2000 words), please proceed as follows: • Select a topic which you are really interested in exploring. If you would like me to select your topic, that is completely fine; please just let me know and I will provide you with a topic to explore. • Decide which research question you are going to address. • Collect data related to your topic and research question. • Decide which quantitative research method(s) you are going to adopt. • Perform data analyses applying quantitative research method(s) learnt in this module on your data using R/ R Studio. • Detail the method(s) adopted and discuss your findings in your individual Report. Final Coursework Assignment 70%
  • 5. The structure of this Report should consist of the following brief sections: • Section 1. Introduction: Briefly mention your topic, question, input data, and analyses performed; • Section 2. Data: Detail your dataset, including data source, temporal coverage, sample size; • Section 3. Results: Describe the quantitative research methods adopted and data analyses performed, reporting your results using a complementary chart and table, discussing your findings; • Section 4. Conclusion: Summarise your Report, briefly describing the main quantitative research method adopted as well as your most relevant/ interesting finding. • Appendix. Attach an image/ figure (e.g. code print screen) evidencing that you performed your data analyses using R/ R Studio. Final Coursework Assignment 70% (Cont.)
  • 6. 2. Assessment Rubric with Weighted Criteria • Following the structure of the Report, five rubrics are assessed, each item contributing with its respective weight to this coursework assignment overall mark (totalling 100 points), as follows: • Section 1. Introduction – weight: 15% of the coursework assignment overall mark; • Section 2. Data – weight: 20% of the coursework assignment overall mark; • Section 3. Results – weight: 40% of the coursework assignment overall mark; • Section 4. Conclusion – weight: 15% of the coursework assignment overall mark; • Appendix – weight: 10% of the coursework assignment overall mark. Final Coursework Assignment 70% (Cont.)
  • 7. Basics of Data • Statistics is the science of data. • Data consist of the facts or figures that are the subject of summarisation, analysis, modelling, and presentation. • A dataset is a collection of data with some common connection. For instance, the GDP of European countries from 2010 to 2020. • A variable is a particular characteristic of interest within a group of observations. For instance, the GDP of Germany. • An observation (observational unit or case) is a particular value comprising a variable. An example can be the GDP of Germany in 2020. Bootstrapping
  • 8. • Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. • This process allows you to calculate standard errors, build confidence intervals, and perform hypothesis tests. • Both bootstrapping and traditional methods use samples to draw inferences about populations. • To accomplish this goal, these procedures treat the single sample that a study obtains as only one of many random samples that the study could have collected. • From a single sample, one can calculate a variety of sample statistics, such as the mean, median, and standard deviation. Source: https://statisticsbyjim.com/hypothesis-testing/bootstrapping/ Bootstrapping
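The resampling idea described on this slide can be sketched in a few lines of base R. This is only an illustration: the sample `x` is simulated, and the number of resamples (`n_boot`) and the percentile method for the interval are assumptions, not choices made in the lecture.

```r
# Bootstrap the sample mean: a minimal sketch using base R only.
set.seed(42)                              # make the resampling reproducible
x <- rnorm(50, mean = 100, sd = 15)       # hypothetical sample standing in for real data

n_boot <- 2000                            # number of simulated samples (arbitrary choice)
boot_means <- replicate(n_boot, {
  resample <- sample(x, size = length(x), replace = TRUE)  # draw with replacement
  mean(resample)                                           # statistic of interest
})

boot_se <- sd(boot_means)                         # bootstrap standard error of the mean
boot_ci <- quantile(boot_means, c(0.025, 0.975))  # 95% percentile confidence interval

print(boot_se)
print(boot_ci)
```

The same pattern works for other statistics: replacing `mean(resample)` with `median(resample)` or `sd(resample)` bootstraps the median or the standard deviation instead.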
  • 10. Introduction to Regression
  • 11. Introduction to Regression Variations of regression analysis • Simple: One dependent variable (y), the variable to be predicted, and one independent variable (x) • Multiple: Two or more independent variables • Linear: A linear (“straight-line”) connection between variables • Nonlinear: More complex connections, and related formulas, between variables Regression analysis aims to identify a mathematical function that relates two or more variables, so that the value of one variable may be predicted from given values of the other(s).
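  • 12. A Simple Linear Relationship: y = a + bx, where a is the intercept and b is the slope. [Figure: straight line with y on the vertical axis and x on the horizontal axis; a one-unit increase in x changes y by b.] Introduction to Regression (Cont)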
  • 13. Simple Linear Regression
  • 14. Basic Concept • Simple linear regression uses one independent (x) and one dependent variable (y) and produces a straight line. • Indicates to what extent the variables are associated; it does not show cause-and-effect. Scatter Diagram • Data plot with the y variable on the vertical axis and the x variable on the horizontal axis. Fitting a Line to the Data • Generally, the line will not fit the data perfectly. • Need to find the “best-fitting” line. Simple Linear Regression
  • 15. Objective: $\min \sum d_i^2 = \min \sum (y_i - \hat{y}_i)^2$, where $y_i$ = observed value of the dependent variable and $\hat{y}_i$ = estimated value of the dependent variable. The least squares criterion identifies the best-fitting line as the line that minimizes the sum of the squared vertical distances of points from the line. Simple Linear Regression (Cont)
  • 17. ▪ The slope of the least squares line is calculated as follows: $b = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x})^2}$ ▪ The intercept of the least squares line is calculated as follows: $a = \bar{y} - b\bar{x}$, where x = values of the independent variable, y = values of the dependent variable, $\bar{x}$ = mean of the x values, $\bar{y}$ = mean of the y values, n = the number of points (observations). Simple Linear Regression (Cont)
  • 18. The estimated regression equation is defined as follows: $\hat{y} = a + bx$, where $\hat{y}$ = estimated value of y for a given value of x, a = intercept, b = slope. Simple Linear Regression (Cont)
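As a quick check of the slope and intercept formulas, they can be evaluated by hand in R and compared with the coefficients returned by lm(). The five (x, y) pairs below are made-up illustrative values, not data from the lecture.

```r
# Least squares slope and intercept computed from the textbook formulas.
x <- c(1, 2, 3, 4, 5)              # hypothetical values of the independent variable
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)   # hypothetical values of the dependent variable
n <- length(x)                     # number of observations

# Slope: b = (sum(xy) - n * xbar * ybar) / (sum(x^2) - n * xbar^2)
b <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
# Intercept: a = ybar - b * xbar
a <- mean(y) - b * mean(x)

fit <- lm(y ~ x)                   # the same line fitted by R's built-in OLS routine

print(c(intercept = a, slope = b))
print(coef(fit))
```

Both print statements should report the same two numbers, since lm() minimises the same sum of squared vertical distances.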
  • 19. Summary and Regression Analysis in R
  • 20. • Simple linear regression is used to predict a quantitative outcome y on the basis of a single predictor variable x. • The objective is to formulate a model that defines y as a function of the x variable. • Once we have built a statistically significant model, it is possible to use it to predict future outcomes on the basis of new x values. • Suppose that we want to evaluate the impact of advertising budgets for three media (YouTube, Facebook and newspaper) on future sales. • This kind of problem can be modelled with linear regression in R. Simple Linear Regression in R Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 21. • The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where: ▪ b0 and b1 are known as the regression beta coefficients or parameters, as follows: ▪ b0 is the intercept of the regression line, consisting of the predicted value when x = 0. ▪ b1 is the slope of the regression line. ▪ e is the error term - also known as the residual errors, which refers to the part of y that cannot be explained by the regression model. Formula and Basics Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 22. • The figure below illustrates the linear regression model, where: ▪The best-fit regression line is in blue ▪The intercept b0 and the slope b1 are shown in green ▪The error terms (e) are represented by vertical red lines Formula and Basics (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 23. • From the scatter plot, it can be seen that not all the data points fall exactly on the fitted regression line. • Some of the points are above the blue line and some are below it. • Overall, the residual errors (e) have approximately mean zero. • The sum of the squares of the residual errors is called the Residual Sum of Squares or RSS. • The average variation of points around the fitted regression line is called the Residual Standard Error (RSE). • This is one of the metrics used to evaluate the overall quality of the fitted regression model. • The lower the RSE, the better the fit. Formula and Basics (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 24. • Since the mean error term is zero, the outcome variable y can be approximately estimated as follows: y ~ b0 + b1*x • Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as small as possible. • This method of determining the beta coefficients is called least squares regression or ordinary least squares (OLS) regression. • Once the beta coefficients are calculated, a t-test is performed to check whether or not these coefficients are statistically significantly different from zero. • Beta coefficients that are significantly different from zero indicate a statistically significant relationship between the predictor (x) and the outcome variable (y). Formula and Basics (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 25. Load the following required packages: • tidyverse: For data manipulation and visualisation • ggpubr: Easily creates publication-ready plots Loading Required R Packages Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 26. • We’ll use the marketing data set [datarium package]. It contains the impact of three advertising media (YouTube, Facebook and newspaper) on sales. • Data are the advertising budget in thousands of pounds along with the sales. • The advertising experiment has been repeated 200 times with different budgets and the observed sales have been recorded. • Firstly, install the datarium package using devtools::install_github("kassambara/datarium") Examples of Data and Problem Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
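The RSS and RSE described on this slide can be computed directly from the residuals of a fitted model. The sketch below uses simulated data (the variables `x` and `y` are assumptions for illustration) and checks the hand-computed RSE against R's sigma() function.

```r
# Residual Sum of Squares (RSS) and Residual Standard Error (RSE) by hand.
set.seed(1)
x <- runif(100, 0, 300)                   # hypothetical predictor values
y <- 8 + 0.05 * x + rnorm(100, sd = 4)    # hypothetical outcome with random noise
fit <- lm(y ~ x)

res <- residuals(fit)                     # vertical distances of points from the line
rss <- sum(res^2)                         # Residual Sum of Squares
rse <- sqrt(rss / (length(y) - 2))        # n - 2 degrees of freedom with one predictor

print(mean(res))     # numerically zero for a model with an intercept
print(rss)
print(rse)
print(sigma(fit))    # R's own RSE, for comparison
```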
  • 27. • Then load and inspect the marketing data as follows: • We want to predict future sales on the basis of the advertising budget spent on YouTube. Examples of Data and Problem (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
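The slide's code was an image and did not survive in this transcript. A sketch of the loading step might look as follows. It assumes the datarium package is installed; if it is not, the code falls back to a synthetic stand-in with the same two column names (`youtube`, `sales`), whose values are simulated and are not the real marketing data.

```r
# Load the marketing data if datarium is available; otherwise simulate a stand-in.
if (requireNamespace("datarium", quietly = TRUE)) {
  data("marketing", package = "datarium")   # real data set from the datarium package
} else {
  set.seed(123)
  youtube <- runif(200, 0, 350)             # hypothetical budgets (thousands of pounds)
  marketing <- data.frame(
    youtube = youtube,
    sales   = 8 + 0.05 * youtube + rnorm(200, sd = 4)  # hypothetical sales
  )
}

print(head(marketing, 4))   # first four rows
str(marketing)              # column names and types
```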
  • 28. • Let’s create a scatter plot displaying the sales units versus YouTube advertising budget. • In addition, let’s add a smoothed line, using the following code: • This graph suggests a linearly increasing relationship between the sales and the YouTube variables. • This is good because one important assumption of the linear regression is that the relationship between the outcome and predictor variables is linear and additive. Visualisation Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
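The plotting code on this slide was also an image. The sketch below produces a comparable scatter plot with a smoothed line using base R graphics and lowess(), which is a substitute for the ggplot2 code the slide refers to, not a copy of it. The data are simulated for illustration.

```r
# Scatter plot of sales against YouTube budget with a smoothed line (base graphics).
set.seed(123)
youtube <- runif(200, 0, 350)                      # hypothetical budgets
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 4) # hypothetical sales

plot(youtube, sales,
     xlab = "YouTube advertising budget", ylab = "Sales",
     pch = 19, col = "grey40")
smooth <- lowess(youtube, sales)    # locally weighted smoother
lines(smooth, col = "blue", lwd = 2)
```

If the smoothed line is close to straight, the linearity assumption mentioned on the slide looks reasonable for these data.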
  • 29. • Let’s also compute the correlation coefficient between the two variables using the R function cor() • The correlation coefficient r measures the level of association between two variables x and y, ranging between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases). • A value closer to 0 suggests a weak relationship between the variables. • A low correlation (-0.2 < r < 0.2) probably suggests that much of the variation of the outcome variable (y) is not explained by the predictor (x). In such a case, we should probably look for better predictor variables. • In our example, the correlation coefficient is large enough, so we can continue by building a linear model of y as a function of x. Examples of Data and Problem (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 30. • Simple linear regression tries to find the best line to predict sales on the basis of the YouTube advertising budget. • The linear model equation can be written as follows: sales = b0 + b1 * youtube • The R function lm() can be used to determine the beta coefficients of the linear model, as follows: • The results show the intercept and the beta coefficient for the YouTube variable. Computation Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
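A minimal sketch of the cor() call, again on simulated data (the vectors `youtube` and `sales` are illustrative assumptions). The last line relies on a standard property of simple linear regression: the squared correlation equals the share of variance in y explained by x.

```r
# Pearson correlation between two numeric variables.
set.seed(123)
youtube <- runif(200, 0, 350)                      # hypothetical budgets
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 4) # hypothetical sales

r <- cor(sales, youtube)   # Pearson correlation (the default method)
print(r)                   # always between -1 and +1
print(abs(r) < 0.2)        # TRUE would flag a weak relationship
print(r^2)                 # share of variance explained in simple linear regression
```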
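The lm() call itself was shown as an image on the slide. A sketch of the fitting step is given below on a simulated data frame named `marketing_sim` (a hypothetical stand-in); with the real marketing data the slides report an intercept of 8.44 and a slope of 0.048, and the simulated estimates will differ from those.

```r
# Fit sales = b0 + b1 * youtube by ordinary least squares.
set.seed(123)
marketing_sim <- data.frame(youtube = runif(200, 0, 350))   # hypothetical budgets
marketing_sim$sales <- 8 + 0.05 * marketing_sim$youtube +
  rnorm(200, sd = 4)                                        # hypothetical sales

model <- lm(sales ~ youtube, data = marketing_sim)  # formula: outcome ~ predictor
print(model)          # shows the call and both coefficients
print(coef(model))    # named vector: (Intercept) and youtube
```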
  • 31. From the output on the previous slide we have the following: • The estimated regression line equation can be written as follows: sales = 8.44 + 0.048*youtube • The intercept b0 is 8.44. It can be interpreted as the predicted sales for a zero YouTube advertising budget. • Recall that we are operating in units of thousand pounds. This means that, for a YouTube advertising budget equal to zero, we can expect sales of 8.44 * 1,000 = 8,440 pounds. • The regression beta coefficient for the variable YouTube, b1, also known as the slope, is 0.048. This means that each additional unit of youtube (that is, each additional 1,000 pounds of budget) is associated with an increase of 0.048 units in sales, i.e. 0.048 * 1,000 = 48 pounds. • For example, setting youtube = 1000 in the equation gives sales = 8.44 + 0.048*1000 = 56.44 units. • As we are operating in units of thousand pounds, this represents sales of 56,440 pounds for a budget of 1,000 thousand pounds. Interpretation Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 32. • To add the regression line onto the scatter plot, you can use the function stat_smooth() [ggplot2]. • By default, the fitted line is presented with a confidence interval around it. The confidence bands reflect the uncertainty about the line. • If you don’t want to display it, specify the option se = FALSE in the function stat_smooth(). Regression Line Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 33. • In the previous slides, we built a linear model of sales as a function of YouTube advertising budget: sales = 8.44 + 0.048*youtube • Before using this formula to predict future sales, you should make sure that this model is statistically significant, that is: ▪There is a statistically significant relationship between the predictor and the outcome variables ▪The model that we built fits the data at hand well. • Therefore, in the next slides we explain how to check the quality of a linear regression model. Model Assessment Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
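The arithmetic on this slide can be reproduced by plugging budgets into the estimated equation. The helper `predict_sales` is a hypothetical name introduced here; the two coefficients are the rounded values reported on the slide.

```r
# Evaluate the estimated regression line sales = 8.44 + 0.048 * youtube.
b0 <- 8.44    # intercept reported on the slide
b1 <- 0.048   # slope reported on the slide

predict_sales <- function(youtube) b0 + b1 * youtube   # hypothetical helper

budgets <- c(0, 1, 100, 1000)     # values of youtube (units of thousand pounds)
print(predict_sales(budgets))     # 8.44, 8.488, 13.24, 56.44

# A one-unit increase in youtube always changes the prediction by b1.
print(predict_sales(101) - predict_sales(100))
```

With a fitted lm object, the equivalent call would be predict(model, newdata = data.frame(youtube = budgets)).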
  • 34. • We start by displaying the statistical summary of the model using the R function summary() Model Summary Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem The R summary outputs shows 6 components include: Call. Shows the function call used to compute the regression model. Residuals. Provide a quick view of the distribution of the residuals, which by definition have a mean zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value. Coefficients. Shows the regression beta coefficients and their statistical significance. Predictor variables, that are significantly associated to the outcome variable, are marked by stars. Residual standard error (RSE), R-squared (R2) and the F-statistic are metrics that are used to check how well the model fits to our data.
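The components listed on this slide can also be pulled out of the summary object individually. The sketch below uses simulated data (an assumption for illustration), so the numbers will not match the marketing example.

```r
# Inspect the pieces of a linear model summary.
set.seed(123)
youtube <- runif(200, 0, 350)                      # hypothetical budgets
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 4) # hypothetical sales
fit <- lm(sales ~ youtube)

s <- summary(fit)
print(s)                  # full printed summary, including significance stars

print(s$call)             # Call
print(summary(s$residuals))  # five-number view of the residuals
print(s$coefficients)     # estimates, standard errors, t values, p-values
print(s$sigma)            # residual standard error (RSE)
print(s$r.squared)        # R-squared
print(s$fstatistic)       # F-statistic with its degrees of freedom
```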
  • 35. • The coefficients table, in the model statistical summary, shows: ▪The estimates of the beta coefficients. ▪The standard errors (SE), which defines the accuracy of beta coefficients. For a given beta coefficient, the SE reflects how the coefficient varies under repeated sampling. It can be used to compute the confidence intervals and the t-statistic. ▪The t-statistic and the associated p-value, which defines the statistical significance of the beta coefficients. Coefficients Significance Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 36. • For a given predictor, the t-statistic (and its associated p-value) tests whether or not there is a statistically significant relationship between a given predictor and the outcome variable. • The statistical hypotheses are as follow: • Null hypothesis (H0): The coefficients are equal to zero (i.e. no relationship between x and y) • Alternative Hypothesis (Ha): The coefficients are not equal to zero (i.e. there is some relationship between x and y) • Mathematically, for a given beta coefficient (b), the t-test is computed as t = (b - 0)/SE(b), where SE(b) is the standard error of the coefficient b. • The t-statistic measures the number of standard deviations that b is away from 0. Therefore, a large t-statistic produces a small p-value. t-statistic and p-values Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
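The formula t = (b - 0)/SE(b) can be verified against the coefficients table. This sketch uses simulated data (an illustrative assumption) and a two-sided p-value from the t distribution with the model's residual degrees of freedom.

```r
# Recompute the t-statistic and p-value for the slope by hand.
set.seed(123)
youtube <- runif(200, 0, 350)                      # hypothetical budgets
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 4) # hypothetical sales
fit <- lm(sales ~ youtube)

est <- coef(summary(fit))             # matrix: Estimate, Std. Error, t value, Pr(>|t|)
b   <- est["youtube", "Estimate"]
se  <- est["youtube", "Std. Error"]

t_stat <- (b - 0) / se                                   # distance from 0 in SE units
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit),
                 lower.tail = FALSE)                     # two-sided p-value

print(c(t = t_stat, p = p_val))
print(est["youtube", c("t value", "Pr(>|t|)")])          # values reported by summary()
```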
  • 37. • The larger the t-statistic in absolute value, and consequently the lower the p-value, the more significant the predictor. • The symbols to the right of the table indicate the level of significance. The line below the table shows the definition of these symbols. For example, one star means 0.01 < p < 0.05. The more stars beside a variable’s p-value, the more significant the variable. • A statistically significant coefficient indicates that there is a statistically significant association between the predictor (x) and the outcome (y) variable. t-statistic and p-values (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
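The star legend printed by R ("Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1") can be written out as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and the handling of p-values that fall exactly on a cut-off is an assumption rather than a statement about R's internals.

```python
def significance_stars(p):
    """Map a p-value to the significance symbol used in R's coefficient table."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    if p < 0.1:
        return "."
    return ""  # not significant at the 10% level


for p in (1.46e-42, 0.004, 0.03, 0.08, 0.5):
    print(f"p = {p:<10g} {significance_stars(p)}")
```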
  • 38. • In our example, the p-values for both the intercept and the predictor variable are highly significant. • Thus, we can reject the null hypothesis in favour of the alternative hypothesis, which means that there is a significant association between the predictor and the outcome variables. • The t-statistic is a very useful guide for whether or not to include a predictor in a model. High t-statistics in absolute value (i.e. p-values near 0) indicate that a predictor should be retained in a model, while t-statistics close to zero indicate that a predictor variable could be dropped. t-statistic and p-values (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 39. • The standard error measures the variability/accuracy of the beta coefficients. • It can be used to compute the confidence intervals of the coefficients. • For example, the 95% confidence interval for the coefficient b1 is approximately b1 +/- 2*SE(b1), where: ▪ The lower limit of b1 = b1 - 2*SE(b1) = 0.047 - 2*0.00269 = 0.042 ▪ The upper limit of b1 = b1 + 2*SE(b1) = 0.047 + 2*0.00269 = 0.052 • That is, there is approximately a 95% chance that the interval [0.042, 0.052] will contain the true value of b1. Standard Errors and Confidence Intervals Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
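The interval arithmetic above can be reproduced directly. This is a minimal Python sketch with a hypothetical function name. The multiplier of 2 is the rule of thumb used on the slide; it approximates the normal quantile of 1.96.

```python
def approx_95_ci(estimate, se, multiplier=2.0):
    """Approximate 95% confidence interval: estimate +/- multiplier * SE."""
    return estimate - multiplier * se, estimate + multiplier * se


# Slope estimate and standard error as quoted on the slide.
lower, upper = approx_95_ci(0.047, 0.00269)
print(f"[{lower:.3f}, {upper:.3f}]")  # prints [0.042, 0.052]
```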
  • 40. • Once you have identified that at least one predictor variable is significantly associated with the outcome, you should continue the diagnostic by checking how well the model fits the data. • This process is also referred to as assessing the goodness-of-fit. • The overall quality of the linear regression fit can be assessed using the following three quantities, displayed in the model summary: 1. Residual Standard Error (RSE) 2. R-squared (R2) 3. F-statistic Model Accuracy Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
  • 41. • The RSE (also known as the model sigma) is the residual variation, representing the average variation of the observation points around the fitted regression line. • This is the standard deviation of the residual errors. • RSE provides an absolute measure of the patterns in the data that cannot be explained by the model. • When comparing two models, a smaller RSE is a good indication that the model fits the data better. • Dividing the RSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible. Model Accuracy 1: Residual Standard Error (RSE) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
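Written out for a simple linear regression with $n$ observations, the RSE and the prediction error rate described above are as follows. The symbols $\hat{y}_i$ (fitted value), $\mathrm{RSS}$ (residual sum of squares) and $\bar{y}$ (mean outcome) are notation introduced here, not taken from the slides.

```latex
\[
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}},
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2},
\qquad
\text{prediction error rate} = \frac{\mathrm{RSE}}{\bar{y}}
\]
```

The divisor $n - 2$ reflects the two estimated parameters, the intercept and the slope.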
  • 42. • In our example, RSE = 3.91, meaning that the observed sales values deviate from the true regression line by approximately 3.9 units on average. • Whether or not an RSE of 3.9 units is an acceptable prediction error is subjective and depends on the problem context. • However, we can calculate the percentage error. In our data set, the mean value of sales is 16.827, and so the percentage error is 3.9/16.827 = 23%. Model Accuracy 1: Residual Standard Error (RSE) (Cont.) Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
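As a quick check of the percentage error, the division can be done in Python. The function name is hypothetical; the inputs are the RSE and mean sales quoted above.

```python
def prediction_error_rate(rse, mean_outcome):
    """Residual standard error expressed as a share of the mean outcome."""
    return rse / mean_outcome


rate = prediction_error_rate(3.91, 16.827)
print(f"Prediction error rate: {rate:.1%}")
```

Using the unrounded RSE of 3.91 rather than 3.9 leaves the result at roughly 23%.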
  • 43. • The R-squared (R2) ranges from 0 to 1 and represents the proportion of information (i.e. variation) in the data that can be explained by the model. • The adjusted R-squared adjusts for the degrees of freedom. • The R2 measures how well the model fits the data. • For a simple linear regression, R2 is the square of the Pearson correlation coefficient. • A large value of R2 is a good indication. However, as the value of R2 tends to increase when more predictors are added to the model, such as in a multiple linear regression model, you should mainly consider the adjusted R-squared, which is R2 penalized for a higher number of predictors. ▪ An (adjusted) R2 that is close to 1 indicates that a large proportion of the variability in the outcome has been explained by the regression model. ▪ A number near 0 indicates that the regression model did not explain much of the variability in the outcome. Model Accuracy 2: R-squared and Adjusted R-squared Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
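The claim that R2 equals the squared Pearson correlation in simple linear regression can be verified numerically. The Python sketch below uses a small made-up data set (not the data from the slides) and hypothetical helper names. The adjusted R2 follows the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), with p the number of predictors.

```python
from statistics import fmean


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    b0, b1 = fit_line(x, y)
    y_bar = fmean(y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def adjusted_r_squared(r2, n, n_predictors):
    """R-squared penalized for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r2 = r_squared(x, y)
print(f"R2 = {r2:.4f}, r^2 = {pearson_r(x, y) ** 2:.4f}")
print(f"Adjusted R2 = {adjusted_r_squared(r2, len(x), 1):.4f}")
```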
  • 44. • The F-statistic gives the overall significance of the model. It assesses whether at least one predictor variable has a non-zero coefficient. • In a simple linear regression, this test is not really interesting since it just duplicates the information given by the t-test, available in the coefficient table. • In fact, the F-statistic equals the square of the t-statistic, up to rounding: 312.1 ≈ (17.67)^2. This holds in any model with a single predictor (1 degree of freedom). • The F-statistic becomes more important once we start using multiple predictors, as in multiple linear regression. • A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 312.14, producing a p-value of 1.46e-42, which is highly significant. Model Accuracy 3: F-statistic Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem
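The relationship between the F-statistic and the t-statistic can be checked with the two values reported above. This is plain arithmetic in Python. The small gap between the two numbers arises because the t-value shown in the summary is rounded to two decimals.

```python
t_value = 17.67      # t-statistic of the predictor, as reported
f_reported = 312.14  # F-statistic, as reported

f_from_t = t_value ** 2
print(f"t^2 = {f_from_t:.2f}, reported F = {f_reported}")
print(f"difference = {abs(f_from_t - f_reported):.2f}")
```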
  • 45. • After computing a regression model, a first step is to check whether at least one predictor is significantly associated with the outcome variable. • If one or more predictors are significant, the second step is to assess how well the model fits the data by inspecting the Residual Standard Error (RSE), the R2 value and the F-statistic. • These metrics give the overall quality of the model. Summary Source: http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem Residual Standard Error (RSE): the closer to zero the better. R-squared: the larger the better. F-statistic: the larger the better.
  • 46. • Bootstrapping is a popular statistical procedure that resamples a single dataset to create many simulated samples in order to calculate standard errors, build confidence intervals, and perform hypothesis testing. • Simple linear regression is used to predict a quantitative outcome y on the basis of a single predictor variable x. • The objective is to formulate a model that defines y as a function of the x variable. • Once we have built a statistically significant model, it is possible to use it to predict future outcomes on the basis of new x values. • The mathematical formula of the linear regression can be written as y = b0 + b1*x + e, where b0 and b1 are the regression parameters and e is the error term that refers to the part of y that cannot be explained by the regression model. • The larger the t-statistic in absolute value, and consequently the lower the p-value, the more significant the predictor variable x. • The overall quality of the linear regression fit can be assessed using the following three quantities: Residual Standard Error RSE (the closer to zero the better), R-squared (the larger the better), and F-statistic (the larger the better). Takeaways
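To connect the first two takeaways, the sketch below bootstraps the slope b1 of a simple linear regression: it resamples (x, y) pairs with replacement, refits the line each time, and uses the spread of the resampled slopes as a standard error and a percentile confidence interval. This is a minimal Python illustration under stated assumptions: the data are made up, the function names are hypothetical, and pairs resampling with a percentile interval is only one of several bootstrap variants.

```python
import random
from statistics import fmean, stdev


def slope(pairs):
    """Least-squares slope b1 for y = b0 + b1*x; None if all x are equal."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # slope is undefined for a degenerate resample
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    return sxy / sxx


def bootstrap_slopes(pairs, n_resamples=2000, seed=1):
    """Refit the slope on resamples of the (x, y) pairs drawn with replacement."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_resamples:
        estimate = slope(rng.choices(pairs, k=len(pairs)))
        if estimate is not None:
            slopes.append(estimate)
    return slopes


# Made-up illustrative data: (x, y) pairs.
data = [(1.0, 2.3), (2.0, 3.8), (3.0, 6.4), (4.0, 7.7),
        (5.0, 10.5), (6.0, 11.6), (7.0, 14.4), (8.0, 15.9)]

slopes = sorted(bootstrap_slopes(data))
se_boot = stdev(slopes)                       # bootstrap standard error
lower = slopes[int(0.025 * len(slopes))]      # 2.5th percentile
upper = slopes[int(0.975 * len(slopes)) - 1]  # 97.5th percentile

print(f"slope on original data: {slope(data):.3f}")
print(f"bootstrap SE: {se_boot:.3f}")
print(f"95% percentile interval: [{lower:.3f}, {upper:.3f}]")
```

The fixed seed only makes the run reproducible; a different seed gives slightly different numbers.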
  • 47. • Brooks, C. (2019). Introductory Econometrics for Finance. Cambridge University Press. • Bruce, P., & Bruce, A. (2017). Practical Statistics for Data Scientists. O’Reilly Media. • Evans, J. R., & Olson, D. L. (2007). Statistics, Data Analysis, and Decision Modeling. New Jersey: Pearson/Prentice Hall. • Freed, N., Jones, S., & Bergquist, T. (2013). Understanding Business Statistics. Wiley Global Education. • http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/#examples-of-data-and-problem • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R. Springer. • Render, B., Stair Jr, R. M., Hanna, M. E., & Hale, T. S. (2018). Quantitative Analysis for Management, 13e. Prentice Hall. References
  • 48. Basics of Data • Statistics is the science of data. • Data consist of the facts or figures that are the subject of summarisation, analysis, modelling, and presentation. • A dataset is a collection of data with some common connection. For instance, the GDP of European countries from 2010 to 2020. • A variable is a particular characteristic of interest within a group of observations. For instance, the GDP of Germany. • An observation (observational unit or case) is a single value of a variable. An example is the GDP of Germany in 2020. Any Questions?
  • 49. Thank You!