1. Data Analytic: The Use of Statistical Methods
Univariate and Multivariate Regressions
Hasan Dwi Cahyono1
1Informatika
Universitas Sebelas Maret (UNS) Surakarta
Business Intelligence Course, February 2023
Cahyono, Hasan D. (UNS) Informatika UNS BI 2023 1 / 23
2. Table of Contents
1 One Estimator (Univariate)
2 Multiple Linear Regression (Multivariate)
3 Reference
4. The coefficient estimate: How to measure the accuracy?
After obtaining β̂0 and β̂1 (previous meeting), we need to know how accurate our estimates are. To perform this assessment, let's make a plot:
Figure 1: The linear function is Y = 10 + 5X + error
Each of those lines provides a reasonable estimate.
So, which line provides the best estimate?
5. We can see from the previous plot that the true (population) regression line (red) is surrounded by some (ten) estimates.
The ideal comparison would be against the real-world (population) data, which is usually difficult to obtain.
We can start by estimating the true mean µ using the sample mean µ̂. This works because µ̂ will not systematically over- or under-estimate µ (it is unbiased).
So, using a sufficiently large sample to estimate µ̂ (for each observation), the average estimate will equal µ exactly.
Var(µ̂) = SE(µ̂)² = σ²/n    (1)
where σ² = Var(ε). By the Law of Large Numbers (LLN), not covered in this course, the standard error decreases as n grows, so µ̂ gets closer to µ.
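Eq. (1) can be checked with a quick simulation. The sketch below (all numbers are illustrative, not from the slides) draws many samples of size n and compares the empirical spread of the sample mean with the theoretical σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 5.0          # illustrative population mean and std dev

for n in (25, 100, 400):
    # 20,000 samples of size n; the spread of their means should be ~ sigma/sqrt(n)
    means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    print(n, round(means.std(), 3), round(sigma / np.sqrt(n), 3))
```

The empirical standard deviation of the sample means tracks σ/√n closely at every n, and shrinks as n grows, as Eq. (1) predicts.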
6. Measure how close our β̂0 and β̂1 are to β0 and β1
To make estimating the intercept β0 and slope β1 more convenient, we will assume that X is fixed (not random). Remember that:
β̂1 = Σi (Xi − X̄)(Yi − Ȳ) / Σi (Xi − X̄)²    (2)
So, we can have:
SE(β̂1)² = Var( Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² )    (3)
As
Σi (xi − x̄) ȳ = ȳ Σi (xi − x̄) = ȳ ( (Σi xi) − nx̄ ) = ȳ (nx̄ − nx̄) = 0    (4)
Therefore, we have:
Σ (xi − x̄)(yi − ȳ) = Σ (xi − x̄) yi    (5)
7. Measure how close our β̂0 and β̂1 are to β0 and β1
Var(β̂1) = Var( Σi (xi − x̄) yi / Σi (xi − x̄)² )
        = [ 1 / (Σi (xi − x̄)²)² ] Var( Σi (xi − x̄) yi )
        = [ 1 / (Σi (xi − x̄)²)² ] Σi (xi − x̄)² Var(yi)
        = [ 1 / (Σi (xi − x̄)²)² ] Σi (xi − x̄)² σ²
        = [ 1 / (Σi (xi − x̄)²)² ] σ² Σi (xi − x̄)²
        = σ² / Σi (xi − x̄)²    (6)
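Eq. (6) can be verified numerically. The sketch below (values are illustrative) treats X as fixed, as the slides assume, and compares the closed-form SE(β̂1) against a Monte Carlo estimate of the slope's spread:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 2.0
x = np.linspace(0.0, 10.0, n)     # fixed design, as assumed in the slides
xc = x - x.mean()

# Theoretical value from Eq. (6): SE(b1) = sigma / sqrt(sum (x_i - xbar)^2)
se_theory = sigma / np.sqrt((xc ** 2).sum())

# Monte Carlo: refit the slope on many noisy draws of y = 10 + 5x + eps
slopes = []
for _ in range(20_000):
    y = 10 + 5 * x + rng.normal(0.0, sigma, n)
    slopes.append((xc * y).sum() / (xc ** 2).sum())   # uses the Eq. (5) form
se_mc = float(np.std(slopes))
print(se_theory, se_mc)
```

The simulated spread of β̂1 across repeated samples matches the closed-form standard error to within Monte Carlo noise.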
8. Measure how close our β̂0 and β̂1 are to β0 and β1
Let's start with the definition of β̂0:
β̂0 = ȳ − β̂1 x̄    (7)
So,
Var(β̂0) = Var(ȳ − β̂1 x̄)
        = Var(ȳ) + Var(−β̂1 x̄)
        = Var(ȳ) + (−x̄)² Var(β̂1)
        = Var(ȳ) + x̄² σ² / Σⁿi=1 (xi − x̄)²
        = Var( (1/n) Σⁿi=1 yi ) + x̄² σ² / Σⁿi=1 (xi − x̄)²
        = Var( (1/n) Σⁿi=1 (β0 + β1 xi + εi) ) + x̄² σ² / Σⁿi=1 (xi − x̄)²    (8)
As the εi are independent (uncorrelated), Σⁿi=1 Var(β0 + β1 xi + εi) = Σⁿi=1 Var(εi).
9. Measure how close our β̂0 and β̂1 are to β0 and β1
Let's start with the definition of β̂0:
β̂0 = ȳ − β̂1 x̄    (9)
So,
Var(β̂0) = (1/n²) Σⁿi=1 Var(εi) + x̄² σ² / Σⁿi=1 (xi − x̄)²
        = (1/n²) Σⁿi=1 σ² + x̄² σ² / Σⁿi=1 (xi − x̄)²
        = (1/n) σ² + x̄² σ² / Σⁿi=1 (xi − x̄)²
        = σ² [ 1/n + x̄² / Σⁿi=1 (xi − x̄)² ]    (10)
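Eq. (10) can be checked the same way as the slope's variance. A Monte Carlo sketch (illustrative values, fixed design):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 2.0
x = np.linspace(0.0, 10.0, n)
xc = x - x.mean()

# Theoretical value from Eq. (10): SE(b0) = sigma * sqrt(1/n + xbar^2 / sum (x_i - xbar)^2)
se_theory = sigma * np.sqrt(1.0 / n + x.mean() ** 2 / (xc ** 2).sum())

# Monte Carlo: refit the intercept on many noisy draws of y = 10 + 5x + eps
intercepts = []
for _ in range(20_000):
    y = 10 + 5 * x + rng.normal(0.0, sigma, n)
    b1 = (xc * y).sum() / (xc ** 2).sum()
    intercepts.append(y.mean() - b1 * x.mean())       # Eq. (9)
se_mc = float(np.std(intercepts))
print(se_theory, se_mc)
```

Again the simulated spread of β̂0 agrees with the closed-form standard error.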
10. Figure 2: Confidence interval (Source: Quora).
11. The range of β̂0 and β̂1: Confidence Interval
The common choice is a 95% confidence interval for either β̂0 or β̂1, denoted as:
β̂1 ± 2 · SE(β̂1)
β̂0 ± 2 · SE(β̂0)    (11)
We can also use the Standard Error (SE) to perform hypothesis tests:
H0 : There is no relationship between X and Y, or β1 = 0 (null hypothesis)
H1 : There exists a relationship between X and Y, or β1 ≠ 0 (alternate hypothesis)
Figure 3: The difference between the normal distribution and the t-distribution
Typically, we focus on the null hypothesis rather than the alternate hypothesis. In this regard, we measure how far β̂1 is from 0, which we need in order to decide whether β̂1 ≠ 0.
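As a sketch of Eq. (11) in code, using the ±2·SE rule (the dataset below is made up for illustration; σ is estimated by the residual standard error with n − 2 degrees of freedom):

```python
import numpy as np

# Illustrative data, roughly y = 2x with noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.4])

n = len(x)
xc = x - x.mean()
b1 = (xc * (y - y.mean())).sum() / (xc ** 2).sum()   # Eq. (2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
rse = np.sqrt((resid ** 2).sum() / (n - 2))          # estimate of sigma
se_b1 = rse / np.sqrt((xc ** 2).sum())               # Eq. (6) with sigma -> rse

ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)                # Eq. (11)
print(b1, ci)
```

The interval excludes 0 here, which is informally consistent with rejecting H0 : β1 = 0 for this toy dataset.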
12. t-statistic: to measure how far β̂1 is from zero
We can use the t-distribution as:
t = (β̂1 − 0) / SE(β̂1)    (12)
When there's no relationship between X and Y, t will follow a t-distribution.
We can use the t-distribution to compute the probability of observing any value at least as large as |t|.
This probability is known as the p-value, with a common threshold α of 0.05.
Example: suppose a two-tailed test gives a t-value of -1.608761 and the corresponding p-value is 0.121926. Since the p-value exceeds the threshold of 0.05, we fail to reject the null hypothesis. This means we do not have enough evidence to conclude that X has a relationship with Y; it does not mean we accept the alternate hypothesis.
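The whole test of Eq. (12) can be written in a few lines. The sketch below uses illustrative random data with a deliberately weak slope, and SciPy's t-distribution tail for the two-sided p-value (in simple regression the statistic has n − 2 degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 22
x = rng.normal(size=n)
y = 1.0 + 0.1 * x + rng.normal(size=n)   # weak true slope, illustrative

xc = x - x.mean()
b1 = (xc * (y - y.mean())).sum() / (xc ** 2).sum()
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se_b1 = np.sqrt((resid ** 2).sum() / (n - 2)) / np.sqrt((xc ** 2).sum())

t = (b1 - 0) / se_b1                      # Eq. (12)
p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p-value
print(t, p)
```

This hand-rolled t and p agree exactly with what `scipy.stats.linregress` reports for the same data.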
14. From simple to multiple linear regression
Univariate regression works well when the data only needs a single predictor. In real-world applications, however, more variables are commonly involved. Thus, we need to extend the simple univariate regression to involve more predictors. We can write the multiple linear regression model as:
Y = β0 + β1X1 + β2X2 + . . . + βpXp + ε    (13)
Just as with simple univariate regression, we first need to estimate the regression coefficients:
ŷ = β̂0 + β̂1X1 + β̂2X2 + . . . + β̂pXp    (14)
with the goal of minimizing the residual sum of squares (RSS):
RSS = Σⁿi=1 (yi − ŷi)² = Σⁿi=1 (yi − β̂0 − β̂1xi1 − β̂2xi2 − . . . − β̂pxip)²    (15)
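Minimizing the RSS of Eq. (15) is exactly what least squares does. A minimal sketch using NumPy's `lstsq` on a small synthetic dataset (all names and numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ beta_true + rng.normal(0, 0.3, n)   # Eq. (13) with beta0 = 4

# Prepend a column of ones so beta_hat[0] plays the role of beta0
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes the RSS

rss = ((y - A @ beta_hat) ** 2).sum()             # Eq. (15) at the minimum
print(beta_hat, rss)
```

With only mild noise, the fitted coefficients land close to the true ones, and the residual RSS is on the order of n·σ².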
15. Finding a relationship between Response and Predictor
We need to test the null hypothesis
H0 : β1 = β2 = . . . = βp = 0    (16)
against the alternative
H1 : at least one βj is non-zero    (17)
The hypothesis test is performed by calculating the F-statistic
F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ]    (18)
where TSS = Σⁿi=1 (yi − ȳ)² is the total sum of squares.
We can consider TSS as the amount of total variability in the response variable before any model is fitted to it.
On the other hand, RSS is computed after fitting a model, and measures the amount of unexplained variance remaining in the data.
16. Assuming that the linear model is correct, we can show that:
E{RSS/(n − p − 1)} = σ²    (19)
and when H0 is true,
E{(TSS − RSS)/p} = σ²    (20)
In brief, if H0 were true, we would expect the regression coefficients to be 0.
If H0 were true, the unexplained variance of the model would be approximately equal to the total variance, so the numerator and denominator of the F-statistic would have roughly equal values.
Hence, when there's no relationship between the response and the predictors, the F-statistic will take a value close to 1.
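Eq. (18) is straightforward to compute from a fit. The sketch below (synthetic, illustrative data where H0 is clearly false) builds F from the TSS and RSS and reads off its p-value from the F-distribution tail:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

rss = ((y - A @ beta_hat) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()
F = ((tss - rss) / p) / (rss / (n - p - 1))       # Eq. (18)
p_value = stats.f.sf(F, p, n - p - 1)             # upper-tail probability
print(F, p_value)
```

Because two of the three true coefficients are far from zero, F comes out much larger than 1 and the p-value is tiny, so H0 is rejected.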
17. Selecting the important variables
In a multiple regression, computing the F-statistic is the first thing we do; it tells us whether at least one of the predictors is related to the response Y.
The next step is variable selection. In principle, this involves fitting and comparing regression models for every combination of predictors, of which there are 2^p.
18. A common approach involves three different ways:
Forward selection: first, we select a model without any predictor (the null model). Then, we fit p simple linear regressions, adding each predictor to the null model one at a time, and keep the model with the lowest RSS. Next, we add another variable, again choosing the one giving the lowest RSS. We continue doing this until a predetermined stopping rule is fulfilled.
Backward selection: we start with all the predictors in the model and then remove the variable with the largest p-value. We refit with the remaining (p − 1) predictors and repeat this elimination step until a stopping criterion is met (typically a p-value threshold).
Mixed selection: we start as in forward selection, adding predictors one by one. But whenever an added predictor's p-value rises above the threshold, we remove it. This process continues forward and backward until the model contains only predictors with relatively low p-values.
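The forward-selection loop above can be sketched in a few lines. This is a minimal illustration that greedily adds the predictor that most reduces the RSS; a real implementation would also apply a stopping rule (e.g. a p-value threshold or adjusted R²), and all data here are synthetic:

```python
import numpy as np

def rss_of(cols, y):
    """RSS of a least-squares fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return ((y - A @ beta) ** 2).sum()

def forward_select(X, y, k):
    """Greedily pick k predictors, each time adding the one with lowest RSS."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k:
        best = min(remaining,
                   key=lambda j: rss_of([X[:, i] for i in chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] - 2 * X[:, 0] + rng.normal(0, 0.5, 200)  # only columns 2 and 0 matter
sel = forward_select(X, y, 2)
print(sel)
```

On this data the procedure picks column 2 first (the strongest predictor) and then column 0, ignoring the three pure-noise columns.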
19. Model Fitting
To decide the best-fit model, we need a measure. The residual standard error (RSE) and R² are the common metrics.
RSE = √( RSS/(n − 2) ) = √( (1/(n − 2)) Σⁿi=1 (yi − ŷi)² )    (21)
R² = 1 − (Unexplained Variation / Total Variation) = 1 − RSS/TSS    (22)
How about trying to derive the RSE and R² yourselves?
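Both metrics drop out of a fit directly. A short sketch of Eqs. (21) and (22) for a simple regression (the dataset is made up and roughly follows y = 1 + 2x):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.0, 5.1, 6.8, 9.2, 10.9, 13.2, 14.8, 17.1])

n = len(x)
xc = x - x.mean()
b1 = (xc * (y - y.mean())).sum() / (xc ** 2).sum()
b0 = y.mean() - b1 * x.mean()

rss = ((y - (b0 + b1 * x)) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()

rse = np.sqrt(rss / (n - 2))   # Eq. (21): typical size of a residual, in units of y
r2 = 1 - rss / tss             # Eq. (22): fraction of variance explained
print(rse, r2)
```

Here R² is close to 1 and the RSE is small relative to the scale of y, which is what a good linear fit looks like on near-linear data.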
20. RSE
We can use the RSE to measure how badly our model fits the data.
When the RSE is small relative to the scale of the response, we can say that the model fits the data quite well, and vice versa.
However, you need some general understanding of the data to judge this: the RSE is expressed in the units of Y, so on its own it carries little information. Use the RSE wisely.
R²
Closer to 1 means a better fit.
You need to think carefully when using R², since adding more variables always tends to increase it. Adding variables that increase R² only marginally can lead to over-fitting.
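The over-fitting caveat can be demonstrated directly: on the training data, adding predictors can never decrease R², even when the predictors are pure noise. An illustrative simulation:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)          # only x is a real predictor

def r2(A, y):
    """Training R^2 of a least-squares fit on design matrix A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = ((y - A @ beta) ** 2).sum()
    return 1 - rss / ((y - y.mean()) ** 2).sum()

A = np.column_stack([np.ones(n), x])
scores = [r2(A, y)]
for _ in range(10):                      # append 10 useless noise predictors
    A = np.column_stack([A, rng.normal(size=n)])
    scores.append(r2(A, y))
print(scores)
```

The sequence of R² values is non-decreasing even though nothing meaningful was added, which is why a marginal R² gain is not evidence that a variable belongs in the model.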
21. Predictions
For any given model, we should consider 3 different things:
There will be inaccuracy when we estimate β̂0, β̂1, . . . , β̂p for the true coefficients β0, β1, . . . , βp. This inaccuracy is part of the reducible error. We can use a confidence interval to see how close Ŷ is to Y.
Model bias can occur when we use a linear approximation to the true surface f(X).
Even if we knew the true function f(X), we would still have the random error ε; this is unavoidable. We can use prediction intervals to estimate how far Y may be from Ŷ. Prediction intervals are always wider than confidence intervals.
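The widths of the two intervals can be compared in code. The formulas below are the standard textbook ones for simple linear regression (not given on the slide), and the data are illustrative; the extra "+1" under the prediction-interval square root is the irreducible-error term that makes it wider:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.4])
n = len(x)

xc = x - x.mean()
sxx = (xc ** 2).sum()
b1 = (xc * (y - y.mean())).sum() / sxx
b0 = y.mean() - b1 * x.mean()
rse = np.sqrt(((y - (b0 + b1 * x)) ** 2).sum() / (n - 2))

x0 = 3.5                                  # point at which we predict
t = stats.t.ppf(0.975, n - 2)             # 95% two-sided critical value

# Confidence interval half-width: uncertainty about the mean response f(x0)
half_ci = t * rse * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
# Prediction interval half-width: adds the irreducible error of one new Y
half_pi = t * rse * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
print(half_ci, half_pi)
```

For any x0, the prediction interval's half-width is strictly larger than the confidence interval's, matching the statement above.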
23. Reference
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2016). An introduction to statistical learning (Vol. 6). New York: Springer.
https://dionysus.psych.wisc.edu/iaml/unit-03.html