BUSINESS STATISTICS II
PART II: Lectures Weeks 11 – 19
Antonio Rivero Ostoic
School of Business and Social Sciences
 March −  May 
AARHUS
UNIVERSITYAU
BUSINESS STATISTICS II
Lecture – Week 11
Antonio Rivero Ostoic
School of Business and Social Sciences
 March 
AARHUS
UNIVERSITYAU
Today’s Outline
Simple regression analysis
Estimation in a simple regression model
 (we use now SPSS)
2 / 28
Introduction
Galton (Darwin’s half-cousin) found in his observations that:
– For short fathers, on average the son will be taller than his father
– For tall fathers, on average the son will be shorter than his father
Then he characterized these results with the notion of the
“regression to the mean”
Pearson and Lee took Galton’s law about the relationship between
heights of children and parents, and came up with the regression
line:
son’s height = 33.73 + .516 × father’s height
ª This equation shows that for each additional inch of the father’s height the son’s
height increases on average by .516 inches
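ª For example, for a father who is 70 inches tall the predicted son’s height is
33.73 + .516 × 70 ≈ 69.9 inches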
3 / 28
Regression Analysis
Regression analysis is used to predict one variable on the basis of
other variables
ª i.e. for forecasting
It relies on a model that describes the relationship between the
variable to be estimated and the variables that influence it
– Response variable is called dependent variable, y
– Explanatory variables are called independent variables,
x1, x2, . . . , xk
 Correlation analysis serves to determine whether a relationship
exists or not between variables
 Does regression imply causation?
4 / 28
Model
A model comprises mathematical equations that accurately
describe the nature of the relationship between DV and IVs
Example for a deterministic model:
F = P(1 + i)n
where
F = future value of an investment
P = present value
i = interest rate per period
n = number of periods
ª In this case we determine F from the values on the equation’s right-hand side
5 / 28
Probabilistic model
However, deterministic models can sometimes be unrealistic,
since other variables that are unmeasurable or unknown
can influence the dependent variable
Such variables represent uncertainty in real life and
should be included in the model
In this case we rather use a probabilistic model in order to
incorporate such randomness
A probabilistic model then incorporates an unknown
parameter called the error variable, ε
ª it accounts for all measurable and immeasurable variables that are not
part of the model
6 / 28
Simple linear regression model
i.e. First Order model
y = β0 + β1x + ε
where
y = dependent variable
x = independent variable
β = coefficients
β0 = y-intercept
β1 = slope of the line (rise/run) or (∆Y/∆X)
ε = error variable
ª Coefficients are population parameters, which need to be estimated
ª The assumption is that the errors are normally distributed
7 / 28
Expected values and variance for y
The expected value of y is a linear function of x, and y differs
from its expected value by a random amount
ª linear regression is a probabilistic model
For x∗ = a particular value of x:
E(y | x∗) = µy|x∗  (mean)
V(y | x∗) = σ²y|x∗  (variance)
8 / 28
Estimating the Coefficients
We estimate the coefficients as we estimated population
parameters
That is, we draw a random sample from the population and
calculate sample statistics
But here the coefficients are part of a straight line, and we need
to estimate the line that represents ‘best’ the sample data points
Least squares line
ˆy = b0 + b1x
 here b0 = y-intercept, b1 = slope, and ˆy is the fitted value of y
9 / 28
Least squares method
cf. chap. 4 in Keller
The least squares method is an objective procedure to obtain a
straight line, where the sum of squared deviations between the
points and the line is minimized:
Σi (yi − ŷi)²
The least squares line coefficients
b1 = sxy / s²x
b0 = ȳ − b1x̄
10 / 28
Least squares line coefficients
For b1 and b0:
sxy = Σi (xi − x̄)(yi − ȳ) / (n − 1)
s²x = Σi (xi − x̄)² / (n − 1)
x̄ = Σi xi / n
ȳ = Σi yi / n
11 / 28
Least squares line coefficients
This actually means that the values of ˆy on average come
closest to the observed values of y
There are shortcut formulas for b1 (check sample variance p.
110, and sample covariance p. 127)
b0 and b1 are unbiased estimators of β0 and β1
12 / 28
EXAMPLE 16.1
Annual Bonus and Years of Experience
 Determine the straight-line relationship between annual bonus and
years of experience
13 / 28
Working with SPSS
In SPSS we distinguish two main working windows:
1) Data Editor, where the raw data and variables are
displayed
2) Statistics Viewer, where scripts and reports are provided
Both windows have: MENU SUBMENU ... COMMAND
 Each command corresponds to a function that bears one or
several ARGUMENTS
14 / 28
Working with SPSS
Command-line like
It is also possible to work directly with the functions
Example of the script for a regression:
REGRESSION
/DEPENDENT dependent-variable
/ENTER List-of.independents.
 SPSS distinguishes between COMMANDS, FILES, VARIABLES, and
TRANSFORMATION EXPRESSIONS
15 / 28
Data Editor in SPSS
Analyze Regression Linear
16 / 28
Report in SPSS
GET
FILE='C:auspssxm16-01.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Bonus
/METHOD=ENTER Years.
Regression
Notes
Output Created: 06-MAR-2014 12:45:25
Comments: (none)
Input: Data: C:auspssxm16-01.sav; Active Dataset: DataSet1; Filter: none; Weight: none; Split File: none; N of Rows in Working Data File: 6
Missing Value Handling: Definition of Missing: User-defined missing values are treated as missing. Cases Used: Statistics are based on cases with no missing values for any variable used.
Syntax: REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT Bonus /METHOD=ENTER Years.
17 / 28
Regression Report in SPSS
Variables Entered/Removed
Model 1: Variables Entered: Years; Variables Removed: (none); Method: Enter
a. Dependent Variable: Bonus
b. All requested variables entered.

Model Summary
Model 1: R = ,701; R Square = ,491; Adjusted R Square = ,364; Std. Error of the Estimate = 4,503
a. Predictors: (Constant), Years

ANOVA
Model 1: Regression: Sum of Squares = 78,229; df = 1; Mean Square = 78,229; F = 3,858; Sig. = ,121
Residual: Sum of Squares = 81,105; df = 4; Mean Square = 20,276
Total: Sum of Squares = 159,333; df = 5
a. Dependent Variable: Bonus
b. Predictors: (Constant), Years

Coefficients
Model 1: (Constant): B = ,933; Std. Error = 4,192; t = ,223; Sig. = ,835
Years: B = 2,114; Std. Error = 1,076; Beta = ,701; t = 1,964; Sig. = ,121
a. Dependent Variable: Bonus
18 / 28
Regression Plot from SPSS
Graphs Legacy Dialogs Scatter/Dot... Simple Scatter
[Scatter plot of Bonus against Years with the fitted line y = 0,93 + 2,11·x; R² Linear = 0,491]
19 / 28
Calculation of Residuals
The deviations of the actual data points to the line are the
residuals, which represent observations of ε:
ei = yi − ŷi
In this case the sum of squares for error (SSE) represents the
minimized sum of squared deviations
ª basis for other statistics to assess how well the linear model fits the data
 The standard error of the estimate is the square root of the
ratio of SSE to its degrees of freedom (n − 2)
ª Remember that in SPSS the value of SSE (the sum of squared residuals) is given
in the Anova table of the regression report
20 / 28
Annual bonus and years of experience
[Scatter plot of bonus against years of experience with the fitted regression line]
21 / 28
Annual bonus and years of experience: Residuals
[Same scatter plot showing the residuals of the six observations: 2.9524, −4.1619, 1.7238, −4.3905, 5.4952, −1.619]
22 / 28
Regression examples
Finance/economy:
– The enterprise equity value and total sales
– Number of VP executives and total assets
– Quantity of new houses and amount of jobs created in a city
– Amount of bananas harvested and the density of banana trees per km²
Social/health:
– Number of violent crimes and the poverty rate
– Amount of infectious diseases and population growth
– Amount of diseases from chronic illnesses and urbanization level
– Number of kids raised and the number of spouses
24 / 28
Regression examples
Miscellaneous:
– IQ score development and the average global temperature per year
– If a horse can run X mph, how fast will his offspring run?
– Number of cigarettes smoked and number of chats having with people
– Number of cigarettes smoked and time at the hospital
ª (more politically correct!)
That is, questions like:
– For any set of values on an independent variable, what is my predicted
value of a dependent variable?
– If an independent variable increases by one unit, how does the dependent
variable change?
25 / 28
EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
 We do examples with random numbers...
26 / 28
Generating Random Numbers in SPSS
Variable View:
Create two variables for integers
Data View:
Choose number of observations in each variable
Transform Compute Variable
Arguments:
Variable names in Target Variable, and Random Numbers
in Function group
 Choose uniform rv and establish the range of the obs. values
27 / 28
Summary
Simple linear regression analysis is for the relationship
between two interval variables
The assumption is that the variables are linearly connected
The intercept and the slope of the regression line are the
coefficients to be estimated
The least squares method produces estimates of these
population parameters
28 / 28
BUSINESS STATISTICS II
Lecture – Week 12
Antonio Rivero Ostoic
School of Business and Social Sciences
 March 
AARHUS
UNIVERSITYAU
Today’s Outline
Review simple linear regression analysis
Error variable in regression
Model Assessment
– standard error of estimate
– testing the slope
– coefficient of determination
– other measures
2 / 26
Review Simple Linear Regression Analysis
Simple regression analysis serves to predict the value of a
variable from the value of another variable
A linear regression model describes the variability of the data
around the regression line
The observations on the dependent variable y are a linear function
of the observations on the independent variable x
The population parameters are expressed in two coefficients,
the y-intercept and the slope of the line, which need to be
estimated, plus a stochastic part
ª y-intercept: the value of y when x equals 0
ª slope: the change in y for one-unit increase in x
3 / 26
The Error Variable
Remember that in probabilistic models we need to account for
unknown and unmeasurable variables that represent noise or error
The error variable is critical in estimating the regression coefficients
– to establish whether there is a relationship between the dependent
and independent variables via an inferential method
– to estimate and predict through a regression equation
Errors are independent of each other, and the error variable is normally
distributed with mean 0 and standard deviation σ
ª This is expressed as ε ∼ N(0, σ)
4 / 26
Expected values of y
The dependent variable can be considered as a random
variable normally distributed with expected values
E(y) = β0 + β1x (mean)
σ(y) = σ (standard deviation)
Thus the mean of y depends on the value of the independent
variable, whereas its standard deviation does not
 shape of the distribution remains, but E(y) changes according to x
5 / 26
Experimental data and Observations
We have been typically working with examples based on observations
However it is also possible to perform a controlled trial where we
generate experimental data
Regression analysis works with both types of data, since the main
goal is to determine how the IV is related to the DV
For observations both variables are random, and their joint probability is
characterized by the bivariate normal distribution
ª here the z dimension is a joint density function of the two variables
These types of normality conditions are assumptions for the
estimations in a simple linear regression model
6 / 26
Assessing the Model
We use the least squares method to produce the best straight line
But a straight line may not be the best representation of the data
We need to assess how well the linear model fits the data
Methods to assess the model:
– standard error of estimate
– the t-test of the slope
– the coefficient of determination
 all based on the SSE
7 / 26
Standard error of estimate
Recall the error variable assumptions: ε ∼ N(0, σ)
And the model is considered poor if σ is large, and it is considered
perfect when the value is 0
Unfortunately we do not know this parameter, and we need to
estimate σ from the sample data
The estimation is based on the sum of squares for error (SSE)
ª which is the minimized sum of squared deviations between the points and the
regression line
SSE = Σi (yi − ŷi)² = (n − 1) ( s²y − s²xy / s²x )
8 / 26
Standard error of estimate
The standard error of estimate is the approximation of the
conditional standard deviation of the dependent variable
ª that is, the square root of the residual sum of squares divided by the
number of degrees of freedom
s = √( SSE / (n − 2) )
This is the square root of s², which in fact is the MSE
ª the df is actually the number of cases − the number of unknown parameters
IN THE SPSS REPORT:
The value for s is given in the Model Summary table for a linear
regression analysis
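ª As a check with Example 16.1: SSE = 81.105 and n = 6, so s = √(81.105/4) = √20.276 ≈ 4.503,
the Std. Error of the Estimate reported in the Model Summary table above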
9 / 26
Testing the slope
In this case we test whether or not the dependent
variable is linearly related to the independent
variable
ª this means that no matter what value x has, we would obtain
the same value for ˆy
In other words, the slope of the line represented by β1
equals zero, and this corresponds to a horizontal line
in the plot
10 / 26
Testing the slope: Uniform distribution with β1 = 0
[Scatter plot of y against x: a uniform random cloud of points for which the fitted line is horizontal, β1 = 0]
11 / 26
Testing the slope
If our null hypothesis is that there is no linear relationship between
the dependent and independent variables, then we specify
H0 : β1 = 0
H1 : β1 ≠ 0 (two-tail test)
If we do not reject H0, we either committed a Type II error (wrongly
accepting the null hypothesis), or there is not much of a ‘linear’ relationship
between the independent variable and the dependent variable
 However the relationship can be quadratic, which corresponds to
a polynomial regression
ª In case we want to check for a positive (β1 > 0) or a negative (β1 < 0)
linear relationship between the IV and DV, then we perform a one-tail test
12 / 26
Quadratic relationship with β1 = 0
[Plot of y against x: a parabola-shaped relationship for which the fitted straight line is horizontal, β1 = 0]
 a quadratic model: y = β0 + β1x + β2x² + ε
13 / 26
Estimator and sampling distribution
For drawing inferences, b1 is an unbiased estimator of β1
E(b1) = β1
with an estimated SE
sb1 = s / √( (n − 1) s²x )
that is based on the sample variance of x
14 / 26
Estimator and sampling distribution
If ε ∼ N(0, σ) with values independent of each other, then we
use the Student t sampling distribution
Test statistic for β1
t = (b1 − β1) / sb1
Thus the t-statistic values are the ratio of the coefficients to their SE
IN THE SPSS REPORT:
The t-statistic values are given in the Coefficients table of the linear
regression analysis
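ª As a check with Example 16.1: under H0 : β1 = 0, t = b1/sb1 = 2.114/1.076 ≈ 1.964,
the t value reported for Years in the Coefficients table above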
15 / 26
Estimator and sampling distribution
Confidence interval estimator of β1
b1 ± tα/2 · sb1
Test statistics and confidence interval estimators are for a
Student t distribution with v = n − 2
IN SPSS:
Confidence intervals are available under line Properties in the graph Chart Editor
16 / 26
Coefficient of Determination
To measure the strength of the linear relationship we use the
coefficient of determination, R2
ª useful to compare different models
R² = s²xy / (s²x s²y)
This is equal to
R² = 1 − SSE / Σi (yi − ȳ)²
17 / 26
Partitioning deviations in Example 16.1 i = 5
[Scatter plot of bonus against years with the fitted regression line, highlighting observation i = 5]
18 / 26
Partitioning deviations in Example 16.1 i = 5
[Same plot with reference lines at x̄ = 3.5 and ȳ = 8.33; for observation i = 5: xi = 5, yi = 17, ŷi = 11.504]
19 / 26
Partitioning deviations in Example 16.1 i = 5
[Same plot showing, for observation i = 5, the deviations yi − ŷi, ŷi − ȳ, yi − ȳ and xi − x̄]
20 / 26
Partitioning deviations in Example 16.1 i = 2
[Same plot for observation i = 2: xi = 2, yi = 1, ŷi = 5.162]
21 / 26
Partitioning the deviations
(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)
The difference between yi and ȳ is a measure of the variation in the
dependent variable, and it equals:
a) the difference between ŷi and ȳ, which is accounted for by the difference
between xi and x̄
ª the variation in the DV is explained by the changes of the IV
b) and the difference between yi and ŷi, which represents an unexplained
variation in y
 If we square all parts of the equation, and sum over all sample points,
we end up with a statistic for the variation in y
total SS = explained SS + residual SS
ª i.e. sum of squares for regression (SSR) and the sum of squares for error (SSE)
22 / 26
Coefficient of Determination
R² = 1 − SSE / Σ(yi − ȳ)²
   = [ Σ(yi − ȳ)² − SSE ] / Σ(yi − ȳ)²
   = [ SS(Total) − SSE ] / SS(Total)
 This is the proportion of variation explained by the regression model,
which is the proportion of variation in y explained by x
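ª As a check with Example 16.1: R² = 1 − SSE/SS(Total) = 1 − 81.105/159.333 ≈ .491,
the R Square value shown in the Model Summary table above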
IN THE SPSS REPORT:
R² is given in the Model Summary table of the regression analysis
23 / 26
Other measures to assess the model
Correlation coefficient
r = sxy / (sx sy)
We use a t-test for H0 : ρ = 0
t = r √( (n − 2) / (1 − r²) )
which is t distributed with v = n − 2, with the variables bivariate normally distributed
Calculate r in SPSS
Analyze Correlate Bivariate (select variables and choose Pearson)
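The equivalent command syntax is a one-liner; a sketch using the Example 16.1 variable names:
CORRELATIONS
/VARIABLES=Years Bonus
/PRINT=TWOTAIL NOSIG.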
24 / 26
Other measures to assess the model
F-test
F = MSR / MSE
for MSR = SSR/1 and MSE = SSE/(n − 2)
 This statistic is to test H0 : β1 = 0
IN THE SPSS REPORT:
• F-statistic value is given in the Anova table
• Value of r is in the Model Summary table, whereas the t statistics is
given in the table for the Coefficients in the regression analysis
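ª As a check with Example 16.1: F = MSR/MSE = 78.229/20.276 ≈ 3.858, the F value in the
Anova table above, which also equals t² = 1.964² up to rounding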
25 / 26
Summary
The error variable corresponds to the probabilistic part of the
regression model
ª independent values that are normally distributed with mean 0 and sd σ
The standard error of estimate serves to evaluate the
regression model by assessing the conditional standard
deviation of the dependent variable
By testing the slope we can check whether there is a linear
relationship or not between the independent and the
dependent variables
The coefficient of determination measures the strength of the
linear relationship in the regression model
26 / 26
BUSINESS STATISTICS II
Lecture – Week 13
Antonio Rivero Ostoic
School of Business and Social Sciences
 March 
AARHUS
UNIVERSITYAU
Today’s Outline
The equation of the regression model
Regression diagnostics
2 / 31
Regression Equation
The regression equation represents the model, where the dependent
variable is the response of an independent explanatory variable
ª the model stands for the entire population
After assessing the model, our next task is to estimate and predict
the values of the dependent variable
In this case we differentiate estimating the average response of the dependent
variable from predicting the value of the dependent variable for a new
observation of the independent variable
3 / 31
Estimating a mean value and predicting an individual value
If a linear model such as
y = β0 + β1x
is considered satisfactory for the data, then
ˆy = b0 + b1x
will represent the sample equation for the estimation of the
model
ª (Here we predict the error term to be 0)
4 / 31
Estimating a mean value and predicting an individual value
For x∗ representing a specific value of the independent variable:
ˆy = b0 + b1x∗
– is the point prediction of an individual value of the dependent
variable when the value of the independent variable is x∗
– is the point estimate of the mean value of the dependent
variable when the value of the independent variable is x∗
5 / 31
Interval estimators
A small p-value for H0 : β1 = 0 suggests a nonzero slope in
the regression line
However, for a better judgment we need to see how closely
the predicted value matches the true value of y
There are two interval estimators:
a) Prediction interval that predicts y for a given value of x
b) Confidence interval estimator that estimates the mean
of y for a given value of x
6 / 31
Prediction interval
individual intervals
ª Used if we want to predict a one-time occurrence for a particular value of y when x
has a given value
For ŷ = b0 + b1xg the prediction interval is
ŷ ± tα/2,n−2 · s √( 1 + 1/n + (xg − x̄)² / ((n − 1) s²x) )
where xg is the given value of the independent variable
 Another way to express this CI is x∗ → ŷ∗, which implies that for x∗ that is
a new value of x (or for a tested value of x) the prediction interval for ŷ∗ is
ŷ∗ ± tα/2,n−2 · √MSE · √( 1 + 1/n + (x∗ − x̄)² / sxx )
7 / 31
Confidence interval estimator
the average prediction interval
For E(y) = β0 + β1x (i.e. for the mean of the dependent
variable) the confidence interval estimator is
ŷ ± tα/2,n−2 · s √( 1/n + (xg − x̄)² / ((n − 1) s²x) )
 That is, for x∗ → ŷ∗, the mean prediction interval for ŷ∗ is
ŷ∗ ± tα/2,n−2 · √MSE · √( 1/n + (x∗ − x̄)² / sxx )
ª where MSE equals σ̂², whereas sxx is the unnormalized form of V(X)
8 / 31
EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Data generation in SPSS
• Choose your DV and IV, and number of observations. Then generate
uniform random numbers:
Transform Compute Variable...
• Variable names in Target Variable , and Random Numbers in Function group
• Select Rv.Uniform in Functions and Special Variables , and then establish the
range of the observation values in Numeric Expression
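A command-syntax sketch of the same data generation (assuming the empty cases already exist in the Data Editor, and using hypothetical variable names x and y with values between 1 and 100):
COMPUTE x = RV.UNIFORM(1,100).
COMPUTE y = RV.UNIFORM(1,100).
EXECUTE.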
9 / 31
EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Confidence intervals of the regression model in SPSS
• We perform the linear regression analysis
Analyze Regression Linear
• Individual confidence intervals are given in this command, where under the
Save button we select in Prediction Intervals
– the Individual option for the Prediction Interval
– the Mean option for the Confidence Interval Estimator
Both at the usual 95% value
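The same intervals can be requested with command syntax; a sketch (DV and IV are hypothetical names for the dependent and independent variables):
REGRESSION
/DEPENDENT DV
/METHOD=ENTER IV
/SAVE MCIN ICIN.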
10 / 31
EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Confidence intervals of the regression model in SPSS (2)
• Since we have chosen Save , the confidence interval values are saved in
the Data Editor
ª here LMCI [UMCI] and LICI [UICI] stand respectively for Lower
[Upper] Mean and Individual Confidence Interval
 The Variable View in the Data Editor gives the labels of the new variables
11 / 31
EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Visualizing confidence intervals in SPSS
• The visualization of both types of confidence intervals is possible after we
have plotted the variables
Graphs Legacy Dialogs Scatter/Dot... Simple Scatter
• From Elements Fit Line at Total of the graph Chart Editor, we look in the
tab Fit Line (Properties) for the options Mean and Individual in the
Confidence Intervals section for the two CI estimators
12 / 31
Confidence bands from SPSS
Example 16.2 in Keller
[Scatter plot of Price against Odometer (Example 16.2) with the fitted line y = 17,25 − 0,07·x, R² Linear = 0,648, and the mean and individual confidence bands]
13 / 31
EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Predict new observations in SPSS
• To forecast new observations, first we need to put the value in the
dependent variable of the Data Editor
• Then we choose a linear regression analysis
Analyze Regression Linear
• And, after we press the Save button, we select the Unstandardized
option in Predicted Values
14 / 31
Regression Diagnostics
Here we are concerned with evaluating the prediction model that
includes some error or noise
ei = yi − ˆyi
 thus the residual equals each observation minus its estimated
value
Recall that in regression analysis there are some assumptions made
for the error variable
ª errors are independent of each other and normally distributed, and
hence with a constant variance
15 / 31
Regression Diagnostics
A regression diagnostics checks for two things:
a) whether or not the conditions for the error are fulfilled
b) for the unusual observations (those that fall far from the
regression line), to determine whether or not these
values result from a fault in the sampling
 we look at several diagnostic methods for unwanted conditions
16 / 31
Residual analysis
Residual analysis focuses on the differences between the
observations and the predictions made in the linear model
Residual Analysis in SPSS
Residual analysis is based on standardized and unstandardized residuals
• After choosing linear regression analysis
Analyze Regression Linear
• When we press the Save button, we select the Standardized and
Unstandardized options under Residuals
ª Recall that these values are recorded in the Data View of the Data Editor
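A syntax sketch of the same step (DV and IV are hypothetical variable names):
REGRESSION
/DEPENDENT DV
/METHOD=ENTER IV
/SAVE PRED RESID ZRESID.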
17 / 31
Nonnormality
The nonnormality check of the error variable is made by
visualizing the distribution of the residuals
ª we use the histogram for this
Nonnormality in SPSS
The histogram of residuals is obtained from
Graphs Legacy Dialogs Histogram...
• And we choose RES (which corresponds to the unstandardized
residuals) for the Variable option
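The corresponding syntax sketch (RES_1 is the default name SPSS gives to the first saved unstandardized residual; the NORMAL keyword overlays the distribution curve discussed on the next slide):
GRAPH
/HISTOGRAM(NORMAL)=RES_1.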
18 / 31
Nonnormality
Nonnormality in SPSS (2)
It is also possible to obtain the distribution shape in the histogram
• In the Chart Editor we go to
Elements Show Distribution
and choose Normal
19 / 31
Heteroscedasticity
Heteroscedasticity (or heteroskedasticity) is the term used when
the assumption of equal variance of the error variable is violated
ª homoscedasticity has the opposite implication, meaning ‘homogeneity of
variance’
To test the heterogeneity of variance in the error variable we can
plot the residuals against the predicted values of the DV
ª then we look for the spreading of the points; if the variation in ei = yi − ˆyi
increases as yi increases, the errors are called heteroscedastic
This type of graph is sometimes called the ei − ˆyi plot
20 / 31
Heteroscedasticity
Heteroscedasticity in SPSS
The heteroskedasticity condition is evaluated by the ei − ŷi plot
Graphs Legacy Dialogs Scatter/Dot... Simple Scatter
• And choosing RES (the unstandardized residuals) for the Y-axis,
and PRE (the predicted values) for the X-axis
• For the mean line of the residuals in the plot we go to the Chart Editor (by
double-clicking the graph in the report) and in
Options Y Axis Reference Line
• Select the Mean option in the Reference Line tab of Properties
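A syntax sketch of the ei − ŷi plot, using the saved variables PRE_1 (unstandardized predicted values) and RES_1 (unstandardized residuals):
GRAPH
/SCATTERPLOT(BIVAR)=PRE_1 WITH RES_1.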
21 / 31
Nonindependence of the Error variable
The nonindependence of the errors means that the residuals are
autocorrelated, i.e. correlated over time
To detect autocorrelation we can plot the residuals in a time period
and look for alternating or increment patterns
ª If no clear pattern appears in the plot then there is an indication that the
residuals are independent of each other
Alternatively, to detect lack of independence between errors
without plotting them over time, we can perform the Durbin-Watson test
ª where the null hypothesis is that no correlation exists, whereas the alternative
hypothesis is that a correlation exists; i.e. H0 : ρ = 0, and H1 : ρ ≠ 0
 we look at this test in multiple regression analysis...
22 / 31
Nonindependence of the Error variable
Nonindependence of the error variable in SPSS
We now create a time variable in the EXAMPLE-DO-IT-YOUR-SELF,
and then index the observations with a vector sequence
Transform Compute Variable...
• Index (time) variable in Target Variable , and the Miscellaneous
option in Function group
• Select $Casenum in Functions and Special Variables
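In syntax the index variable can be created directly (a sketch; “time” is a hypothetical name):
COMPUTE time = $CASENUM.
EXECUTE.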
23 / 31
Nonindependence of the Error variable
Nonindependence of the error variable in SPSS (2)
After obtaining the unstandardized residuals, we plot these values...
Graphs Legacy Dialogs Line... Simple
• We select the Mean of the unstandardized residuals in the
Line Represents option, and the time variable in Category Axis
 If we go to the Chart Editor we obtain the expected mean in
Options Y Axis Reference Line
24 / 31
Outliers
Outliers are unusual (small or large) observations in the sample,
which lie far away from the regression line
These points may suggest: an error in the sampling, a recording
mistake, an unusual observation
ª we should disregard the observation in case of one of the first two possibilities
To detect outliers:
– we use scatter diagrams of the IV and DV with the
regression line
– we check the standardized residuals where absolute values
larger than 2 may suggest an outlier
25 / 31
Outliers
Detection of outliers in SPSS
First we get the standardized residuals when choosing linear
regression analysis
Analyze Regression Linear
In the Save button we select the Standardized option in Residuals
Then we obtain the absolute values of this variable
• ZRE 1 in Target Variable , and choose Arithmetic in Function group
• Select Abs in Functions and Special Variables and put this variable code in
the parentheses
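A syntax sketch of the same computation, writing the absolute values to a new hypothetical variable ABSZRE instead of overwriting ZRE_1:
COMPUTE ABSZRE = ABS(ZRE_1).
EXECUTE.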
26 / 31
Influential Observations
We use scatter diagrams of the IV and DV with the regression
line as well to evaluate the impact of influential observations
ª we produce two plots, one with and another without the supposed influential obs.
Optionally, to detect influential observation we can use different
measures as well:
Leverage describes the influence each observed value has
on the fitted value for this observation
ª where Mahalanobis distance is a measure of leverage of the observation
Cook’s D (distance) detects dominant observations, either
outliers or observations with high leverage
ª an Influence plot is made of the Studentized Residuals (ei/SE) against
the leverages of the observations (called ‘hat’ values)
27 / 31
Cook’s Distance
Example 16.2 in Keller
[Index plot of Cook's distance for the observations in Example 16.2; observations 19, 74 and 86 stand out]
28 / 31
Influence plot (example 16.2 in Keller)
Areas of the circles are proportional to Cook’s distances
[Influence plot: Studentized residuals against hat-values, with circle areas proportional to Cook's distances; observations 8 and 19 stand out]
29 / 31
Other aspects in Regression Diagnostics
• In the validation of linear model assumptions, we can also evaluate
the skewness and kurtosis of the distribution of the residuals...
• The prediction capability of the model can be assessed by looking
at the predicted SSE as well
 (in multiple regression we also look at the collinearity among IVs)
30 / 31
Summary
For a given explanatory variable, we differentiate the individual
value of the response variable from its mean value
Point estimation provides individual prediction intervals of the DV,
and confidence interval estimator approximates the mean of the
response variable
Regression diagnostics is concerned with evaluating the prediction
model and the assumptions of the error variable
We look at the dominant points inducing the regression line for
assessing the prediction model, whereas much of the diagnostics
concentrates on the characteristics of the residuals
31 / 31
BUSINESS STATISTICS II
Lecture – Week 14
Antonio Rivero Ostoic
School of Business and Social Sciences
1 April 2014
AARHUS
UNIVERSITYAU
Today’s Outline
Scaling and transformations
Standard error of estimates and standardized values
Step-by-step example with simple linear regression analysis
2 / 24
Scaling and transformations
Sometimes data transformation is needed in order to obtain
e.g. a normal distribution
Transformations are mathematical adjustments applied to
scores in an attempt to make the distribution of the
outcomes fit requirements
Scaling (and re-scaling) is a linear transformation based on
proportions where the scores are enlarged or reduced
3 / 24
Data transformation
In a simple linear regression analysis we can perform a transformation
of both the explanatory and the response variables
For example in linear regression we may need to transform the data:
– when the residuals have a skewed distribution or they show
heteroscedasticity
– to linearize the relationship between the IV and the DV
– but also when the theory suggests a transformed expression
– or to simplify the model in a multiple regression model
4 / 24
Scaling and transformations
Examples of transformations of the variable x are:
– Square root: √x
– Reciprocal: 1/x
– Natural log: ln(x) or log(x)
– Log 10: log10(x)
In linear regression we use least squares fitting
ª this transformation allows the residuals to be treated as a continuous
differentiable quantity
5 / 24
Logarithmic transformations
linear regression analysis
Model       Transformation              Regression equation
Linear      None                        y = β0 + β1x
Linear-log  x = log(x)                  y = β0 + β1 log(x)
Log-linear  y = log(y)                  log(y) = β0 + β1x
Log-log     x = log(x), y = log(y)      log(y) = β0 + β1 log(x)
ª log are natural logarithms with base e ≈ 2.72
ª The term ‘level’ is also used instead of ‘linear’ in logarithmic transformations
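In SPSS the transformed variables can be created with COMPUTE (a sketch with hypothetical variable names x and y; LN gives natural logarithms, LG10 base-10 logarithms):
COMPUTE logx = LN(x).
COMPUTE logy = LN(y).
EXECUTE.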
6 / 24
Logarithmic transformations
linear regression analysis
Model       Interpretation
Linear      A one unit increase in x would lead to a β1 increase/decrease in y
Linear-log  A one percent increase in x would lead to a β1/100 increase/decrease in y
Log-linear  A one unit increase in x would lead to a β1 · 100% increase/decrease in y
Log-log     A one percent increase in x would lead to a β1% increase/decrease in y
ª In econometrics, log-log relationships are referred as “elastic” and the
coefficient of log(x) as the elasticity
7 / 24
Standard Error of Estimates
SE = the square root of the sum of squared differences between the criterion’s
predicted and observed values, divided by the df
The squared differences between criterion’s predicted and observed
values corresponds to the Residual SS (SSE in Anova)
ª it represents the unexplained variation in the model (or model deviance)
The df equals number of cases − number of predictors in the model − 1
ª in a simple linear regression model there is only one predictor, and df equal n − 2
Thus most of the calculation for the SE of estimates corresponds to
the Residual SS
8 / 24
SE and Residual SS
SSE in SPSS
After having the data, to obtain the SSE we need first the predicted
values of our model
Analyze Regression Linear
• And in Save choose the Unstandardized option in Predicted
Values
9 / 24
SE and Residual SS
SSE in SPSS (2)
Then we calculate by hand the residuals (yi − ˆyi) in a new variable
created in the Variable View. We name this variable as RESID
• Then we go to
Transform Compute Variable...
and place RESID in Target Variable , and make the subtraction
operation with the expression:
DV − PRE_1
10 / 24
SE and Residual SS
SSE in SPSS (3)
The next step is to obtain the square of the residuals, and we use the
recently created variable (named RESID) for this.
Thus the transformation of the residual values to their squares is
obtained after we place RESID in Target Variable and type in the
Numeric Expression field the square of the values:
RESID ** 2
11 / 24
SE and Residual SS
SSE in SPSS (4)
The sum of squares of the residuals, which is the numerator of the
SE, is obtained when we sum the values of this last variable
Analyze Reports Report Summaries in Columns...
and choose RESID for the Data Columns and select Display
grand total in Options . The Residual SS or SSE is given in the
Report of the Statistics Viewer as Grand Total.
ª in SPSS the SE of estimates is given in Model Summary, and the SSE and
df values are in the ANOVA table
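An alternative syntax sketch for the whole SSE calculation (DV and IV are hypothetical variable names; PRE_1 is the saved predicted value, and the SUM statistic gives the Residual SS directly):
REGRESSION
/DEPENDENT DV
/METHOD=ENTER IV
/SAVE PRED.
COMPUTE RESID2 = (DV - PRE_1)**2.
EXECUTE.
DESCRIPTIVES VARIABLES=RESID2
/STATISTICS=SUM.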
12 / 24
Standardized values
Standardized values have been transformed into a customary scale
Standardized Coefficient
In linear regression the standardized coefficient is the product
of the regression coefficient and the ratio of the standard
deviation of the IV to that of the DV
That is, Beta (in SPSS) equals B ∗ (s(x)/s(y))
The standardized coefficient represents the change in the
mean of the dependent variable, in y standard deviations, for a
one standard deviation increase in the independent variable
13 / 24
Standardized values
Standardized Residuals
In SPSS there are various types of residuals:
– RES_1 stands for unstandardized residuals
– SRE_1 stands for Studentized residuals
– ZRE_1 stands for standardized residuals
And Keller (p. 653) tells us about the standardization of
variables in general and of the residuals in particular
ª subtract the mean and divide by the standard deviation
14 / 24
Standardized residuals
We get the Excel output table with the standardized residuals
for Example 16.2 (Keller, p. 653)
Now let us look at the SPSS results for this data...
? Hmmmmmmmmmmmmmm.... ?
15 / 24
Standardized residuals
The term ‘standardized residual’ is not a standardized term
In Keller, “standardized” residuals are residuals divided by the
standard error of the estimate (residual) (cf. p. 653)
However in SPSS these values (cf. Excel output p. 653)
correspond to the “Studentized” residuals
ª (even though the definition is for the Studentized deleted residuals)
In SPSS a standardized residual is the residual divided by the
standard deviation of data
ª Studentized residuals (another form for standardization) have a constant variance,
and combine the magnitude of the residual and the measure of influence
16 / 24
Standardized residuals
speaking the same language
Residuals (unstandardized) are the difference between
observations and expected values:
ε̂ = y − ŷ
In the case of a regression model standardized residuals are
normalized to a unit variance
The standard deviation or the square root of the variance of the
residuals corresponds to the sqrt of MSE (cf. lec. week 12)
ª this is also known as the root-mean-square deviation
Standardized residual = residual / √MSE
17 / 24
Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
Be aware that in this case the model is chosen in advance, and
we adopt a linear relationship between two variables
18 / 24
Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
1. Determine the response and the explanatory variables
2. Visualize the data through a scatter plot
3. Perform basic descriptive statistics
19 / 24
Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
4. Estimate the coefficients (intercept and slope)
5. Compute the fitted values and the residuals
6. Obtain the sum of squares for errors (Residual SS)
20 / 24
Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
7. Estimate the coefficients (intercept and slope)
a) standard error of estimate
b) test of the slope
c) coefficient of determination
21 / 24
Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
8. Perform the regression diagnostics
a) confidence regions for individual prediction intervals
b) confidence regions for the average prediction interval
9. Make a residual analysis
a) nonnormality, heteroskedasticity, nonindependence errors
22 / 24
Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
10. Detect outliers and influential observations
11. Interpret the results
12. Draw the conclusions
23 / 24
BUSINESS STATISTICS II
Lecture – Week 15
Antonio Rivero Ostoic
School of Business and Social Sciences
 April 
AARHUS
UNIVERSITYAU
Today’s Outline
Multiple regression model
• coefficients • estimation • conditions • testing • diagnostics
Working example
(SE estimates, and fitting the model with logarithmic transformations)
2 / 17
Multiple regression model
While a simple regression analysis has a single independent variable,
in a multiple regression analysis we have several explanatory
variables for the response variable
A multiple regression model is represented by the equation
y = β0 + β1x1 + β2x2 + · · · + βkxk + ε
where y is the dependent variable, x1, x2, . . . , xk are independent variables,
and ε is the error variable
ª note that independent variables may be product of transformations from other
variables (which are independent or not)
In this case parameters β1, β2, . . . , βk are the regression coefficients,
whereas β0 represents the intercept
3 / 17
Multiple regression model
It is important to note that the introduced multiple regression
equation represents in this case an additive model
Thus the effect of each independent variable on the response is
assumed to be the same for all values of the other predictors
ª certainly we need to assess whether the additive assumption is realistic or not
Q. Are we still considering a linear relationship in the multiple
regression model?
A. Yes, whenever the model is linear in the coefficients
4 / 17
Graphical representation
Multiple regression models are graphically represented by a
hyperplane with k dimensions for IVs
– for k = 2 the relationships between the IVs and the DV is
represented by a regression plane within a 3D space
– for k > 2 the model is represented by a regression or
response surface, a hyperplane (of more than two dimensions)
that is not possible for us to visualize
5 / 17
Interpreting Coefficients
In the multiple regression model β0 stands for the intercept of
the regression hyperplane, and represents the mean of y when
x’s equal 0
ª it makes only sense if the range of the data includes zero
βi, i = 1, . . . , k represent the change in the DV when xi changes
one unit while keeping the other IVs constant
When it is possible, interpret the regression coefficients as the
ceteris paribus effect of their variation on the dependent variable
ª i.e. “other things being equal” interpretation
6 / 17
Estimation
The estimation of the coefficients is given by the least squares
equation
ˆy = b0 + b1x1 + b2x2 + · · · + bkxk
for k independent variables
And the error variable is estimated as
ei = yi − ŷi
7 / 17
Required conditions
The required conditions of the error variable assumed in
a simple linear regression model remain for multiple
regression analysis
ª that is errors are independent, normally distributed with mean 0
and a constant σ
The standard error of the estimate has less df than in the
simple regression analysis
ª we want SE close to zero
8 / 17
Testing the regression model
We test the validity of the model with the following hypotheses
H0 : β1 = β2 = · · · = βk = 0
H1 : βi ≠ 0 for at least one i
ª The model is invalid in case we fail to reject the null hypothesis, whereas
whenever the alternative hypothesis is accepted then the model has some validity
 Since in multiple regression models we count with several competing
explanatory variables for a response variable, then the assessment of
the model is central in the analysis
9 / 17
Testing the regression model
The test of significance of the model is based on the F statistics,
which means that we focus on the variation of the outcomes
The F-test is the proportion of the Mean Squares of Regression
and Residual
F = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE
Recall that SSR represents the explained variation in the model,
whereas SSE is the unexplained variation
ª we want a high value for SSR and a low value of SSE, since this indicates
that most of the variation in the response variable is explained by the model
10 / 17
Testing the regression model
For the F-test the rejection of H0 applies when
F > Fα, k, n−k−1
ª hence for a given α level we infer difference in the regression coefficients
in case the F statistic value falls within the rejection region
Another way to assess the model is through the coefficient of
determination or R2, which interpretation is similar to the simple
regression analysis
ª we want R² close to one
11 / 17
Test of individual coefficients
Based on the test of significance of the multiple regression model
we can perform individual t tests for each regression coefficient
H0 : βi = 0
H1 : βi ≠ 0 (two-tail test)
The test statistic is
t = (bi − βi) / sbi
12 / 17
Test of individual coefficients
And the confidence intervals are
bi ± tα/2, n−k−1 · sbi
for i = 1, . . . , k
We reject the null hypothesis iff
|t| > tα/2, n−k−1
(for a two-tailed test)
13 / 17
Adjusted R-squared
When we add explanatory variables to the multiple regression model
we cannot decrease the value of the coefficient of determination
ª but it is possible to get a very high R2
even when the true model is not linear
Thus the adjusted R-squared is often used to summarize the multiple
fit as it takes into account the number of variables in the model
ª it is the coefficient of determination adjusted for df
Adjusted R² = 1 − MSE / MS(Total)
where MSE = SSE/(n − k − 1), and MS(Total) is the sample variance of y
Adjusted R² ≤ R²
14 / 17
Regression diagnostics: multicollinearity
In addition to nonnormality and heteroskedasticity, the regression
diagnostics for a multiple model checks also for multicollinearity
Multicollinearity occurs when two or more independent variables
are highly correlated with one another
ª hence it is very difficult to separate their particular effects and influences on y
It causes inflated standard errors for estimates of regression
parameters and very large regression coefficients
Some consequences of this inflation are:
– a large variability of the samples, which means that the sample
coefficients may be far from the population parameters, and hence
with wide confidence intervals
– small t statistics that suggest no linear relationship between involved
variables and the response variable, and such inference may be wrong
15 / 17
Multicollinearity
Multicollinearity can be avoided if one anticipates the problem
from theory or past experiences
ª multiple correlation scores can serve as a guide
Beware that two independent variables can be highly correlated
with each other (or with another predictor) but uncorrelated with
the dependent variable
ª they may be non-redundant suppressor variables
A stepwise regression (backward and forward) can serve to
minimize multicollinearity in the modelling
ª these methods are based on improving the model’s fit
16 / 17
Multiple regression analysis
WORKING EXAMPLE
[Prediction of avg. Household Size in Global Cities]
Multiple regression analysis using globalcity-multiple.sav
17 / 17
BUSINESS STATISTICS II
Lecture – Week 17
Antonio Rivero Ostoic
School of Business and Social Sciences
 April 
AARHUS
UNIVERSITYAU
Today’s Outline
Model building in multiple linear regression
– predictors
Comparing regression models
Stepwise regression
Working example
– model building
– model comparison
Further issues (...)
2 / 16
Model building in multiple linear regression
The main goal in model building is to fit a model that explains
variation of the dependent variable with a small set of predictors
ª i.e. a model that efficiently forecasts the response variable of interest
When dealing with multiple independent variables, each subset of
x’s represents a potential model of explanation
ª for k predictors in the data set there are 2ᵏ − 1 subsets of independent variables
Thus we want to establish a linear equation that predicts ‘best’ the
values of y by using more than one explanatory variable
 Recall that to obtain a good model we need a R2
score closer to 1, a
small value for SE , and a large F statistic (which implies a small SSE)
3 / 16
Predictors
There are two types of independent variables to consider, and they
correspond to the numeric and the categorical variables
– Factors characterize qualitative data
– Covariates represent quantitative data
Predictors = Factors + Covariates
 Sometimes an abstraction made on a numeric variable is called a factor that
explains the theory in the regression model, and covariate is simply a control
variable
4 / 16
Comparing Regression Models
cf. F-general in Note 2
To test whether a model fits significantly better than a simpler model
In this case a restricted or reduced model is nested within an
unrestricted or complete model
ª i.e. one model is contained in another model
The test statistics can be based on the SSE or on the R2 values for
both models
Fchange = [ (R²U − R²R) / df1 ] / [ (1 − R²U) / df2 ]
where df1 = q = kU − kR (i.e. number of variable restrictions), and
df2 = n − kU − 1
5 / 16
Comparing Regression Models
F-general with sum of squares
On the other hand, by considering the sum of squares of the
residuals, the F statistics becomes
Fchange = [ (SSER − SSEU) / df1 ] / [ SSEU / df2 ]
with the same df’s as before, and we take the absolute value
SPSS
We need to combine in Analyze Regression Linear the two models
with a different variable selection Method (Enter and Remove in
Blocks 1 and 2), and check R squared change in Statistics...
6 / 16
Comparing Regression Models
nested models
SPSS
The syntax procedure for comparing two nested models is..:
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA CHANGE
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT y
/METHOD=ENTER x1 x2
/METHOD=REMOVE x2.
7 / 16
Comparing Regression Models
...that for the data in Note 2 produces this outcome for both models:
Model Summary
Model 1: R = ,55; R Square = ,304; Adjusted R Square = ,297; Std. Error of the Estimate = 67,45215; R Square Change = ,304; F Change = 48,426; df1 = 3; df2 = 333; Sig. F Change = ,000
Model 2: R = ,41; R Square = ,167; Adjusted R Square = ,164; Std. Error of the Estimate = 73,57910; R Square Change = −,137; F Change = 32,811; df1 = 2; df2 = 333; Sig. F Change = ,000
a. Predictors: (Constant), years potential experience, years of education, years with current employer
b. Predictors: (Constant), years of education
ª the Fchange for Model 2 is for kU = 3 and kR = 1
 this statistic is also equivalent to the F score in the analysis of
variance of both models
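ª As a check: going from the unrestricted model 1 to the restricted model 2,
Fchange = [(,304 − ,167)/2] / [(1 − ,304)/333] ≈ 32,8, which matches the reported
32,811 up to rounding of the displayed R² values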
8 / 16
Stepwise regression
Variable selection
A sequential procedure to perform multiple regressions is
found in the stepwise method
It combines forward selection of predictors and backward
elimination of the independent variables
These are bottom-up and top-down processes based on
F scores and predefined p values
ª defaults in SPSS are 5% for IN, and 10% for OUT
9 / 16
WORKING EXAMPLE
[Average Household Size in Global Cities]
Model Building
(Data in globalcity-multiple.sav)
10 / 16
Avg. household size in global cities
Model assessment
Model Summary
Model 1: R = ,713; R Square = ,508; Adjusted R Square = ,506; Std. Error = 1,19059; R Square Change = ,508; F Change = 239,764; df1 = 1; df2 = 232; Sig. F Change = ,000
Model 2: R = ,760; R Square = ,578; Adjusted R Square = ,574; Std. Error = 1,10517; R Square Change = ,070; F Change = 38,248; df1 = 1; df2 = 231; Sig. F Change = ,000
Model 3: R = ,787; R Square = ,620; Adjusted R Square = ,615; Std. Error = 1,05170; R Square Change = ,041; F Change = 25,087; df1 = 1; df2 = 230; Sig. F Change = ,000
Model 4: R = ,798; R Square = ,637; Adjusted R Square = ,631; Std. Error = 1,02944; R Square Change = ,018; F Change = 11,053; df1 = 1; df2 = 229; Sig. F Change = ,001
Model 5: R = ,805; R Square = ,648; Adjusted R Square = ,641; Std. Error = 1,01542; R Square Change = ,011; F Change = 7,367; df1 = 1; df2 = 228; Sig. F Change = ,007
a. Predictors: (Constant), Household Connection to Water
b. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person
c. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person, Overall Child Mortality
d. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person, Overall Child Mortality, Informal Employment
e. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person, Overall Child Mortality, Informal Employment, Percent Woman Heade of Households
11 / 16
WORKING EXAMPLE
[Average Household Size in Global Cities]
Comparing nested models
12 / 16
Avg. household size in global cities
models 4 and 5
The F change in the two nested models is given in:
Model Summary
Model 1: R = ,805; R Square = ,648; Adjusted R Square = ,641; Std. Error = 1,01542; R Square Change = ,648; F Change = 84,113; df1 = 5; df2 = 228; Sig. F Change = ,000
Model 2: R = ,798; R Square = ,637; Adjusted R Square = ,631; Std. Error = 1,02944; R Square Change = −,011; F Change = 7,367; df1 = 1; df2 = 228; Sig. F Change = ,007
a. Predictors: (Constant), Percent Woman Heade of Households, Informal Employment, Average Income Q3 Person, Overall Child Mortality, Household Connection to Water
b. Predictors: (Constant), Informal Employment, Average Income Q3 Person, Overall Child Mortality, Household Connection to Water
13 / 16
Avg. household size in global cities
the final model?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4130 0.3705 14.61 0.0000
x10 −0.0191 0.0031 −6.13 0.0000
x3 −0.0001 0.0000 −3.95 0.0001
x5 0.0790 0.0157 5.04 0.0000
x9 0.0131 0.0041 3.18 0.0017
x6 −0.0104 0.0038 −2.71 0.0072
And what about this other one..?
y = x4 + x5 + x6 + x8 + x9 + x10
14 / 16
Further Issues
multiple regression
Comparison of separate models
Regression diagnostics
Collinearity tests
Logarithmic transformations
Interpretation of results
15 / 16
Summary & Conclusions
Find a parsimonious model that effectively explains y
Model comparison combines evaluation of the fits and the
significance of regression coefficients
ª available automated procedures
To compare nested models we use the F statistics
ª working example, and data in note 2
WORKING EXAMPLE:
“It seems that the inclusion of the ratio of woman head of households
improves the model, but does it contribute to explain the change in the
average of the household size in the global cities?”
16 / 16
BUSINESS STATISTICS II
Lecture – Week 18
Antonio Rivero Ostoic
School of Business and Social Sciences
 April 
AARHUS
UNIVERSITYAU
Today’s Outline
Polynomial regression models
Regression models with interaction
Comparing models (note 3)
Dummy variables
2 / 20
Polynomial regression
Polynomial regression is a particular case of a regression model that
produces a curvilinear relationship between response and predictor
Recall that simple regression equations represent first-order models
y = β0 + β1x + ε
Here the order of the equation p equals 1 and the relation between
the predictor and the response is depicted by a regression line
ª the model has a ‘degree 1 polynomial’
We can have regression equations with several independent
variables that are polynomial models and still have just one
predictor variable
 Remember that when the parameters in the equation are linearly related,
then the polynomial regression model is considered as linear
3 / 20
First order and polynomial regression models
• First order model with two predictors: x1 and x2
y = β0 + β1x1 + β2x2 + ε
• First order model with k predictors: x1, . . . , xk
y = β0 + β1x1 + β2x2 + · · · + βkxk + ε
• Polynomial model with one predictor variable x and order p
y = β0 + β1x + β2x² + · · · + βpxᵖ + ε
ª thus a predictor variable can have various orders or powers
4 / 20
Second-order models
• A second-order (polynomial) model with a single predictor variable
has p = 2 and the equation represents a quadratic response
function depicted by a parabola
ª a ‘degree 2 polynomial’ or quadratic polynomial
y = β0 + β1x + β2x² + ε
 β1 controls the translation of the parabola, and β2 its
curvature rate
5 / 20
Quadratic effect of the regression coefficient
second-order model with β2x2
[Plots of y against x: for β2 = 1 the parabola opens upward (convex); for β2 = −1 it opens downward (concave)]
6 / 20
Third-order models
• A third-order (polynomial) model with a single predictor variable
has p = 3 and the equation represents a cubic response function
and depicted as a sigmoid curve
ª a ‘degree 3 polynomial’
y = β0 + β1x + β2x² + β3x³ + ε
 there are three regression coefficients that control for two
curvatures
7 / 20
Cubic effect of the regression coefficients
third-order model (for given β1 and β2)
[Plots of y against x: sigmoid response curves, one for β3 > 0 and one for β3 < 0]
8 / 20
Higher-order models and several predictor variables
Models with order > 3 are seldom used in regression analysis
ª typically because of overfitting in the model and the poor prediction power
However, so far we have seen multiple regression equations
involving several predictors that are related in an additive model
ª that is, the effect of each IV was not influenced by the other variables
As illustration, consider a monomial model with two predictors (from
the WORKING EXAMPLE)
y = 5.47 − .03 x10 + .02 x9
(avg. household size as a function of access to water and informal employment)
for x9 = 1 then ˆy = 5.49 − .03 x10
for x9 = 50 then ˆy = 6.47 − .03 x10
for x9 = 99 then ˆy = 7.45 − .03 x10
9 / 20
Additive model with 2 predictors
[Plot of y against the first predictor: fitted line ŷ = 5.49 − 0.03x]
10 / 20
Additive model with 2 predictors
[Same plot with two parallel fitted lines: ŷ = 5.49 − 0.03x and ŷ = 6.47 − 0.03x]
11 / 20
Additive model with 2 predictors
[Same plot with three parallel fitted lines: ŷ = 5.49 − 0.03x, ŷ = 6.47 − 0.03x and ŷ = 7.45 − 0.03x]
12 / 20
Comparing models
Note 3
Four models: (1) first order; (2) second order; (3) linear-log; (4) log-linear
a) The t test is used to compare models (1) and (2)
ª since (1) is the reduced version of (2) we can use the Fchange score for nested
models, where t = √F
b) Models (1) and (3) are not nested; we choose the one with the better fit
c) Models (2) and (3) are not nested either, and we rely on the adjusted R² since they have a
different number of predictors (performances are almost identical here...)
d) Comparing a log-linear model with an untransformed response requires
another approach and it is out of the scope...
13 / 20
Regression models with interaction
Many times the effect of a certain explanatory variable on the
response is affected by the value of another predictor of the model
In such cases there is an interaction between the two predictors,
and the influence of these variables on y does not operate in a
simple additive pattern
A first order model with interaction:
y = β0 + β1 x1 + β2 x2 + β3 x1 x2 +
where the effect of x1 on the response is influenced by x2, and vice versa
An interaction exists in the regression model when the effect of one predictor on the response changes with the value of another predictor
ª interaction effects are not always easy to interpret
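As a sketch only (with y, x1 and x2 standing in for the actual variable names), such a model could be estimated in SPSS by first computing the product term:
* Create the product term and fit the first-order model with interaction.
COMPUTE x1x2 = x1 * x2.
EXECUTE.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT y
  /METHOD=ENTER x1 x2 x1x2.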
14 / 20
Example
A model with the two predictors and their interaction from the
WORKING EXAMPLE
y = 6.58 − .04 x10 + .00 x9 + .00 x10 x9
shows no interaction because the estimate b3 equals zero
ª this may be explained by the high correlation between y and x9
15 / 20
Estimating multiple regression with interaction
An important concern with multiple regression is that lower-order terms tend to be highly correlated with their interaction term
Centering or standardizing the predictors reduces this problem
ª Centering means re-scaling a predictor by subtracting its mean from each observation; dividing the centered scores by the variable's standard deviation in addition standardizes the predictor
Model with interaction from the WORKING EXAMPLE with standardized
values
y = 1.11 − .50 x10 + .35 x9 + .16 x10 x9
for x9 = 1 then ˆy = 1.46 − .34 x10
for x9 = 2 then ˆy = 1.81 − .18 x10
 which means that the fitted lines are no longer parallel, unlike in the additive model
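One way to obtain such standardized estimates in SPSS is sketched below; the names y, x9 and x10 are stand-ins, and DESCRIPTIVES with /SAVE writes the z-scores as new variables prefixed with Z:
* Save standardized versions of the predictors (creates Zx9 and Zx10).
DESCRIPTIVES VARIABLES=x9 x10 /SAVE.
* Build the interaction from the z-scores and refit the model.
COMPUTE zx9zx10 = Zx9 * Zx10.
EXECUTE.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT y
  /METHOD=ENTER Zx10 Zx9 zx9zx10.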
16 / 20
Higher order models with interaction
Higher order models with interaction produce quadratic, cubic (W-, M- or other shaped) relationships between the response and each of the predictors
Model with a quadratic relationship and interaction
y = β0 + β1x1 + β2x2 + β3x1^2 + β4x2^2 + β5x1x2 + ε
will produce parabolas with crossing trajectories...
17 / 20
Regression with dummy variables
Until now we have been doing regression analysis using interval
scales of the data only
However, in many cases we have qualitative data that are represented by a nominal scale, and treating this type of data as interval yields misleading results
We can perform regression analysis by using dummy or indicator variables, which are artificial variables that encode whether or not an observation belongs to a certain group or category
ª code 1 for belonging, and code 0 otherwise
Indicator or dummy variables serve only for classification; their numerical magnitude has no meaning in this context
18 / 20
Regression with dummy variables
For 3 categories we use 2
indicator variables
I 1 I 2
Category 1 1 0
Category 2 0 1
Category 3 0 0
For 4 categories we use 3
indicator variables...
I 1 I 2 I 3
Category 1 1 0 0
Category 2 0 1 0
Category 3 0 0 1
Category 4 0 0 0
 How many dummies are required for a variable having two categories?
19 / 20
Dummies with command-line
We need to create a number of dummy variables according to
the existing number of categories.
Syntax in SPSS:
RECODE varlist_1 (oldvalue=newvalue) ... (oldvalue=newvalue)
  [INTO varlist_2]
  [/varlist_n ...].
EXECUTE.
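For instance, a hypothetical sketch for a nominal variable group coded 1, 2, 3, keeping category 3 as the omitted reference:
* Create two indicator variables; category 3 is the reference (assumes no missing values).
RECODE group (1=1) (ELSE=0) INTO d1.
RECODE group (2=1) (ELSE=0) INTO d2.
EXECUTE.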
20 / 20
BUSINESS STATISTICS II
Lecture – Week 19
Antonio Rivero Ostoic
School of Business and Social Sciences
 May 
AARHUS
UNIVERSITYAU
Today’s Outline
Qualitative Variables
Regression Models: Testing and Interpreting Results
• indicators
• multiple
• interaction
• (polynomial)
• logarithmic transformations
2 / 24
Qualitative independent variables
The effects of qualitative information on a response variable may
be an important result, and we need ways to include this type of
data in a regression model
Qualitative information corresponds to a nominal scale that may require pre-coding of the data into artificial variables known as dummies or indicator variables
Recall that a nominal scale includes different categories or
groups that serve to classify the observations, and qualitative
predictors are factors
A dichotomous factor has two categories (e.g. gender), whereas
a polytomous factor has more categories (e.g. seasons)
3 / 24
Indicator variables (dummies)
Indicator variables have only two values, typically 1 and 0, and for
m categories in the variable, we require m − 1 indicator variables
ª this means that there is an omitted category in the representation to avoid
redundancy
Ii = 1 if the observation belongs to category ci, and 0 otherwise.
The omitted category represents the baseline or ‘reference’
category to which we compare the other groups
ª the choice of the omitted category is arbitrary, and any choice leads to the same conclusions
If we do not omit one category and include indicator variables for
all categories in the regression model, then there is a perfect
multicollinearity among these independent variables
ª a phenomenon known as the dummy variable trap
4 / 24
Dataset for Notes 3 and 4
training data337.sav
Dependent variable: Wage, average hourly earnings (DKK)
Independent variables: Educ, education (years)
Tenur, current employment (years)
Exper, potential experience (years)
Female, gender (0: male, 1: female)
(Male, gender (0: female, 1: male))
5 / 24
Simple regression with an indicator variable
(dichotomous factor)
“The gender wage gap”
Are women paid less than men according to the data?
Wage = β0 + β1 Female + ε
Estimate Std. Error t value Pr(>|t|)
(Intercept) 161.9242 5.5013 29.43 0.0000
Female -62.8700 8.1117 -7.75 0.0000
 Women earn 62.87 DKK per hour less than men
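A sketch of the corresponding SPSS syntax, with the variables named Wage and Female as in the dataset description:
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT Wage
  /METHOD=ENTER Female.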
6 / 24
Simple regression with an indicator variable II
(dichotomous factor)
For a variable Male = 1 − Female, and the model:
Wage = β0 + β1 Male + ε
we get the following results:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.0542 5.9612 16.62 0.0000
Male 62.8700 8.1117 7.75 0.0000
 Likewise men earn 62.87 DKK per hour more than women
7 / 24
The dummy variable trap
What about this model?:
Wage = β0 + β1Female + β2Male + ε
In this case there is a duplicated category and the independent
variables are perfectly multicollinear
Male is an exact linear function of Female and of the intercept
ª Male = 1 − Female implies that Male + Female = 1
8 / 24
Multiple Regression with a dichotomous indicator variable
(factor and covariates)
An additive dummy-regression model:
Wage = β0 + β1Female + β2Educ + β3Tenure + ε
• (We already know that the model fit, R², never decreases when we add new independent variables to the model)
• The model now assumes that – besides gender – there is an effect
of education and tenure on the wage levels
• Since the model is additive, the predictors' effects are assumed independent of each other, and the regression equation fits identical slopes for both gender categories and for the other predictors as well
ª which implies parallel regression lines in the scatterplot
9 / 24
Testing partial coefficients
For model:
Wage = β0 + β1Female + β2Educ + β3Tenure + ε
Test the partial effect of gender:
H0 : β1 = 0
H1 : β1 ≠ 0
Test the partial effect of education:
H0 : β2 = 0
H1 : β2 ≠ 0
Test the partial effect of tenure:
H0 : β3 = 0
H1 : β3 ≠ 0
10 / 24
Testing partial coefficients
The t statistic is the estimated coefficient (minus its hypothesized value, which is 0 under H0) divided by the standard error of the estimate
ti = (bi − βi) / sbi
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.2529 20.5869 -2.39 0.0173
Female -46.7547 7.1544 -6.54 0.0000
Education 13.9233 1.4564 9.56 0.0000
Tenure 3.2485 0.4729 6.87 0.0000
11 / 24
Fitted values by gender: Additive model
Wage = β0 + β1Female + β2Educ + β3Tenure + ε
[Scatterplot of the unstandardized predicted values against years of education, with markers for female and male observations and a single fit line for the total sample; R² (linear) = 0.435]
12 / 24
Fitted values by gender: Additive model
Wage = β0 + β1Female + β2Educ + β3Tenure + ε
[Same scatterplot with separate fit lines for female and male observations; Male: R² (linear) = 0.575, Female: R² (linear) = 0.715]
13 / 24
Multiple regression with interaction: factor and covariate
(indicator variable and continuous variable)
• Many times additive models are unrealistic, and theory suggests different slopes for different categories
• To capture such differences in slope we allow for statistical interaction among the independent variables
Wage = β0 + β1Female + β2Educ + β3(Female × Educ) + ε
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.1088 26.1498 -0.69 0.4891
Female -23.9223 41.7171 -0.57 0.5667
Educ 13.7154 1.9550 7.02 0.0000
Female × Educ -2.6485 3.1844 -0.83 0.4062
ª The effect of gender on wage is influenced by education and vice versa (not significant here)
14 / 24
Fitted values by gender: Interaction model
Wage = β0 + β1Female + β2Educ + β3(Female × Educ) + ε
[Scatterplot of the unstandardized predicted values against years of education with separate fit lines by gender; the two lines are not parallel. Male: R² (linear) = 1, Female: R² (linear) = 1]
15 / 24
Testing interaction
We can test for interaction in the model
Wage = β0 + β1Female + β2Educ + β3(Female × Educ) + ε
The null hypothesis is that there is no interaction in the model, i.e.
H0 : β3 = 0
H1 : β3 ≠ 0
We now apply the general (or incremental) F statistic...
Fchange = [ (R²U − R²R) / df1 ] / [ (1 − R²U) / df2 ]
where df1 = q = kU − kR (i.e. the number of variable restrictions), and df2 = n − kU − 1
ª In this case the complete or unrestricted model has the statistical
interaction term whereas the reduced model does not have this term
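In SPSS the two R² values, and the F-change test itself, can be obtained by entering the interaction term in a second block and requesting change statistics; a sketch, assuming the product variable FemEduc is computed first:
COMPUTE FemEduc = Female * Educ.
EXECUTE.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /DEPENDENT Wage
  /METHOD=ENTER Female Educ
  /METHOD=ENTER FemEduc.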
16 / 24
Testing interaction
In an additive dummy-regression model it is possible to test for the effect of the categorical variable on the response controlling for a quantitative predictor, and vice versa (i.e. test for the effect of a covariate controlling for the factor)
 e.g. test gender on wage controlling for education, and test
education controlling for gender
In such cases the null hypothesis is that the coefficient of the variable
to be tested equals zero
17 / 24
Multiple Regression with a polytomous indicator variable
Data from Keller xm16-02.sav
A polytomous indicator variable has more than two categories:
Price = β0 + β1Odometer + β2I1 + β3I2 + ε
where the colour factor is coded with two indicator variables:
I1 = 1 if the colour is white, 0 otherwise
I2 = 1 if the colour is silver, 0 otherwise
• The reference category is ‘all other colours’, which is represented whenever I1 = I2 = 0
18 / 24
Multiple Regression with a polytomous indicator variable
• In a multiple regression with a polytomous indicator variable we obtain a coefficient for each group except for the reference category
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.8372 0.1971 85.42 0.0000
Odometer -0.0591 0.0051 -11.67 0.0000
White 0.0911 0.0729 1.25 0.2143
Silver 0.3304 0.0816 4.05 0.0001
• The t-test is adequate for the covariate (i.e. odometer), but for color we prefer to test the two indicator variables simultaneously, because the choice of the reference category is arbitrary
ª the F test allows us to do this (see the sketch below)
• Part of the interpretation of the results assumes that one or more
indicator variables equal 0
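A sketch of that joint test in SPSS: enter the covariate first and the two colour dummies as a second block, and request change statistics; the F-change for the second block then tests White and Silver simultaneously (assuming the two indicator variables already exist in the data or have been created with RECODE):
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /DEPENDENT Price
  /METHOD=ENTER Odometer
  /METHOD=ENTER White Silver.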
19 / 24
EXERCISE: MULTIPLE REGRESSION WITH A
POLYTOMOUS INDICATOR VARIABLE
[MBA data from Keller xm18-00.sav]
20 / 24
Interpreting Results
Recall that interpretation in regression analysis is on average, takes into account the units of measure of the variables involved, and in additive models holds the values of the other variables (including the error) constant
In regression with indicator variables, the coefficients of these variables represent the change in the response relative to the omitted reference group
The statistical significance of the regression coefficients should be reported together with the interpretation of their effects on the response, not on its own
The conclusions should account for the values of the regression
coefficients and the statistical significance of these outcomes
21 / 24
Interpreting logarithmic transformations
log here denotes the natural logarithm (base e)
In Note 3 models (3) and (4) have logarithmic transformations on
variables, and we will see how to interpret the results in these models
Model (3), level-log
Wage = β0 + β1Educ + β2 log(Tenure) +
In this level-log (linear-log) model, a one percent increase in tenure (years with the current employer) is associated with a change of β2/100 units in wage
unit ∆Wage / % ∆Tenure = b2 / 100
ª Since b2 = 31.32, then –holding education constant– a one percent change in
tenure is associated with 0.3132 DKK increase hourly in wage on average
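A sketch of how model (3) could be fitted in SPSS, assuming the tenure variable is named Tenur as in the dataset description (LnTenur is an assumed name; LN sets observations with Tenur = 0 to missing):
COMPUTE LnTenur = LN(Tenur).
EXECUTE.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT Wage
  /METHOD=ENTER Educ LnTenur.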
22 / 24
Interpreting logarithmic transformations
Model (4), log-level
log(Wage) = β0 + β1Educ + β2Tenure +
This is a log-level (log-linear) model, where a one unit increase in a predictor is associated with approximately a bi × 100 % change in wage
% ∆Wage / unit ∆xi = bi × 100
• Holding education constant, a one year increase with current employer is
associated with 2.5% increase in wage per hour on average
• Holding tenure constant, 1 year more of education is associated with 10.4%
increase in wage per hour on average
23 / 24
Interpreting logarithmic transformations
log-log models are interpreted as elasticity
i.e. the ratio of the percent change in one variable to the percent change in another variable
% ∆y / % ∆xi = bi
• One percent change in xi is associated with bi% change in y
(ceteris paribus)
 partial elasticity when we hold constant other variables
24 / 24
BUSINESS STATISTICS II
Lecture – Week 20
Antonio Rivero Ostoic
School of Business and Social Sciences
 May 
AARHUS
UNIVERSITYAU
Today’s Outline
Exam 2013
• Comparing groups (Q1)
• Regression analysis (Q3 and Q6)
2 / 27
Basic terminology
data...
quantitative = interval
qualitative = categorical, nominal
for regression...
Dependent variable, y = response variable; prediction; predicted
variable, ˆy
Independent variable, x = explanatory variable; predictor;
factor (qualitative), covariate (quantitative)
3 / 27
Exam 2013
The exam 2013 had 8 questions, and some were based on a
single data set
The data set contained 13 labor-market related variables (one of them a transformed variable) for 762 observations from men and women
ª however not all variables were needed to answer the questions
 After reading the instructions carefully, check the data in the software and label the variables with the provided descriptions and units of measure (if specified)
4 / 27
Comparing groups
Q1a) Do wages differ by gender?
• Implied variables: Wage K, and gender B
• Groups to compare: Wages for men and wages for women
Plot data in SPSS
ª Plot histogram for wage grouped by gender (B)
Graphs Legacy Dialogs Histogram where the variable is paneled by the two groups (optional normal curve)...
5 / 27
Comparing groups
Q1a) Do wages differ by gender?
We compare the means of these two groups through the t test
However, we first need to see whether these groups have equal variances or not
ª to know whether to use the pooled or the unpooled version of the t test
Thus we perform the F test for equality of variances first
Obtain basic descriptive statistics in SPSS
Analyze Reports Case Summaries... where the variable is paneled by the
two groups...
ª uncheck Display Cases and choose statistics
6 / 27
Review: F test and sample variance
H0 : σ1²/σ2² = 1
H1 : σ1²/σ2² ≠ 1
F = (s1²/σ1²) / (s2²/σ2²) = s1²/s2² under H0
for v1 = n1 − 1 and v2 = n2 − 1
where for 1, 2, ..., n observations the sample variance is
s² = Σ (xi − x̄)² / (n − 1)
7 / 27
Review: t test and sample mean
independent samples and H0 : µ1 = µ2
pooled
t = [ (x̄1 − x̄2) − (µ1 − µ2) ] / √( sp² · (1/n1 + 1/n2) )
where sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
with v = n1 + n2 − 2, used when σ1² = σ2²
unpooled
t = [ (x̄1 − x̄2) − (µ1 − µ2) ] / √( s1²/n1 + s2²/n2 )
with v = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
used when σ1² ≠ σ2²
where for 1, 2, ..., n observations the sample mean is
x̄ = Σ xi / n
8 / 27
F test to wages by gender
After obtained the F statistics, we check the critical values with
the respective degrees of freedom and the standard alpha value
ª use the Excel calculator or/and table for F-distribution
In this case the F ratio falls within the critical region, which means that we reject H0 of equal variances (the population variance ratio is not 1)
ª the p-value indicates that the result is statistically significant
Both outcomes suggest that there is evidence to infer that the two variances differ
 We know now that we can proceed with the analysis applying the
unpooled t test
9 / 27
t test to wages by gender
Q1a) Do wages differ by gender?
Although in this part the calculations are done by hand, you can compare your results with the output from SPSS
t test in SPSS
Analyze Compare Means Independent-Samples T Test... and the test variable
K is paneled by the two groups in B
ª We Define Groups... by putting 0 and 1 that characterize the gender variable
 Confidence intervals are also given in the table of the t test for
independent samples...
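The pasted syntax behind that dialog looks roughly like this (K is the wage variable and B the gender variable, as in the exam dataset); the output also reports Levene's test for equality of variances next to the two t-test rows:
T-TEST GROUPS=B(0 1)
  /VARIABLES=K
  /CRITERIA=CI(.95).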
10 / 27
Comparing groups
Q1b) Find a 95% confidence interval for tenure by gender
• Implied variables: Tenure G, and gender B
• Groups to compare: Tenure for men and for women
 In this case it is the pooled t test with its confidence interval estimator
11 / 27
Review: Confidence intervals for t test
pooled
Confidence interval estimator of µ1 − µ2 when σ1² = σ2²:
(x̄1 − x̄2) ± tα/2 · √( sp² · (1/n1 + 1/n2) )
for v = n1 + n2 − 2
12 / 27
Comparing groups
Q1c) Find a 95% CI by gender with  15 years of education
• Implied variables: Education I, and gender B
• Groups to compare: Men and women,  15 yrs. of educ.
 In this case the difference is between population proportions
13 / 27
Review: Confidence Interval of p1 − p2
(p̂1 − p̂2) ± zα/2 · √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
for unequal proportions, and n1p̂1, n1(1 − p̂1), n2p̂2, and n2(1 − p̂2) > 5
For the number of successes in the two populations, x1 and x2:
p̂1 = x1/n1 and p̂2 = x2/n2
14 / 27
Test of proportion p1 − p2
As when we compared means, the calculations for proportions should be
made manually. However to find x1, x2, and n1, n2 you can use SPSS
Proportion success in SPSS
Analyze Descriptive Statistics Crosstabs... where I is contrasted by B
- Alternatively, you can create a new indicator variable, say PL15 = 1 iff I ≥ 3
Transform Recode into Different Variables , and
in Old and New Values... recode to 1 the Range category 3 through 4, and 0
otherwise, after naming the new variable
Then get a report
Analyze Reports Case Summaries... where PL15 is the Variable that is
grouped by B, and specifying Number of Cases and Sum in Statistics
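The same steps in syntax form (a sketch; I is the education category and B the gender variable from the exam dataset):
* Cross-tabulate the education categories by gender.
CROSSTABS /TABLES=I BY B.
* Or create the indicator first and cross-tabulate it by gender.
RECODE I (3 THRU 4=1) (ELSE=0) INTO PL15.
EXECUTE.
CROSSTABS /TABLES=PL15 BY B.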
15 / 27
Test of proportion p1 − p2
By combining the two categorical variables, we obtain the sample estimates of the proportions for men and for women
ª and we are able to proceed with the arithmetic calculations
For 95% confidence interval, the multiplier zα/2 is 1.96
ª the score comes from the z table in Keller for 1 − (.05/2)
16 / 27
Comparing groups
Summary  comment
Q1a. Men earn significantly more than women
Q1b. With 95% confidence, the interval for the difference is 2.8 to 4 years: men have more years of market experience than women
Q1c. With 95% confidence, the interval for the difference is 10 to 24 percentage points: men have a lower schooling level than women
• Relate the implied variables of wage, tenure, and education
level for both groups
• Explain why the differences might occur in that way...
ª eventually using other variables from the data
17 / 27
Regression analysis
Q4a) Estimation and regression diagnostics
for an additive log linear regression model
Dependent variable: M, the natural logarithm of K, wage (hourly)
Independent variables: B, gender (male = 1, female = 0)
C, education (years)
G, market experience or tenure (years)
18 / 27
Regression analysis
Q4. where log = ln
The regression equation represents a log level model:
ln(Wage) = β0 + β1 Male + β2 Educ + β3 Tenure + ε
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4353 0.0594 74.65 0.0000
Male 0.1268 0.0132 9.63 0.0000
Educ 0.0431 0.0031 13.79 0.0000
Tenure 0.0089 0.0019 4.69 0.0000
19 / 27
Model diagnostics
Multiple regression
After performing linear regression analysis...
Check the assumptions
ε | x ∼ N(0, σ²)
and evaluate the fit and multicollinearity by:
• looking at the correlations among the independent variables
• viewing the histogram of the standardized residuals for the model
• plotting the residuals against the predicted values
(see the syntax sketch below)
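A syntax sketch of these checks, using the exam dataset names (M is the log of wage; B, C and G are the predictors):
* Correlations among the independent variables.
CORRELATIONS /VARIABLES=B C G.
* Regression with a histogram of the standardized residuals
* and a residuals-versus-predicted-values scatterplot.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT M
  /METHOD=ENTER B C G
  /SCATTERPLOT=(*ZRESID,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID).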
20 / 27
Regression results
Q4b) Interpretation of the estimation results
The fitted model is
ˆy = 4.435 + .127 · B + .043 · C + .009 · G
This means that men earn about 12.7% more than women, and that wages rise by about 4.3% for an extra year of education and by almost 1% for an extra year of market experience
ª interpretation as ceteris paribus or all things being equal
Then we interpret individual coefficients in the log-level model as the approximate proportional change in y for a unit change in xi
21 / 27
Regression results
Q4c.
However sometimes we need fitted values in the units of measure
of the untransformed response given a set of values in the IVs
ª e.g., How much wage is expected for a man or a woman having 12 years of
education and 15 years of experience?
In this case we apply the exponential function to both sides of the regression equation
e^ln(ŷ) = e^(b0 + b1x1 + b2x2 + b3x3)
where e ≈ 2.718282
It means for the model that we obtain the value of K rather than the
value of ln(K)
22 / 27
Regression results
Q4c.
The fitted value for a man with 12 years of education, and 15 years of
market experience is
ˆy = 4.435 + .127 · 1 + .043 · 12 + .009 · 15 = 5.213
and the expected return on wage is
e^5.213 = 183.64 per hour (in DKK)
On the other hand, for a woman with similar level of education and
experience the fitted value is
ˆy = 4.435 + .127 · 0 + .043 · 12 + .009 · 15 = 5.086
and the expected return on wage is
e^5.086 = 161.69 per hour (in DKK)
23 / 27
Prediction interval
Manual regression analysis
Q6) Construct a 95% prediction interval for y given x = 30
Where the fitted line for n = 100 and R² = .755 is
ˆy = 6.92 + .237x
Some descriptives for location and dispersion are:
• x̄ = 13 and sx² = 121
• ȳ = 10 and sy² = 9
And the Anova table shows:
• SSR = 672.61, df = 1
• SSE = 218.39, df = 98
24 / 27
Prediction interval
Q6.
The prediction interval for ŷ* given x* = 30 is
ŷ* ± tα/2, n−k−1 · √[ MSE · (1 + 1/n + (x* − x̄)² / ((n − 1) · sx²)) ]
where
MSE = s² = SSE / (n − k − 1)
ª In this case the point estimate ŷ* is obtained by inserting x* into the fitted equation, with k = 1
ª the multiplier can be obtained from the MS Excel calculator for the t distribution
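As a rough worked check with the numbers given on the previous slide (rounded along the way, so the final digits are approximate):
ŷ* = 6.92 + .237 · 30 = 14.03
MSE = 218.39 / 98 ≈ 2.229
1 + 1/100 + (30 − 13)² / (99 · 121) ≈ 1 + .010 + .024 = 1.034
t.025, 98 ≈ 1.984
14.03 ± 1.984 · √(2.229 · 1.034) ≈ 14.03 ± 3.01, i.e. approximately (11.0, 17.0)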
25 / 27
Prediction interval
Q6. without Anova table
It is also possible to calculate SSE from the sample variances and R²
SSE = (n − 1) · ( sy² − sxy² / sx² )
where the squared covariance sxy² = R² · sx² · sy² (this follows from R² = sxy² / (sx² · sy²))
Or alternatively:
SSE = SSy · (1 − R²)
where SSy = (n − 1) · sy²
 Thus there are various possibilities for the calculations...
26 / 27
Thank you
Good luck!
More Related Content

What's hot

Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDouglas Joubert
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsSr Edith Bogue
 
Das20502 chapter 1 descriptive statistics
Das20502 chapter 1 descriptive statisticsDas20502 chapter 1 descriptive statistics
Das20502 chapter 1 descriptive statisticsRozainita Rosley
 
Basic Stat Notes
Basic Stat NotesBasic Stat Notes
Basic Stat Notesroopcool
 
Statistics is the science of collection
Statistics is the science of collectionStatistics is the science of collection
Statistics is the science of collectionWaleed Liaqat
 
Descriptive Statistics, Numerical Description
Descriptive Statistics, Numerical DescriptionDescriptive Statistics, Numerical Description
Descriptive Statistics, Numerical Descriptiongetyourcheaton
 
Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)HennaAnsari
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAnand Thokal
 
Introduction to statistics
Introduction to statisticsIntroduction to statistics
Introduction to statisticsDhwani Shah
 
General Statistics boa
General Statistics boaGeneral Statistics boa
General Statistics boaraileeanne
 
Statistics Class 10 CBSE
Statistics Class 10 CBSE Statistics Class 10 CBSE
Statistics Class 10 CBSE Smitha Sumod
 
Aed1222 lesson 2
Aed1222 lesson 2Aed1222 lesson 2
Aed1222 lesson 2nurun2010
 
Statistics Math project class 10th
Statistics Math project class 10thStatistics Math project class 10th
Statistics Math project class 10thRiya Singh
 
Statistical Methods
Statistical MethodsStatistical Methods
Statistical Methodsguest2137aa
 
Aed1222 lesson 6 2nd part
Aed1222 lesson 6 2nd partAed1222 lesson 6 2nd part
Aed1222 lesson 6 2nd partnurun2010
 

What's hot (20)

Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
 
Day 3 descriptive statistics
Day 3  descriptive statisticsDay 3  descriptive statistics
Day 3 descriptive statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Das20502 chapter 1 descriptive statistics
Das20502 chapter 1 descriptive statisticsDas20502 chapter 1 descriptive statistics
Das20502 chapter 1 descriptive statistics
 
Basic Stat Notes
Basic Stat NotesBasic Stat Notes
Basic Stat Notes
 
Statistics is the science of collection
Statistics is the science of collectionStatistics is the science of collection
Statistics is the science of collection
 
Descriptive Statistics, Numerical Description
Descriptive Statistics, Numerical DescriptionDescriptive Statistics, Numerical Description
Descriptive Statistics, Numerical Description
 
Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)
 
Statistics
StatisticsStatistics
Statistics
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Introduction to statistics
Introduction to statisticsIntroduction to statistics
Introduction to statistics
 
General Statistics boa
General Statistics boaGeneral Statistics boa
General Statistics boa
 
Tabular and Graphical Representation of Data
Tabular and Graphical Representation of Data Tabular and Graphical Representation of Data
Tabular and Graphical Representation of Data
 
Statistics Class 10 CBSE
Statistics Class 10 CBSE Statistics Class 10 CBSE
Statistics Class 10 CBSE
 
Aed1222 lesson 2
Aed1222 lesson 2Aed1222 lesson 2
Aed1222 lesson 2
 
Statistics Math project class 10th
Statistics Math project class 10thStatistics Math project class 10th
Statistics Math project class 10th
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Statistical Methods
Statistical MethodsStatistical Methods
Statistical Methods
 
Aed1222 lesson 6 2nd part
Aed1222 lesson 6 2nd partAed1222 lesson 6 2nd part
Aed1222 lesson 6 2nd part
 
Panel slides
Panel slidesPanel slides
Panel slides
 

Similar to Business statistics-ii-aarhus-bss

manecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptxmanecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptxasdfg hjkl
 
Advanced Econometrics L1-2.pptx
Advanced Econometrics L1-2.pptxAdvanced Econometrics L1-2.pptx
Advanced Econometrics L1-2.pptxakashayosha
 
Presentation1 quality control-1.pptx
Presentation1 quality control-1.pptxPresentation1 quality control-1.pptx
Presentation1 quality control-1.pptxrakhshandakausar
 
Managerial Economics - Demand Estimation (regression analysis)
Managerial Economics - Demand Estimation (regression analysis)Managerial Economics - Demand Estimation (regression analysis)
Managerial Economics - Demand Estimation (regression analysis)JooneEltanal
 
8 correlation regression
8 correlation regression 8 correlation regression
8 correlation regression Penny Jiang
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.pptTanyaWadhwani4
 
ForecastingBUS255 GoalsBy the end of this chapter, y.docx
ForecastingBUS255 GoalsBy the end of this chapter, y.docxForecastingBUS255 GoalsBy the end of this chapter, y.docx
ForecastingBUS255 GoalsBy the end of this chapter, y.docxbudbarber38650
 
Newbold_chap14.ppt
Newbold_chap14.pptNewbold_chap14.ppt
Newbold_chap14.pptcfisicaster
 
Course Title: Introduction to Machine Learning, Chapter 2- Supervised Learning
Course Title: Introduction to Machine Learning,  Chapter 2- Supervised LearningCourse Title: Introduction to Machine Learning,  Chapter 2- Supervised Learning
Course Title: Introduction to Machine Learning, Chapter 2- Supervised LearningShumet Tadesse
 
Managerial Economics (Chapter 5 - Demand Estimation)
 Managerial Economics (Chapter 5 - Demand Estimation) Managerial Economics (Chapter 5 - Demand Estimation)
Managerial Economics (Chapter 5 - Demand Estimation)Nurul Shareena Misran
 
An econometric model for Linear Regression using Statistics
An econometric model for Linear Regression using StatisticsAn econometric model for Linear Regression using Statistics
An econometric model for Linear Regression using StatisticsIRJET Journal
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis pptElkana Rorio
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionKhalid Aziz
 
Linear functions and modeling
Linear functions and modelingLinear functions and modeling
Linear functions and modelingIVY SOLIS
 
Econometrics_1.pptx
Econometrics_1.pptxEconometrics_1.pptx
Econometrics_1.pptxSoumiliBera2
 
5.0 -Chapter Introduction
5.0 -Chapter Introduction5.0 -Chapter Introduction
5.0 -Chapter IntroductionSabrina Baloi
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11Bonnie Green
 

Similar to Business statistics-ii-aarhus-bss (20)

manecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptxmanecohuhuhuhubasicEstimation-1.pptx
manecohuhuhuhubasicEstimation-1.pptx
 
Advanced Econometrics L1-2.pptx
Advanced Econometrics L1-2.pptxAdvanced Econometrics L1-2.pptx
Advanced Econometrics L1-2.pptx
 
Presentation1 quality control-1.pptx
Presentation1 quality control-1.pptxPresentation1 quality control-1.pptx
Presentation1 quality control-1.pptx
 
Managerial Economics - Demand Estimation (regression analysis)
Managerial Economics - Demand Estimation (regression analysis)Managerial Economics - Demand Estimation (regression analysis)
Managerial Economics - Demand Estimation (regression analysis)
 
8 correlation regression
8 correlation regression 8 correlation regression
8 correlation regression
 
Unit 03 - Consolidated.pptx
Unit 03 - Consolidated.pptxUnit 03 - Consolidated.pptx
Unit 03 - Consolidated.pptx
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.ppt
 
ForecastingBUS255 GoalsBy the end of this chapter, y.docx
ForecastingBUS255 GoalsBy the end of this chapter, y.docxForecastingBUS255 GoalsBy the end of this chapter, y.docx
ForecastingBUS255 GoalsBy the end of this chapter, y.docx
 
Newbold_chap14.ppt
Newbold_chap14.pptNewbold_chap14.ppt
Newbold_chap14.ppt
 
Course Title: Introduction to Machine Learning, Chapter 2- Supervised Learning
Course Title: Introduction to Machine Learning,  Chapter 2- Supervised LearningCourse Title: Introduction to Machine Learning,  Chapter 2- Supervised Learning
Course Title: Introduction to Machine Learning, Chapter 2- Supervised Learning
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Managerial Economics (Chapter 5 - Demand Estimation)
 Managerial Economics (Chapter 5 - Demand Estimation) Managerial Economics (Chapter 5 - Demand Estimation)
Managerial Economics (Chapter 5 - Demand Estimation)
 
An econometric model for Linear Regression using Statistics
An econometric model for Linear Regression using StatisticsAn econometric model for Linear Regression using Statistics
An econometric model for Linear Regression using Statistics
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis ppt
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Linear functions and modeling
Linear functions and modelingLinear functions and modeling
Linear functions and modeling
 
Regressionanalysis
RegressionanalysisRegressionanalysis
Regressionanalysis
 
Econometrics_1.pptx
Econometrics_1.pptxEconometrics_1.pptx
Econometrics_1.pptx
 
5.0 -Chapter Introduction
5.0 -Chapter Introduction5.0 -Chapter Introduction
5.0 -Chapter Introduction
 
SOC2002 Lecture 11
SOC2002 Lecture 11SOC2002 Lecture 11
SOC2002 Lecture 11
 

Recently uploaded

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 

Recently uploaded (20)

Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 

Business statistics-ii-aarhus-bss

  • 1. BUSINESS STATISTICS II PART II: Lectures Weeks 11 – 19 Antonio Rivero Ostoic School of Business and Social Sciences  March −  May  AARHUS UNIVERSITYAU
  • 2. BUSINESS STATISTICS II Lecture – Week 11 Antonio Rivero Ostoic School of Business and Social Sciences  March  AARHUS UNIVERSITYAU
  • 3. Today’s Outline Simple regression analysis Estimation in a simple regression model (we use now SPSS) 2 / 28
  • 4. Introduction Galton (Darwin’s half-cousin) found in his observations that: – For short fathers, on average the son will be taller than his father – For tall fathers, on average the son will be shorter than his father Then he characterized these results with the notion of the “regression to the mean” Pearson and Lee took Galton’s law about the relationship between heights of children and parents, and came up with the regression line: son’s height = 33.73 + .516 × father’s height ª This equation shows that for additional inch of father’s height the son’s height increases on average by .516 3 / 28
  • 5. Regression Analysis Regression analysis is used to predict one variable on the basis of other variables ª i.e. to forecasting It serves from a model that describes the relationship between a variable to estimate and the variables that influences this variable – Response variable is called dependent variable, y – Explanatory variables are called independent variables, x1, x2, . . . , xk Correlation analysis serves to determine whether a relationship exists or not between variables Does regression imply causation? 4 / 28
  • 6. Model A model comprises mathematical equations that accurately describes the nature of the relationship between DV and IVs Example for a deterministic model: F = P(1 + i)n where F = future value of an investment P = present value i = interest rate per period n = number of periods ª In this case we determine F from the values on the equation’s right hand 5 / 28
  • 7. Probabilistic model However, deterministic models can be sometimes unrealistic, since other variables that are unmeasurable and not known can influence the dependent variable Such types of variables represent uncertainty in real life and it should be included in the model In this case we rather use a probabilistic model in order to incorporate such randomness A probabilistic model then incorporates and unknown parameter called the error variable ª it accounts for all measurable and immeasurable variables that are not part of the model 6 / 28
  • 8. Simple linear regression model i.e. First Order model y = β0 + β1x + where y = dependent variable x = independent variable β = coefficients β0 = y-intercept β1 = slope of the line (rise/run) or (∆Y/∆X) = error variable ª Coefficients are population parameters, which need to be estimated ª The assumption is that the errors are normally distributed 7 / 28
  • 9. Expected values and variance for y The expected value for y it is a linear function of x, and y differs from its expected value by a random amount ª linear regression is a probabilistic model For x∗ = a particular value of x: E(y | x∗ ) = µy|x∗ (mean) V(y | x∗ ) = σ2 y|x∗ (variance) 8 / 28
  • 10. Estimating the Coefficients We estimate the coefficients as we estimated population parameters That is draw a random sample from the population and calculate sample statistics But here the coefficients are part of a straight line, and we need to estimate the line that represents ‘best’ the sample data points Least squares line ˆy = b0 + b1x here b0 = y-intercept, b1 = slope, and ˆy is the fitted value of y 9 / 28
  • 11. Least squares method cf. chap. 4 in Keller The least square method is an objective procedure to obtain a straight line, where the sum of squared deviations between the points and the line is minimized n i=1 (yi − ˆyi)2 The least squares line coefficients b1 = sxy s2 x b0 = y − b1x 10 / 28
  • 12. Least squares line coefficients For b1 and b0 sxy = n i=1(xi − x)(yi − y) n − 1 s2 x = n i=1(xi − x)2 n − 1 x = n i=1 xi n y = n i=1 yi n 11 / 28
  • 13. Least squares line coefficients This actually means that the values of ˆy on average come closest to the observed values of y There are shortcut formula for b1 (check sample variance pp. 110, and sample covariance pp.127) b0 and b1 are unbiased estimators of β0 and β1 12 / 28
  • 14. EXAMPLE 16.1 Annual Bonus and Years of Experience Determine the straight-line relationship between annual bonus and years of experience 13 / 28
  • 15. Working with SPSS In SPSS we distinguish two main working windows: 1) Data Editor, where the raw data and variables are displayed 2) Statistics Viewer, where scripts and reports are provided Both windows have: MENU SUBMENU ... COMMAND Each command corresponds to a function that bears one or several ARGUMENTS 14 / 28
  • 16. Working with SPSS Command-line like It is also possible to work directly with the functions Example of the script for a regression: REGRESSION /DEPENDENT dependent-variable /ENTER List-of.independents. SPSS distinguishes between COMMANDS, FILES, VARIABLES, and TRANSFORMATION EXPRESSIONS 15 / 28
  • 17. Data Editor in SPSS Analyze Regression Linear 16 / 28
  • 18. Report in SPSS GET FILE='C:auspssxm16-01.sav'. DATASET NAME DataSet1 WINDOW=FRONT. REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT Bonus /METHOD=ENTER Years. Regression Notes Output Created Comments Input Data Active Dataset Filter Weight Split File N of Rows in Working Data File Missing Value Handling Definition of Missing Cases Used Syntax 06-MAR-2014 12:45:25 C:auspssxm16-01.sav DataSet1 none none none 6 User-defined missing values are treated as missing. Statistics are based on cases with no missing values for any variable used. REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT Bonus /METHOD=ENTER Years. 17 / 28
  • 19. Regression Report in SPSS Variables Entered/Removeda Model Variables Entered Variables Removed Method 1 Yearsb . Enter Dependent Variable: Bonusa. All requested variables entered.b. Model Summary Model R R Square Adjusted R Square Std. Error of the Estimate 1 ,701a ,491 ,364 4,503 Predictors: (Constant), Yearsa. ANOVAa Model Sum of Squares df Mean Square F Sig. 1 Regression Residual Total 78,229 1 78,229 3,858 ,121b 81,105 4 20,276 159,333 5 Dependent Variable: Bonusa. Predictors: (Constant), Yearsb. Coefficientsa Model Unstandardized Coefficients Standardized Coefficients t Sig.B Std. Error Beta 1 (Constant) Years ,933 4,192 ,223 ,835 2,114 1,076 ,701 1,964 ,121 Dependent Variable: Bonusa. 18 / 28
  • 20. Regression Plot from SPSS Graphs Legacy Dialogs Scatter/Dot... Simple Scatter Years 654321 Bonus 20 15 10 5 0 y=0,93+2,11*x R2 Linear = 0,491 19 / 28
  • 21. Calculation of Residuals The deviations of the actual data points to the line are the residuals, which represents observations of ei = yi − ˆyi In this case the sum of squares for error (SSE) represents the minimized sum of squared deviations ª basis for other statistics to assess how well the linear model fits the data The standard error of the estimate is the square root of the proportion of SSE and the number of observations ª Remember that in SPSS the value of the residuals is given in the Anova table of the regression report 20 / 28
  • 22. Annual bonus and years of experience q q q q q q 0 1 2 3 4 5 6 7 051015 years bonus 21 / 28
  • 23. Annual bonus and years of experience: Residuals q q q q q q 0 1 2 3 4 5 6 7 051015 years bonus 2.9524 −4.1619 1.7238 −4.3905 5.4952 −1.619 22 / 28
  • 24. Regression examples Finance/economy: – The enterprise equity value and total sales – Number of VP executives and total assets – Quantity of new houses and amount of jobs created in a city – Amount of bananas harvest and the density of banana trees per km2 Social/health: – Number of violent crime and the poverty rate – Amount of infectious diseases and population growth – Amount of diseases from chronic illnesses and urbanization level – Number of kinds raised and the number of spouses 24 / 28
  • 25. Regression examples Miscellaneous: – IQ score development and the average global temperature per year – If a horse can run X mph, how fast will his offspring run? – Number of cigarettes smoked and number of chats having with people – Number of cigarettes smoked and time at the hospital ª (more politically correct!) That is, questions like: – For any set of values on an independent variable, what is my predicted value of a dependent variable? – If an independent variable raises its value by 1-unit, how the dependent variable results? 25 / 28
  • 26. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] We do examples with random numbers... 26 / 28
  • 27. Generating Random Numbers in SPSS Variable View: Create two variables for integers Data View: Choose number of observations in each variable Transform Compute Variable Arguments: Variable names in Target Variable, and Random Numbers in Function group Choose uniform rv and establish the range of the obs. values 27 / 28
  • 28. Summary Simple linear regression analysis is for the relationship between two interval variables The assumption is that the variables are linearly connected The intercept and the slope of the regression line are the coefficients to be estimated The least squares method produces estimates of these population parameters 28 / 28
  • 29. BUSINESS STATISTICS II Lecture – Week 12 Antonio Rivero Ostoic School of Business and Social Sciences  March  AARHUS UNIVERSITYAU
  • 30. Today’s Outline Review simple linear regression analysis Error variable in regression Model Assessment – standard error of estimate – testing the slope – coefficient of determination – other measures 2 / 26
  • 31. Review Simple Linear Regression Analysis Simple regression analysis serves to predict the value of a variable from the value of another variable A lineal regression model describes the variability of the data around the regression line The observations on a dependent variable y is a linear function of the observation on an independent variable x The population parameters are expressed in in two coefficients, the y-intercept and the slope of the line, which need to be estimated, plus a stochastic part ª y-intercept: the value of y when x equals 0 ª slope: the change in y for one-unit increase in x 3 / 26
  • 32. The Error Variable Remember that in probabilistic models we need to account for unknown and unmeasurable variables that represent noise or error The error variable is critical in estimating the regression coefficients – to establish whether there is a relationship between the dependent and independent variables via an inferential method – to estimate and predict through a regression equation Errors are independent to each other and this variable is normally distributed with mean 0 and standard deviation σ ª This is expressed as ∼ N(0, σ ) 4 / 26
  • 33. Expected values of y The dependent variable can be considered as a random variable normally distributed with expected values E(y) = β0 + β1x (mean) σ(y) = σ (standard deviation) Thus the mean of y depends on the value of the independent variable, whereas its standard deviation don’t shape of the distribution remains, but E(y) changes according to x 5 / 26
  • 34. Experimental data and Observations We have been typically working with examples based on observations However it is also possible to perform a controlled trial where we generate experimental data Regression analysis works with both types of data, since the main goal is to determinate how the IV is related to the DV For observations both variables are random, which joint probability is characterized by the bivariate normal distribution ª here the z dimension is a joint density function of the two variables These types of normality conditions are assumptions for the estimations in a simple linear regression model 6 / 26
  • 35. Assessing the Model We use the least squares method to produce the best straight line But a straight line may not be the best representation of the data We need to assess how well the linear model fits the data Methods to assess the model: – standard error of estimate – the t-test of the slope – the coefficient of determination all based on the SSE 7 / 26
  • 36. Standard error of estimate Recall the error variable assumptions: ∼ N(0, σ ) And the model is considered poor if σ is large, and it is considered perfect when the value is 0 Unfortunately we do not know this parameter, and we need to estimate σ from the sample data The estimation is based on the sum of squares for error (SSE) ª which is the minimized sum of squared deviations between the points and the regression line SSE = n i=1 (yi − ˆyi)2 = (n − 1) s2 y − s2 xy s2 x 8 / 26
  • 37. Standard error of estimate The standard error of estimate is the approximation of the conditional standard deviation of the dependent variable ª that is, the square root of the residual sum of squares divided by the number of degrees of freedom s = SSE n − 2 This is the square root of s2, which in fact is the MSE ª the df is actually number of cases − number of unknown parameters IN THE SPSS REPORT: The value for s is given in the Model Summary table for a linear regression analysis 9 / 26
  • 38. Testing the slope In this case we test whether or not the dependent variable is not linearly related to the independent variable ª this means that no matter what value x has, we would obtain the same value for ˆy In other words, the slope of the line represented by β1 equals zero, and this corresponds to a horizontal line in the plot 10 / 26
  • 39. Testing the slope: Uniform distribution with β1 = 0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q qq q qq q q q q q q q q q q q q q q q q q q qq q q q q q q q q qq q q q q q q q q q q q q q q q q q qq q q q q q q qq q q q qq q q q q q q q q q q q q q q q q q q q x y 11 / 26
  • 40. Testing the slope If our null hypothesis is that there is no a linear relationship among the dependent and independent variables, then we specify H0 : β1 = 0 H1 : β1 = 0 (two-tail test) If we do not reject H0, we either committed a Type II error (wrongly accepting the null hypothesis), or there is not much of a ‘linear’ relationship between the independent variable and the dependent variable However the relationship can be a quadratic, which corresponds to a polynomial regression ª In case we want to check for a positive (β1 0) or a negative (β1 0) linear relationship among the IV and DV, then we perform a one-tail test 12 / 26
  • 41. Quadratic relationship with β1 = 0 β1 = 0 x y a quadratic model: y = β0 + β1x + β2x2 + 13 / 26
  • 42. Estimator and sampling distribution For drawing inferences, b1 as an unbiased estimator of β1 E(b1) = β1 with an estimated SE sb1 = s (n − 1)s2 x that is based on the sample variance of x 14 / 26
  • 43. Estimator and sampling distribution If ∼ N(0, σ ) with values independent to each other, then we use the t-statistics sampling distribution Test statistics for β1 t = b1 − β1 sb1 Thus the t-statistic values are proportion of coefficients to their SE IN THE SPSS REPORT: The t-statistic values are given in the Coefficients table of the linear regression analysis 15 / 26
  • 44. Estimator and sampling distribution Confidential Interval estimator of β1 b1 ± tα/2 sb1 Test statistics and confidence interval estimators are for a Student t distribution with v = n − 2 IN SPSS: Confidence intervals are line Properties in the graph Chat Editor 16 / 26
  • 45. Coefficient of Determination To measure the strength of the linear relationship we use the coefficient of determination, R2 ª useful to compare different models R² = s²xy / (s²x · s²y) This is equal to R² = 1 − SSE / Σ(yi − ȳ)² 17 / 26
  • 46. Partitioning deviations in Example 16.1, i = 5 (scatter plot of bonus against years with the regression line) 18 / 26
  • 47. Partitioning deviations in Example 16.1, i = 5: x̄ = 3.5, ȳ = 8.33; for this observation xi = 5, yi = 17, ŷi = 11.504 19 / 26
  • 48. Partitioning deviations in Example 16.1, i = 5: the plot marks the deviations yi − ŷi, ŷi − ȳ, yi − ȳ, and xi − x̄ 20 / 26
  • 49. Partitioning deviations in Example 16.1, i = 2: xi = 2, yi = 1, ŷi = 5.162 21 / 26
  • 50. Partitioning the deviations (yi − ȳ) = (ŷi − ȳ) + (yi − ŷi) The difference between yi and ȳ is a measure of the variation in the dependent variable, and it equals: a) the difference between ŷi and ȳ, which is accounted for by the difference between xi and x̄ ª the variation in the DV is explained by the changes of the IV b) and the difference between yi and ŷi, which represents an unexplained variation in y If we square all parts of the equation, and sum over all sample points, we end up with a statistic for the variation in y total SS = explained SS + residual SS ª i.e. sum of squares for regression (SSR) and the sum of squares for error (SSE) 22 / 26
  • 51. Coefficient of Determination R² = 1 − SSE / Σ(yi − ȳ)² = (Σ(yi − ȳ)² − SSE) / Σ(yi − ȳ)² = (SS(Total) − SSE) / SS(Total) This is the proportion of variation explained by the regression model, which is the proportion of variation in y explained by x IN THE SPSS REPORT: R2 is given in the Model Summary table of the regression analysis 23 / 26
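  A minimal sketch of this decomposition in Python/NumPy (assuming y and the fitted values y_hat are already available, e.g. from the earlier sketch):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination from the sums of squares (illustrative sketch)."""
    ss_total = np.sum((y - y.mean()) ** 2)   # total variation in y
    sse = np.sum((y - y_hat) ** 2)           # unexplained (residual) variation
    ssr = ss_total - sse                     # variation explained by the regression
    return ssr / ss_total                    # equivalently 1 - sse / ss_total
```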
  • 52. Other measures to assess the model Correlation coefficient r = sxy / (sx · sy) We use a t-test for H0 : ρ = 0 t = r·√((n − 2) / (1 − r²)) which is t distributed with v = n − 2 when the variables are bivariate normally distributed Calculate r in SPSS Analyze Correlate Bivariate (select variables and choose Pearson) 24 / 26
  • 53. Other measures to assess the model F-test F = MSR / MSE for MSR = SSR/1 and MSE = SSE/(n − 2) This statistic is used to test H0 : β1 = 0 IN THE SPSS REPORT: • F-statistic value is given in the Anova table • Value of r is in the Model Summary table, whereas the t statistic is given in the table for the Coefficients in the regression analysis 25 / 26
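  A sketch of the F test in Python (illustrative only; note that in a simple regression, with one predictor, this F equals the square of the slope's t statistic):

```python
import numpy as np
from scipy import stats

def f_test_simple_regression(y, y_hat):
    """F test of H0: beta1 = 0 in a simple regression (sketch; one predictor, so k = 1)."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)           # unexplained variation
    ssr = np.sum((y_hat - y.mean()) ** 2)    # explained variation
    msr = ssr / 1                            # one regression df in simple regression
    mse = sse / (n - 2)                      # residual df = n - 2
    F = msr / mse
    p = stats.f.sf(F, 1, n - 2)
    return F, p
```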
  • 54. Summary The error variable corresponds to the probabilistic part of the regression model ª independent values that are normally distributed with mean 0 and sd σ The standard error of estimate serves to evaluate the regression model by assessing the conditional standard deviation of the dependent variable By testing the slope we can check whether there is a linear relationship or not between the independent and the dependent variables The coefficient of determination measures the strength of the linear relationship in the regression model 26 / 26
  • 55. BUSINESS STATISTICS II Lecture – Week 13 Antonio Rivero Ostoic School of Business and Social Sciences  March  AARHUS UNIVERSITYAU
  • 56. Today’s Outline The equation of the regression model Regression diagnostics 2 / 31
  • 57. Regression Equation The regression equation represents the model, where the dependent variable is the response of an independent explanatory variable ª the model stands for the entire population After assessing the model, our next task is to estimate and predict the values of the dependent variable In this case we differentiate the average response at the dependent variable from the prediction of the dependent variable from a new observation in the independent variable 3 / 31
  • 58. Estimating a mean value and predicting an individual value If a linear model such as y = β0 + β1x is considered satisfactory for the data, then ˆy = b0 + b1x will represent the sample equation for the estimation of the model ª (Here we predict the error term to be 0) 4 / 31
  • 59. Estimating a mean value and predicting an individual value For x∗ representing a specific value of the independent variable: ˆy = b0 + b1x∗ – is the point prediction of an individual value of the dependent variable when the value of the independent variable is x∗ – is the point estimate of the mean value of the dependent variable when the value of the independent variable is x∗ 5 / 31
  • 60. Interval estimators A small p-value for H0 : β1 = 0 suggests a nonzero slope in the regression line However, for a better judgment we need to see how closely the predicted value matches the true value of y There are two interval estimators: a) Prediction interval that predicts y for a given value of x b) Confidence interval estimator that estimates the mean of y for a given value of x 6 / 31
  • 61. Prediction interval individual intervals ª Used if we want to predict a one-time occurrence for a particular value of y when x has a given value For ŷ = b0 + b1xg the prediction interval is ŷ ± tα/2,n−2 · sε · √(1 + 1/n + (xg − x̄)² / ((n − 1)·s²x)) where xg is the given value of the independent variable Another way to express this interval is x∗ → ŷ∗, which implies that for x∗ that is a new value of x (or for a tested value of x) the prediction interval for ŷ∗ is ŷ∗ ± tα/2,n−2 · √(MSE · (1 + 1/n + (x∗ − x̄)² / sxx)) 7 / 31
  • 62. Confidence interval estimator the average prediction interval For E(y) = β0 + β1x (i.e. for the mean of the dependent variable) the confidence interval estimator is ŷ ± tα/2,n−2 · sε · √(1/n + (xg − x̄)² / ((n − 1)·s²x)) That is, for x∗ → ŷ∗, the mean prediction interval for ŷ∗ is ŷ∗ ± tα/2,n−2 · √(MSE · (1/n + (x∗ − x̄)² / sxx)) ª where MSE equals σ̂²ε, whereas sxx is the unnormalized form of V(X) 8 / 31
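  Before turning to SPSS, a hedged Python sketch that evaluates both intervals at a single given value x_star (the function name intervals_at is ours; the data vectors are assumed supplied by the user):

```python
import numpy as np
from scipy import stats

def intervals_at(x, y, x_star, alpha=0.05):
    """Prediction interval and mean (confidence) interval at x_star for a simple
    linear regression; an illustrative sketch of the two formulas above."""
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
    sxx = np.sum((x - x.mean()) ** 2)                 # (n - 1) times the sample variance of x
    y_star = b0 + b1 * x_star
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    half_pred = t_crit * np.sqrt(mse * (1 + 1 / n + (x_star - x.mean()) ** 2 / sxx))
    half_mean = t_crit * np.sqrt(mse * (1 / n + (x_star - x.mean()) ** 2 / sxx))
    return (y_star - half_pred, y_star + half_pred), (y_star - half_mean, y_star + half_mean)
```

  The individual (prediction) interval is always wider than the mean interval because of the extra 1 under the square root.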
  • 63. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Data generation in SPSS • Choose your DV and IV, and number of observations. Then generate uniform random numbers: Transform Compute Variable... • Variable names in Target Variable , and Random Numbers in Function group • Select Rv.Uniform in Functions and Special Variables , and then establish the range of the observation values in Numeric Expression 9 / 31
  • 64. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Confidence intervals of the regression model in SPSS • We perform the linear regression analysis Analyze Regression Linear • Individual confidence intervals are given in this command, where under the Save button we select in Prediction Intervals – the Individual option for the Prediction Interval – the Mean option for the Confidence Interval Estimator Both at the usual 95% value 10 / 31
  • 65. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Confidence intervals of the regression model in SPSS (2) • Since we have chosen Save , the confidence interval values are saved in the Data Editor ª here LMCI [UMCI] and LICI [UICI] stand respectively for Lower [Upper] Mean and Individual Confidence Interval The Variable View in the Data Editor gives the labels of the new variables 11 / 31
  • 66. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Visualizing confidence intervals in SPSS • The visualization of both types of confidence intervals is possible after we have plotted the variables Graphs Legacy Dialogs Scatter/Dot... Simple Scatter • From Elements Fit Line at Total of the graph Chart Editor, we look in the tab Fit Line (Properties) for the options Mean and Individual in the Confidence Intervals section for the two CI estimators 12 / 31
  • 67. Confidence bands from SPSS Example 16.2 in Keller (scatter plot of Price against Odometer with the fitted line ŷ = 17.25 − 0.07x, its confidence bands, and R² Linear = 0.648) 13 / 31
  • 68. EXAMPLE-DO-IT-YOUR-SELF [A predicted variable and a predictor variable] Predict new observations in SPSS • To forecast new observations, first we need to put the new value of the independent variable in the Data Editor • Then we choose a linear regression analysis Analyze Regression Linear • And, after we press the Save button, we select the Unstandardized option in Predicted Values 14 / 31
  • 69. Regression Diagnostics Here we are concerned with evaluating the prediction model that includes some error or noise ei = yi − ŷi thus the residual equals each observation minus its estimated value Recall that in regression analysis there are some assumptions made for the error variable ª errors are independent of each other, normally distributed, and hence with a constant variance 15 / 31
  • 70. Regression Diagnostics A regression diagnostics checks for two things: a) whether or not the conditions for the error are fulfilled b) for the unusual observations (those that fall far from the regression line), and determine whether or not these values result from a fault in the sampling we look at several diagnostic methods for unwanted conditions 16 / 31
  • 71. Residual analysis Residual analysis focuses on the differences between the observations and the predictions made in the linear model Residual Analysis in SPSS Residual analysis is based on standardized and unstandardized residuals • After choosing linear regression analysis Analyze Regression Linear • When we press the Save button, we select the Standardized and Unstandardized options in Residuals ª Recall that these values are recorded in the Data View of the Data Editor 17 / 31
  • 72. Nonnormality The nonnormality check of the error variable is made by visualizing the distribution of the residuals ª we use the histogram for this Nonnormality in SPSS The histogram of residuals is obtained from Graphs Legacy Dialogs Histogram... • And we choose RES (which corresponds to the unstandardized residuals) for the Variable option 18 / 31
  • 73. Nonnormality Nonnormality in SPSS (2) It is also possible to obtain the distribution shape in the histogram • In the Chart Editor we go to Elements Show Distribution and choose Normal 19 / 31
  • 74. Heteroscedasticity Heteroscedasticity (or heteroskedasticity) is the term used when the assumption of equal variance of the error variable is violated ª homoscedasticity has the opposite implication, meaning ‘homogeneity of variance’ To test the heterogeneity of variance in the error variable we can plot the residuals against the predicted values of the DV ª then we look for the spreading of the points; if the variation in ei = yi − ˆyi increases as yi increases, the errors are called heteroscedastic This type of graph is sometimes called the ei − ˆyi plot 20 / 31
  • 75. Heteroscedasticity Heteroscedasticity in SPSS The heteroskedasticity condition is evaluated by the ei − ŷi plot Graphs Legacy Dialogs Scatter/Dot... Simple Scatter • And choosing RES (the unstandardized residuals) for the Y-axis, and PRE (the predicted values) for the X-axis • For the mean line of the residuals in the plot we go to the Chart Editor (by double-clicking the graph in the report) and in Options Y Axis Reference Line • Select the Mean option in the Reference Line tab of Properties 21 / 31
  • 76. Nonindependence of the Error variable The nonindependence of the errors means that the residuals are autocorrelated, i.e. correlated over time To detect autocorrelation we can plot the residuals over time and look for alternating or increasing patterns ª If no clear pattern appears in the plot then there is an indication that the residuals are independent of each other Alternatively, to detect lack of independence between errors without plotting them over time, we can perform the Durbin-Watson test ª where the null hypothesis is that no correlation exists, whereas the alternative hypothesis is that a correlation exists; i.e. H0 : ρ = 0, and H1 : ρ ≠ 0 we look at this test in multiple regression analysis... 22 / 31
  • 77. Nonindependence of the Error variable Nonindependence of the error variable in SPSS We now create a time variable in the EXAMPLE-DO-IT-YOUR-SELF, and then index the observations with a vector sequence Transform Compute Variable... • Index (time) variable in Target Variable , and the Miscellaneous option in Function group • Select $Casenum in Functions and Special Variables 23 / 31
  • 78. Nonindependence of the Error variable Nonindependence of the error variable in SPSS (2) After obtaining the unstandardized residuals, we plot these values... Graphs Legacy Dialogs Line... Simple • We select the Mean of the unstandardized residuals in the Line Represents option, and the time variable in Category Axis If we go to the Chart Editor we obtain the expected mean in Options Y Axis Reference Line 24 / 31
  • 79. Outliers Outliers are unusual (small or large) observations in the sample, which lie far away from the regression line These points may suggest: an error in the sampling, a recording mistake, an unusual observation ª we should disregard the observation in case of one of the first two possibilities To detect outliers: – we serve from scatter diagrams of the IV and DV with the regression line – we check the standardized residuals, where absolute values larger than 2 may suggest an outlier 25 / 31
  • 80. Outliers Detection of outliers in SPSS First we get the standardized residuals when choosing linear regression analysis Analyze Regression Linear Under the Save button we select the Standardized option in Residuals Then we obtain the absolute values of this variable • ZRE 1 in Target Variable , and choose Arithmetic in Function group • Select Abs in Functions and Special Variables and put this variable code in the parentheses 26 / 31
  • 81. Influential Observations We serve from scatter diagrams of the IV and DV with the regression line as well to evaluate the impact of influential observations ª we produce two plots, one with and another without the supposed influential obs. Optionally, to detect influential observation we can use different measures as well: Leverage describes the influence each observed value has on the fitted value for this observation ª where Mahalanobis distance is a measure of leverage of the observation Cook’s D (distance) detects dominant observations, either outliers or observations with high leverage ª an Influence plot is made of the Studentized Residuals (ei/SE) against the leverages of the observations (called ‘hat’ values) 27 / 31
  • 82. Cook’s Distance Example 16.2 in Keller 0 20 40 60 80 100 0.000.020.040.060.080.100.12 Obs. number Cook'sdistance Cook's distance 19 74 86 28 / 31
  • 83. Influence plot (example 16.2 in Keller) Areas of the circles are proportional to Cook's distances (plot of the Studentized residuals against the hat-values; observations 8 and 19 are labelled) 29 / 31
  • 84. Other aspects in Regression Diagnostics • In the validation of linear model assumptions, we can also evaluate the skewness, kurtosis in the distribution shape of the residuals... • The prediction capability of the model can be assessed by looking at the predicted SSE as well (in multiple regression we also look at the collinearity among IVs) 30 / 31
  • 85. Summary For a given explanatory variable, we differentiate the individual value of the response variable from its mean value Point estimation provides individual prediction intervals for the DV, whereas the confidence interval estimator approximates the mean of the response variable Regression diagnostics is concerned with evaluating the prediction model and the assumptions of the error variable We look at the dominant points inducing the regression line for assessing the prediction model, whereas much of the diagnostics concentrates on the characteristics of the residuals 31 / 31
  • 86. BUSINESS STATISTICS II Lecture – Week 14 Antonio Rivero Ostoic School of Business and Social Sciences 1st April 2014 AARHUS UNIVERSITYAU
  • 87. Today’s Outline Scaling and transformations Standard error of estimates and standardized values Step-by-step example with simple linear regression analysis 2 / 24
  • 88. Scaling and transformations Sometimes data transformation is needed in order to obtain e.g. a normal distribution Transformations are mathematical adjustments applied to scores in an attempt to make the distribution of the outcomes fit requirements Scaling (and re-scaling) is a linear transformation based on proportions where the scores are enlarged or reduced 3 / 24
  • 89. Data transformation In a simple linear regression analysis we can perform a transformation of both the explanatory and the response variables For example in linear regression we may need to transform the data: – when the residuals have a skewed distribution or they show heteroscedasticity – to linearize the relationship between the IV and the DV – but also when the theory suggests a transformed expression – or to simplify a multiple regression model 4 / 24
  • 90. Scaling and transformations Examples of transformations of the variable x are: – Square root: √ x – Reciprocal: 1/x – Natural log: ln(x) or log(x) – Log 10: log10(x) In linear regression we use least squares fitting ª this transformation allows the residuals to be treated as a continuous differentiable quantity 5 / 24
  • 91. Logarithmic transformations linear regression analysis
  – Linear model: no transformation; regression equation y = β0 + β1x
  – Linear-log model: x′ = log(x); regression equation y = β0 + β1 log(x)
  – Log-linear model: y′ = log(y); regression equation log(y) = β0 + β1x
  – Log-log model: x′ = log(x), y′ = log(y); regression equation log(y) = β0 + β1 log(x)
  ª log are natural logarithms with base e ≈ 2.72 ª The term ‘level’ is also used instead of ‘linear’ in logarithmic transformations 6 / 24
  • 92. Logarithmic transformations linear regression analysis Interpretation:
  – Linear: a one unit increase in x would lead to a β1 increase/decrease in y
  – Linear-log: a one percent increase in x would lead to a β1/100 increase/decrease in y
  – Log-linear: a one unit increase in x would lead to a β1 ∗ 100% increase/decrease in y
  – Log-log: a one percent increase in x would lead to a β1% increase/decrease in y
  ª In econometrics, log-log relationships are referred to as “elastic” and the coefficient of log(x) as the elasticity 7 / 24
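  A short Python sketch of how the four specifications are fitted with the same least squares machinery, only with transformed variables entering the fit; the x and y vectors are made-up positive values, used purely for illustration:

```python
import numpy as np

# hypothetical positive-valued data; x could be tenure in years, y an hourly wage
x = np.array([1.0, 2.0, 4.0, 6.0, 9.0, 12.0])
y = np.array([95.0, 110.0, 130.0, 150.0, 170.0, 200.0])

# np.polyfit(..., 1) returns (slope, intercept) for a straight-line fit
b1_lin,  b0_lin  = np.polyfit(x, y, 1)                  # linear:     y = b0 + b1*x
b1_ll,   b0_ll   = np.polyfit(np.log(x), y, 1)          # linear-log: y = b0 + b1*log(x)
b1_logl, b0_logl = np.polyfit(x, np.log(y), 1)          # log-linear: log(y) = b0 + b1*x
b1_el,   b0_el   = np.polyfit(np.log(x), np.log(y), 1)  # log-log:    log(y) = b0 + b1*log(x)

# interpretation (see the list above):
# log-linear: one extra unit of x goes with roughly b1_logl*100 % change in y
# log-log:    one percent more x goes with roughly b1_el % change in y (the elasticity)
print(b1_lin, b1_ll, b1_logl, b1_el)
```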
  • 93. Standard Error of Estimates SE = square root of the ratio of the squared differences between the criterion's predicted and observed values to the df The squared differences between the criterion's predicted and observed values correspond to the Residual SS (SSE in Anova) ª it represents the unexplained variation in the model (or model deviance) The df equals number of cases − number of predictors in the model − 1 ª in a simple linear regression model there is only one predictor, and the df equals n − 2 Thus most of the calculation for the SE of estimates corresponds to the Residual SS 8 / 24
  • 94. SE and Residual SS SSE in SPSS After having the data, to obtain the SSE we need first the predicted values of our model Analyze Regression Linear • And in Save choose the Unstandardized option in Predicted Values 9 / 24
  • 95. SE and Residual SS SSE in SPSS (2) Then we calculate by hand the residuals (yi − ˆyi) in a new variable created in the Variable View. We name this variable as RESID • Then we go to Transform Compute Variable... and place RESID in Target Variable , and make the subtraction operation with the expression: DV − PRE 1 10 / 24
  • 96. SE and Residual SS SSE in SPSS (3) The next step is to obtain the square of the residuals, and we use the recently created variable (named RESID) for this. Thus the transformation of the residual values to their squares is obtained after we place RESID in Target Variable and type in the Numeric Expression field the square of the values: RESID ∗∗ 2 11 / 24
  • 97. SE and Residual SS SSE in SPSS (4) The sum of squares of the residuals, which is the numerator of the SE, is obtained when we sum the values of this last variable Analyze Reports Report Summaries in Columns... and choose RESID for the Data Columns and select Display grand total in Options . The Residual SS or SSE is given in the Report of the Statistics Viewer as Grand Total. ª in SPSS the SE of estimates is given in Model Summary, and the SSE and df values are in the ANOVA table 12 / 24
  • 98. Standardized values Standardized values have been transformed into a customary scale Standardized Coefficient In linear regression the standardized coefficient is the product of the regression coefficient and the ratio of the standard deviations of the IV and the DV That is Beta (in SPSS) equals B ∗ (s(x)/s(y)) The standardized coefficient represents the change in the mean of the dependent variable, in y standard deviations, for a one standard deviation increase in the independent variable 13 / 24
  • 99. Standardized values Standardized Residuals In SPSS we count with various types of residuals: – RES 1 stands for unstandardized residuals – SRE 1 stands for Studentized residuals – ZRE 1 stands for standardized residuals And Keller (pp 653) tells us about the standardization of variables in general and of the residuals in particular ª subtract the mean and divide by the standard deviation 14 / 24
  • 100. Standardized residuals We get the Excel output table with the standardized residuals for Example 16.2 (Keller, pp 653) Now let us look at the SPSS results for this data... ? Hmmmmmmmmmmmmmm.... ? 15 / 24
  • 101. Standardized residuals The term ‘standardized residual’ is not a standardized term In Keller “Standardized” residuals are residuals divided by the standard error of the estimate (residual) (cf. pp 653) However in SPSS these values (cf. Excel output pp 653) correspond to the “Studentized” residuals ª (even though the definition is for the Studentized deleted residuals) In SPSS a standardized residual is the residual divided by the standard deviation of data ª Studentized residuals (another form for standardization) have a constant variance, and combine the magnitude of the residual and the measure of influence 16 / 24
  • 102. Standardized residuals speaking the same language Residuals (unstandardized) are the difference between observations and expected values: ε̂ = y − ŷ In the case of a regression model standardized residuals are normalized to a unit variance The standard deviation or the square root of the variance of the residuals corresponds to the sqrt of MSE (cf. lec. week 12) ª this is also known as the root-mean-square deviation Standardized residual = residual / √MSE 17 / 24
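  A minimal sketch of this convention in Python (illustrative only; n_params = 2 corresponds to the intercept and slope of a simple regression):

```python
import numpy as np

def standardized_residuals(y, y_hat, n_params=2):
    """Residuals divided by sqrt(MSE), as described above (illustrative sketch)."""
    resid = y - y_hat
    mse = np.sum(resid ** 2) / (len(y) - n_params)   # residual variance estimate
    return resid / np.sqrt(mse)
```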
  • 103. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] Be aware that in this case the model is chosen in advance, and we adopt a linear relationship between two variables 18 / 24
  • 104. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 1. Determine the response and the explanatory variables 2. Visualize the data through a scatter plot 3. Perform basic descriptive statistics 19 / 24
  • 105. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 4. Estimate the coefficients (intercept and slope) 5. Compute the fitted values and the residuals 6. Obtain the sum of squares for errors (Residual SS) 20 / 24
  • 106. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 7. Assess the model a) standard error of estimate b) test of the slope c) coefficient of determination 21 / 24
  • 107. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 8. Perform the regression diagnostics a) confidence regions for individual prediction intervals b) confidence regions for the average prediction interval 9. Make a residual analysis a) nonnormality, heteroskedasticity, nonindependence errors 22 / 24
  • 108. Step-by-step simple linear regression analysis EXAMPLE [Population and avg. Household Size in Global Cities] 10. Detect outliers and influential observations 11. Interpret the results 12. Draw the conclusions 23 / 24
  • 109. BUSINESS STATISTICS II Lecture – Week 15 Antonio Rivero Ostoic School of Business and Social Sciences  April  AARHUS UNIVERSITYAU
  • 110. Today’s Outline Multiple regression model • coefficients • estimation • conditions • testing • diagnostics Working example (SE estimates, and fitting the model with logarithmic transformations) 2 / 17
  • 111. Multiple regression model While a simple regression analysis has a single independent variable, in a multiple regression analysis we have several explanatory variables for the response variable A multiple regression model is represented by the equation y = β0 + β1x1 + β2x2 + · · · + βkxk + ε where y is the dependent variable, x1, x2, . . . , xk are independent variables, and ε is the error variable ª note that independent variables may be products of transformations of other variables (which are independent or not) In this case parameters β1, β2, . . . , βk are the regression coefficients, whereas β0 represents the intercept 3 / 17
  • 112. Multiple regression model It is important to note that the introduced multiple regression equation represents in this case an additive model Thus the effect of each independent variable on the response is assumed to be the same for all values of the other predictors ª certainly we need to assess whether the additive assumption is realistic or not Q. Are we still considering a linear relationship in the multiple regression model? A. Yes, whenever the model is linear in the coefficients 4 / 17
  • 113. Graphical representation Multiple regression models are graphically represented by a hyperplane with k dimensions for the IVs – for k = 2 the relationship between the IVs and the DV is represented by a regression plane within a 3D space – for k > 2 the model is represented by a regression or response surface, a hyperplane in more than two dimensions that is not conceivable for us to visualize 5 / 17
  • 114. Interpreting Coefficients In the multiple regression model β0 stands for the intersection of the regression hyperplane, and represents the mean of y when the x’s equal 0 ª it only makes sense if the range of the data includes zero βi, i = 1, . . . , k represent the change in the DV when xi changes one unit while keeping the other IVs constant When it is possible, interpret the regression coefficients as the ceteris paribus effect of their variation on the dependent variable ª i.e. “other things being equal” interpretation 6 / 17
  • 115. Estimation The estimation of the coefficients is given by the least squares equation ŷ = b0 + b1x1 + b2x2 + · · · + bkxk for k independent variables And the error variable is estimated as ei = yi − ŷi 7 / 17
  • 116. Required conditions The required conditions of the error variable assumed in a simple linear regression model remain for multiple regression analysis ª that is errors are independent, normally distributed with mean 0 and a constant σ The standard error of the estimate has less df than in the simple regression analysis ª we want SE close to zero 8 / 17
  • 117. Testing the regression model We test the validity of the model with the following hypotheses H0 : β1 = β2 = · · · = βk = 0 H1 : βi ≠ 0 for at least one i ª The model is invalid in case we fail to reject the null hypothesis, whereas whenever the alternative hypothesis is accepted then the model has some validity Since in multiple regression models we have several competing explanatory variables for a response variable, the assessment of the model is central in the analysis 9 / 17
  • 118. Testing the regression model The test of significance of the model is based on the F statistic, which means that we focus on the variation of the outcomes The F-test is the ratio of the Mean Squares of Regression and Residual F = (SSR/k) / (SSE/(n − k − 1)) = MSR / MSE Recall that SSR represents the explained variation in the model, whereas SSE is the unexplained variation ª we want a high value for SSR and a low value of SSE, since this indicates that most of the variation in the response variable is explained by the model 10 / 17
  • 119. Testing the regression model For the F-test the rejection of H0 applies when F > Fα, k, n−k−1 ª hence for a given α level we infer a difference in the regression coefficients in case the F statistic value falls within the rejection region Another way to assess the model is through the coefficient of determination or R2, whose interpretation is similar to the simple regression analysis ª we want R2 close to one 11 / 17
  • 120. Test of individual coefficients Based on the test of significance of the multiple regression model we can perform individual t tests for each regression coefficient H0 : βi = 0 H1 : βi ≠ 0 (two-tail test) The test statistic is t = (bi − βi) / sbi 12 / 17
  • 121. Test of individual coefficients And the confidence intervals are bi ± tα/2, n−k−1 · sbi for i = 1, . . . , k We reject the null hypothesis iff |t| > tα/2, n−k−1 (for a two-tailed test) 13 / 17
  • 122. Adjusted R-squared When we add explanatory variables to the multiple regression model we cannot decrease the value of the coefficient of determination ª but it is possible to get a very high R2 even when the true model is not linear Thus the adjusted R-squared is often used to summarize the multiple fit as it takes into account the number of variables in the model ª it is the coefficient of determination adjusted for df Adjusted R² = 1 − MSE / MS(Total) where MSE = SSE/(n − k − 1), and MS(Total) is the sample variance of y Adjusted R2 ≤ R2 14 / 17
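  A small Python sketch of this adjustment (illustrative only; y_hat are the fitted values and k the number of predictors in the model):

```python
import numpy as np

def adjusted_r2(y, y_hat, k):
    """Adjusted R-squared for a model with k predictors (illustrative sketch)."""
    n = len(y)
    mse = np.sum((y - y_hat) ** 2) / (n - k - 1)          # residual mean square
    ms_total = np.sum((y - y.mean()) ** 2) / (n - 1)      # sample variance of y
    return 1 - mse / ms_total                             # never larger than R-squared
```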
  • 123. Regression diagnostics: multicollinearity In addition to nonnormality and heteroskedasticity, the regression diagnostics for a multiple model checks also for multicollinearity Multicollinearity occurs when two or more independent variables are highly correlated with one another ª hence it is very difficult to separate their particular effects and influences on y It causes inflated standard errors for estimates of regression parameters and very large regression coefficients Some consequences of this inflation are: – a large variability of the samples, which causes that the sample coefficients may be far from the population parameters, and hence with wide confidence intervals – small t statistics that suggest no linear relationship between involved variables and the response variable, and such inference may be wrong 15 / 17
  • 124. Multicollinearity Multicollinearity can be avoided if one anticipates the problem from theory or past experiences ª multiple correlation scores can serve as a guide Beware that two independent variables can be highly correlated with each other (or with another predictor) but uncorrelated with the dependent variable ª they may be non-redundant suppressor variables A stepwise regression (backward and forward) can serve to minimize multicollinearity in the modelling ª these methods are based on improving the model's fit 16 / 17
  • 125. Multiple regression analysis WORKING EXAMPLE [Prediction of avg. Household Size in Global Cities] Multiple regression analysis using globalcity-multiple.sav 17 / 17
  • 126. BUSINESS STATISTICS II Lecture – Week 17 Antonio Rivero Ostoic School of Business and Social Sciences  April  AARHUS UNIVERSITYAU
  • 127. Today’s Outline Model building in multiple linear regression – predictors Comparing regression models Stepwise regression Working example – model building – model comparison Further issues (...) 2 / 16
  • 128. Model building in multiple linear regression The main goal in model building is to fit a model that explains variation of the dependent variable with a small set of predictors ª i.e. a model that efficiently forecasts the response variable of interest When dealing with multiple independent variables, each subset of x’s represents a potential model of explanation ª for k predictors in the data set there are 2^k − 1 subsets of independent variables Thus we want to establish a linear equation that predicts ‘best’ the values of y by using more than one explanatory variable Recall that to obtain a good model we need an R2 score close to 1, a small value for the SE of estimate, and a large F statistic (which implies a small SSE) 3 / 16
  • 129. Predictors There are two types of independent variables to consider, and they correspond to the numeric and the categorical variables – Factors characterize qualitative data – Covariates represent quantitative data Predictors = Factors + Covariates Sometimes an abstraction made on a numeric variable is called a factor that explains the theory in the regression model, and covariate is simply a control variable 4 / 16
  • 130. Comparing Regression Models cf. F-general in Note 2 To test whether a model fits significantly better than a simpler model In this case a restricted or reduced model is nested within an unrestricted or complete model ª i.e. one model is contained in another model The test statistic can be based on the SSE or on the R2 values for both models Fchange = ((R²U − R²R) / df1) / ((1 − R²U) / df2) where df1 = q = kU − kR (i.e. number of variable restrictions), and df2 = n − kU − 1 5 / 16
  • 131. Comparing Regression Models F-general with sum of squares On the other hand, by considering the sum of squares of the residuals, the F statistic becomes Fchange = ((SSER − SSEU) / df1) / (SSEU / df2) with the same df's as before, and we take the absolute value SPSS We need to combine in Analyze Regression Linear the two models with a different variable selection Method (Enter and Remove in Blocks 1 and 2), and check R squared change in Statistics... 6 / 16
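  A Python sketch of the nested-model comparison (illustrative only; the function name f_change and its arguments are ours):

```python
from scipy import stats

def f_change(sse_r, sse_u, k_r, k_u, n):
    """F statistic for a reduced (restricted) model nested in an unrestricted model,
    computed from their residual sums of squares (illustrative sketch)."""
    df1 = k_u - k_r              # number of restrictions
    df2 = n - k_u - 1            # residual df of the unrestricted model
    F = ((sse_r - sse_u) / df1) / (sse_u / df2)
    p = stats.f.sf(F, df1, df2)
    return F, p
```

  As a check against the SPSS output shown below for the data in Note 2, the equivalent R²-based formula with R²U = .304 (kU = 3), R²R = .167 (kR = 1) and n = 337 gives Fchange ≈ 32.8, matching the reported F Change.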
  • 132. Comparing Regression Models nested models SPSS The syntax procedure for comparing two nested models is..: REGRESSION /MISSING LISTWISE /STATISTICS COEFF OUTS R ANOVA CHANGE /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT y /METHOD=ENTER x1 x2 /METHOD=REMOVE x2. 7 / 16
  • 133. Comparing Regression Models ...that for the data in Note 2 produces this outcome for both models: Model Summary
  Model 1: R = .55 (a), R Square = .304, Adjusted R Square = .297, Std. Error of the Estimate = 67.45215; Change Statistics: R Square Change = .304, F Change = 48.426, df1 = 3, df2 = 333, Sig. F Change = .000
  Model 2: R = .41 (b), R Square = .167, Adjusted R Square = .164, Std. Error of the Estimate = 73.57910; Change Statistics: R Square Change = −.137, F Change = 32.811, df1 = 2, df2 = 333, Sig. F Change = .000
  a. Predictors: (Constant), years potential experience, years of education, years with current employer
  b. Predictors: (Constant), years of education
  ª the Fchange for Model 2 is for kU = 3 and kR = 1; this statistic is also equivalent to the F score in the analysis of variance of both models 8 / 16
  • 134. Stepwise regression Variable selection A sequential procedure to perform multiple regressions is found in the stepwise method It combines forward selection of predictors and backward elimination of the independent variables These are bottom-up and top-down processes based on F scores and predefined p values ª defaults in SPSS are 5% for IN, and 10% for OUT 9 / 16
  • 135. WORKING EXAMPLE [Average Household Size in Global Cities] Model Building (Data in globalcity-multiple.sav) 10 / 16
  • 136. Avg. household size in global cities Model assessment Model Summary
  Model 1: R = .713 (a), R Square = .508, Adjusted R Square = .506, Std. Error = 1.19059; R Square Change = .508, F Change = 239.764, df1 = 1, df2 = 232, Sig. = .000
  Model 2: R = .760 (b), R Square = .578, Adjusted R Square = .574, Std. Error = 1.10517; R Square Change = .070, F Change = 38.248, df1 = 1, df2 = 231, Sig. = .000
  Model 3: R = .787 (c), R Square = .620, Adjusted R Square = .615, Std. Error = 1.05170; R Square Change = .041, F Change = 25.087, df1 = 1, df2 = 230, Sig. = .000
  Model 4: R = .798 (d), R Square = .637, Adjusted R Square = .631, Std. Error = 1.02944; R Square Change = .018, F Change = 11.053, df1 = 1, df2 = 229, Sig. = .001
  Model 5: R = .805 (e), R Square = .648, Adjusted R Square = .641, Std. Error = 1.01542; R Square Change = .011, F Change = 7.367, df1 = 1, df2 = 228, Sig. = .007
  a. Predictors: (Constant), Household Connection to Water
  b. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person
  c. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person, Overall Child Mortality
  d. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person, Overall Child Mortality, Informal Employment
  e. Predictors: (Constant), Household Connection to Water, Average Income Q3 Person, Overall Child Mortality, Informal Employment, Percent Woman Heade of Households 11 / 16
  • 137. WORKING EXAMPLE [Average Household Size in Global Cities] Comparing nested models 12 / 16
  • 138. Avg. household size in global cities models 4 and 5 The F change in the two nested models is given in: Model Summary
  Model 1: R = .805 (a), R Square = .648, Adjusted R Square = .641, Std. Error = 1.01542; R Square Change = .648, F Change = 84.113, df1 = 5, df2 = 228, Sig. = .000
  Model 2: R = .798 (b), R Square = .637, Adjusted R Square = .631, Std. Error = 1.02944; R Square Change = −.011, F Change = 7.367, df1 = 1, df2 = 228, Sig. = .007
  a. Predictors: (Constant), Percent Woman Heade of Households, Informal Employment, Average Income Q3 Person, Overall Child Mortality, Household Connection to Water
  b. Predictors: (Constant), Informal Employment, Average Income Q3 Person, Overall Child Mortality, Household Connection to Water 13 / 16
  • 139. Avg. household size in global cities the final model?
  (Intercept): Estimate 5.4130, Std. Error 0.3705, t value 14.61, Pr(>|t|) 0.0000
  x10: Estimate −0.0191, Std. Error 0.0031, t value −6.13, Pr(>|t|) 0.0000
  x3: Estimate −0.0001, Std. Error 0.0000, t value −3.95, Pr(>|t|) 0.0001
  x5: Estimate 0.0790, Std. Error 0.0157, t value 5.04, Pr(>|t|) 0.0000
  x9: Estimate 0.0131, Std. Error 0.0041, t value 3.18, Pr(>|t|) 0.0017
  x6: Estimate −0.0104, Std. Error 0.0038, t value −2.71, Pr(>|t|) 0.0072
  And what about this other one..? y = x4 + x5 + x6 + x8 + x9 + x10 14 / 16
  • 140. Further Issues multiple regression Comparison of separate models Regression diagnostics Collinearity tests Logarithmic transformations Interpretation of results 15 / 16
  • 141. Summary Conclusions Find a parsimonious model that effectively explains y Model comparison combines evaluation of the fits and the significance of regression coefficients ª available automated procedures To compare nested models we use the F statistics ª working example, and data in note 2 WORKING EXAMPLE: “It seems that the inclusion of the ratio of woman head of households improves the model, but does it contribute to explain the change in the average of the household size in the global cities?” 16 / 16
  • 142. BUSINESS STATISTICS II Lecture – Week 18 Antonio Rivero Ostoic School of Business and Social Sciences  April  AARHUS UNIVERSITYAU
  • 143. Today’s Outline Polynomial regression models Regression models with interaction Comparing models (note 3) Dummy variables 2 / 20
  • 144. Polynomial regression Polynomial regression is a particular case of a regression model that produces a curvilinear relationship between response and predictor Recall that simple regression equations represent first-order models y = β0 + β1x + ε Here the order of the equation p equals 1 and the relation between the predictor and the response is depicted by a regression line ª the model has a ‘degree 1 polynomial’ We can have regression equations with several independent variables that are polynomial models while still having just one predictor variable Remember that when the parameters in the equation are linearly related, then the polynomial regression model is considered linear 3 / 20
  • 145. First order and polynomial regression models • First order model with two predictors: x1 and x2 y = β0 + β1x1 + β2x2 + ε • First order model with k predictors: x1, . . . , xk y = β0 + β1x1 + β2x2 + · · · + βkxk + ε • Polynomial model with one predictor variable x and order p y = β0 + β1x + β2x² + · · · + βpx^p + ε ª thus a predictor variable can have various orders or powers 4 / 20
  • 146. Second-order models • A second-order (polynomial) model with a single predictor variable has p = 2 and the equation represents a quadratic response function depicted by a parabola ª a ‘degree 2 polynomial’ or quadratic polynomial y = β0 + β1x + β2x² + ε β1 controls for the translation parameter of the parabola, and β2 for its curvature rate 5 / 20
  • 147. Quadratic effect of the regression coefficient second-order model with β2x² (plots of y against x: a convex parabola for β2 = 1 and a concave parabola for β2 = −1) 6 / 20
  • 148. Third-order models • A third-order (polynomial) model with a single predictor variable has p = 3 and the equation represents a cubic response function depicted as a sigmoid curve ª a ‘degree 3 polynomial’ y = β0 + β1x + β2x² + β3x³ + ε there are three regression coefficients that control for two curvatures 7 / 20
  • 149. Cubic effect of the regression coefficients third-order model (plots of y against x showing the two possible curvatures, one for a positive β3 and one for a negative β3) 8 / 20
  • 150. Higher-order models and several predictor variables Models with order higher than 3 are seldom used in regression analysis ª typically because of the overfitting in the model and the poor prediction power However, so far we have seen multiple regression equations involving several predictors that are related in an additive model ª that is, the effect of each IV was not influenced by the other variables As illustration, consider a monomial model with two predictors (from the WORKING EXAMPLE) y = 5.47 − .03 x10 + .02 x9 (avg. household size as a function of access to water and informal employment) for x9 = 1 then ŷ = 5.49 − .03 x10 for x9 = 50 then ŷ = 6.47 − .03 x10 for x9 = 99 then ŷ = 7.45 − .03 x10 9 / 20
  • 151. Additive model with 2 predictors (plot of the fitted line ŷ = 5.49 − 0.03x) 10 / 20
  • 152. Additive model with 2 predictors (the plot adds the parallel fitted line ŷ = 6.47 − 0.03x) 11 / 20
  • 153. Additive model with 2 predictors (the plot adds the parallel fitted line ŷ = 7.45 − 0.03x) 12 / 20
  • 154. Comparing models Note 3 Four models: (1) first order; (2) second order; (3) linear-log; (4) log-linear a) The t test is used to compare models (1) and (2) ª since (1) is the reduced version of (2) we can use the Fchange score for nested models, where t = √F b) Models (1) and (3) are not nested; we choose the one with the better fit c) Models (2) and (3) are not nested either, and we rely on the adjusted R² since they have a different number of predictors (performances are almost identical here...) d) Comparing a log-linear model with an untransformed response requires another approach and it is out of the scope... 13 / 20
  • 155. Regression models with interaction Many times the effect of a certain explanatory variable on the response is affected by the value of another predictor of the model In such cases there is an interaction between the two predictors, and the influence of these variables on y does not operate in a simple additive pattern A first order model with interaction: y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε where the effect of x1 on the response is influenced by x2 and vice-versa An interaction exists in the regression model when the effect of one predictor varies with the value of another predictor ª not easy to interpret 14 / 20
  • 156. Example A model with the two predictors and their interaction from the WORKING EXAMPLE y = 6.58 − .04 x10 + .00 x9 + .00 x10 x9 produces no interaction because in the model b3 equals zero ª this may be explained by the high correlation between y and x9 15 / 20
  • 157. Estimating multiple regression with interaction An important concern with multiple regression is that lower order variables are highly correlated with their interactions Centering and standardization of the predictors correct this problem ª Centering implies re-scaling the predictors by subtracting the mean from each observation, and by dividing the centered scores by the standard deviation of the variable we standardize the predictors Model with interaction from the WORKING EXAMPLE with standardized values y = 1.11 − .50 x10 + .35 x9 + .16 x10 x9 for x9 = 1 then ŷ = 1.46 − .34 x10 for x9 = 2 then ŷ = 1.81 − .18 x10 which means that the fitted lines are not parallel as with the additive model 16 / 20
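  A small Python demonstration of why centering/standardizing helps: with made-up, unrelated predictors, the raw interaction term is strongly correlated with its components while the standardized interaction is not (values and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(10, 100, size=200)               # made-up predictor values
x2 = rng.uniform(0, 50, size=200)

inter_raw = x1 * x2                               # raw interaction term
z1 = (x1 - x1.mean()) / x1.std(ddof=1)            # standardized predictors
z2 = (x2 - x2.mean()) / x2.std(ddof=1)
inter_std = z1 * z2                               # interaction of standardized predictors

# correlation of each predictor with its interaction: large for the raw term,
# close to zero after standardization
print(np.corrcoef(x1, inter_raw)[0, 1], np.corrcoef(z1, inter_std)[0, 1])
```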
  • 158. Higher order models with interaction Higher order models with interaction produce quadratic, cubic (W, M or other shape) relationships between the response and each of the predictors Model with a quadratic relationship and interaction y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε will produce parabolas with crossing trajectories... 17 / 20
  • 159. Regression with dummy variables Until now we have been doing regression analysis using interval scales of the data only However in many cases we may have qualitative data that are represented by a nominal scale, and treating this type of data as interval brings misleading results We can perform regression analysis by using dummy or indicator variables, which are artificial variables that encode whether or not an observation belongs to a certain group or category ª code 1 for belonging, and code 0 otherwise Indicator or dummy variables are just for classification purposes and the magnitude used is not applicable in this context 18 / 20
  • 160. Regression with dummy variables For 3 categories we use 2 indicator variables:
  – Category 1: I1 = 1, I2 = 0
  – Category 2: I1 = 0, I2 = 1
  – Category 3: I1 = 0, I2 = 0
  For 4 categories we use 3 indicator variables...
  – Category 1: I1 = 1, I2 = 0, I3 = 0
  – Category 2: I1 = 0, I2 = 1, I3 = 0
  – Category 3: I1 = 0, I2 = 0, I3 = 1
  – Category 4: I1 = 0, I2 = 0, I3 = 0
  How many dummies are required for a variable having two categories? 19 / 20
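  Outside SPSS, the same coding can be sketched in Python with pandas (the colour values below are made up for illustration):

```python
import pandas as pd

# hypothetical nominal variable with three categories
colour = pd.Series(["white", "silver", "other", "white", "other"])

# m - 1 indicator variables: drop_first leaves out one reference category
dummies = pd.get_dummies(colour, prefix="I", drop_first=True)
print(dummies)
# a variable with only two categories needs a single dummy
```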
  • 161. Dummies with command-line We need to create a number of dummy variables according to the existing number of categories. Syntax in SPSS: RECODE varlist_1 (oldvalue=newvalue) ... (oldvalue=newvalue) [INTO varlist_2]. [/varlist_n]. EXECUTE. 20 / 20
  • 162. BUSINESS STATISTICS II Lecture – Week 19 Antonio Rivero Ostoic School of Business and Social Sciences  May  AARHUS UNIVERSITYAU
  • 163. Today’s Outline Qualitative Variables Regression Models: Testing and Interpreting Results • indicators • multiple • interaction • (polynomial) • logarithmic transformations 2 / 24
  • 164. Qualitative independent variables The effects of qualitative information on a response variable may be an important result, and we need ways to include this type of data in a regression model Qualitative information corresponds to a nominal scale that may require a pre-coding of the data into artificial variables known as dummies or indicator variables Recall that a nominal scale includes different categories or groups that serve to classify the observations, and qualitative predictors are factors A dichotomous factor has two categories (e.g. gender), whereas a polytomous factor has more categories (e.g. seasons) 3 / 24
  • 165. Indicator variables (dummies) Indicator variables have only two values, typically 1 and 0, and for m categories in the variable, we require m − 1 indicator variables ª this means that there is an omitted category in the representation to avoid redundancy Ii = 1 if the observation belongs to category ci, and 0 otherwise The omitted category represents the baseline or ‘reference’ category to which we compare the other groups ª the decision to choose the omitted category is arbitrary, and it leads to the same conclusion If we do not omit one category and include indicator variables for all categories in the regression model, then there is a perfect multicollinearity among these independent variables ª a phenomenon known as the dummy variable trap 4 / 24
  • 166. Dataset for Notes 3 and 4 training data337.sav Dependent variable: Wage, average hourly earnings (DKK) Independent variables: Educ, education (years) Tenur, current employment (years) Exper, potential experience (years) Female, gender (0: male, 1: female) (Male, gender (0: female, 1: male)) 5 / 24
  • 167. Simple regression with an indicator variable (dichotomous factor) “The gender wage gap” Are women paid less than men according to the data? Wage = β0 + β1 Female + ε
  (Intercept): Estimate 161.9242, Std. Error 5.5013, t value 29.43, Pr(>|t|) 0.0000
  Female: Estimate −62.8700, Std. Error 8.1117, t value −7.75, Pr(>|t|) 0.0000
  Women earn 62.87 DKK per hour less than men 6 / 24
  • 168. Simple regression with an indicator variable II (dichotomous factor) For a variable Male = 1 − Female, and the model: Wage = β0 + β1 Male + ε we get the following results:
  (Intercept): Estimate 99.0542, Std. Error 5.9612, t value 16.62, Pr(>|t|) 0.0000
  Male: Estimate 62.8700, Std. Error 8.1117, t value 7.75, Pr(>|t|) 0.0000
  Likewise men earn 62.87 DKK per hour more than women 7 / 24
  • 169. The dummy variable trap What about this model?: Wage = β0 + β1Female + β2Male + ε In this case there is a duplicated category and the independent variables are perfectly multicollinear Male is an exact linear function of Female and of the intercept ª Male = 1 − Female implies that Male + Female = 1 8 / 24
  • 170. Multiple Regression with a dichotomous indicator variable (factor and covariates) An additive dummy-regression model: Wage = β0 + β1Female + β2Educ + β3Tenure + ε • (We already know that the model fit or R2 never decreases when we add new independent variables to the model) • The model now assumes that – besides gender – there is an effect of education and tenure on the wage levels • Since the model is additive the predictors are independent of each other, and the regression equation fits identical slopes for all the categories in gender and for the other predictors as well ª which implies parallel regression lines in the scatterplot 9 / 24
  • 171. Testing partial coefficients For model: Wage = β0 + β1Female + β2Educ + β3Tenure + ε Test the partial effect of gender: H0 : β1 = 0 H1 : β1 ≠ 0 Test the partial effect of education: H0 : β2 = 0 H1 : β2 ≠ 0 Test the partial effect of tenure: H0 : β3 = 0 H1 : β3 ≠ 0 10 / 24
  • 172. Testing partial coefficients The t-test is the coefficient divided by the SE of the estimate ti = (bi − βi) / sbi
  (Intercept): Estimate −49.2529, Std. Error 20.5869, t value −2.39, Pr(>|t|) 0.0173
  Female: Estimate −46.7547, Std. Error 7.1544, t value −6.54, Pr(>|t|) 0.0000
  Education: Estimate 13.9233, Std. Error 1.4564, t value 9.56, Pr(>|t|) 0.0000
  Tenure: Estimate 3.2485, Std. Error 0.4729, t value 6.87, Pr(>|t|) 0.0000 11 / 24
  • 173. Fitted values by gender: Additive model Wage = β0 + β1Female + β2Educ + β3Tenure + ε (scatter plot of the unstandardized predicted values against years of education, with one fit line for the total; R² Linear = 0.435) 12 / 24
  • 174. Fitted values by gender: Additive model Wage = β0 + β1Female + β2Educ + β3Tenure + ε (the same scatter plot with separate fit lines by gender; Male: R² Linear = 0.575, Female: R² Linear = 0.715) 13 / 24
  • 175. Multiple regression with interaction: factor and covariate (indicator variable and continuous variable) • Many times the additive models are unrealistic, and theory suggests different slopes for different categories • To capture such difference in slopes we assume statistical interaction among independent variables Wage = β0 + β1Female + β2Educ + β3(Female × Educ) + ε
  (Intercept): Estimate −18.1088, Std. Error 26.1498, t value −0.69, Pr(>|t|) 0.4891
  Female: Estimate −23.9223, Std. Error 41.7171, t value −0.57, Pr(>|t|) 0.5667
  Educ: Estimate 13.7154, Std. Error 1.9550, t value 7.02, Pr(>|t|) 0.0000
  Female × Educ: Estimate −2.6485, Std. Error 3.1844, t value −0.83, Pr(>|t|) 0.4062
  ª The effect of gender on wage is influenced by education and vice-versa (not significant) 14 / 24
  • 176. Fitted values by gender: Interaction model Wage = β0 + β1Female + β2Educ + β3(Female × Educ) + ε (scatter plot of the unstandardized predicted values against years of education with separate fit lines by gender; Male: R² Linear = 1, Female: R² Linear = 1) 15 / 24
  • 177. Testing interaction We can test for interaction in the model Wage = β0 + β1Female + β2Educ + β3(Female × Educ) + ε The null hypothesis is that there is no interaction in the model, i.e. H0 : β3 = 0 H1 : β3 ≠ 0 We apply now the F-general (or F incremental) statistic... Fchange = ((R²U − R²R) / df1) / ((1 − R²U) / df2) where df1 = q = kU − kR (i.e. number of variable restrictions), and df2 = n − kU − 1 ª In this case the complete or unrestricted model has the statistical interaction term whereas the reduced model does not have this term 16 / 24
  • 178. Testing interaction In an additive dummy-regression model it is possible to test for effect of categorical variable on the response controlling for a quantitative predictor, and vice-versa ( i.e. test for effect of a covariate controlling for factor) e.g. test gender on wage controlling for education, and test education controlling for gender In such cases the null hypothesis is that the coefficient of the variable to be tested equals zero 17 / 24
  • 179. Multiple Regression with a polytomous indicator variable Data from Keller xm16-02.sav A polytomous indicator variable has more than two categories: Price = β0 + β1Odometer + β2Color + ε I1 = 1 if the colour is white, 0 otherwise I2 = 1 if the colour is silver, 0 otherwise • The reference category is ‘all other colours’ that is represented whenever I1 = I2 = 0 18 / 24
  • 180. Multiple Regression with a polytomous indicator variable • In a multiple regression with a polytomous indicator variable we obtain coefficients for each group except for the reference category
  (Intercept): Estimate 16.8372, Std. Error 0.1971, t value 85.42, Pr(>|t|) 0.0000
  Odometer: Estimate −0.0591, Std. Error 0.0051, t value −11.67, Pr(>|t|) 0.0000
  White: Estimate 0.0911, Std. Error 0.0729, t value 1.25, Pr(>|t|) 0.2143
  Silver: Estimate 0.3304, Std. Error 0.0816, t value 4.05, Pr(>|t|) 0.0001
  • The t-test is adequate for the covariate (i.e. odometer), but for color we prefer to test the two indicator variables simultaneously, and this is because the choice of the reference category is arbitrary ª the F test allows us to do this • Part of the interpretation of the results assumes that one or more indicator variables equal 0 19 / 24
  • 181. EXERCISE: MULTIPLE REGRESSION WITH A POLYTOMOUS INDICATOR VARIABLE [MBA data from Keller xm18-00.sav] 20 / 24
  • 182. Interpreting Results Recall that the interpretation in regression analysis is on average, it considers the units of measure of the involved variables, and in additive models is by holding constant the values of the other variables (including the error) In regression with indicator variables the coefficients corresponding to these variables represent a variation on the response with respect to the other groups in the model The statistical significance of the regression coefficients comes after the interpretation of their effects on the response and not alone The conclusions should account for the values of the regression coefficients and the statistical significance of these outcomes 21 / 24
  • 183. Interpreting logarithmic transformations log is a natural logarithm base e In Note 3 models (3) and (4) have logarithmic transformations on variables, and we will see how to interpret the results in these models Model (3), level-log Wage = β0 + β1Educ + β2 log(Tenure) + ε In this linear-log model, a one percent increase in years of experience (tenure) leads to a β2/100 unit change in wage unit ∆Wage / % ∆Tenure = b2/100 ª Since b2 = 31.32, then – holding education constant – a one percent change in tenure is associated with a 0.3132 DKK increase in hourly wage on average 22 / 24
  • 184. Interpreting logarithmic transformations Model (4), log-level log(Wage) = β0 + β1Educ + β2Tenure + ε This is a log-linear model where a one unit increase in the predictor leads to a bi ∗ 100% change in wage % ∆Wage / unit ∆xi = bi ∗ 100 • Holding education constant, a one year increase with the current employer is associated with a 2.5% increase in wage per hour on average • Holding tenure constant, 1 year more of education is associated with a 10.4% increase in wage per hour on average 23 / 24
  • 185. Interpreting logarithmic transformations log-log models are interpreted as elasticity i.e. the ratio of the percent change in one variable to the percent change in another variable % ∆y / % ∆xi • One percent change in xi is associated with bi% change in y (ceteris paribus) partial elasticity when we hold constant other variables 24 / 24
  • 186. BUSINESS STATISTICS II Lecture – Week 20 Antonio Rivero Ostoic School of Business and Social Sciences  May  AARHUS UNIVERSITYAU
  • 187. Today’s Outline Exam 2013 • Comparing groups (Q1) • Regression analysis (Q3 and Q6) 2 / 27
  • 188. Basic terminology data... quantitative = interval qualitative = categorical, nominal for regression... Dependent variable, y = response variable; prediction; predicted variable, ˆy Independent variable, x = explanatory variable; predictor; factor (qualitative), covariate (quantitative) 3 / 27
  • 189. Exam 2013 The exam 2013 had 8 questions, and some were based on a single data set The data set contained 13 labor market related variables (though one transformed) among 762 observations from men and women ª however not all variables were needed to answer the questions After you read carefully the instructions, check the data with the software, and put labels on the variables with the provided descriptions and units of measure (if specified) 4 / 27
  • 190. Comparing groups Q1a) Do wages differ by gender? • Implied variables: Wage K, and gender B • Groups to compare: Wages for men and wages for women Plot data in SPSS ª Plot histogram for wage grouped by gender (B) Graphs Legacy Plots Histogram where the variable is paneled by the two groups (optional normal curve)... 5 / 27
  • 191. Comparing groups Q1a) Do wages differ by gender? We compare the means of these two groups through the t-test However we need to see first whether these groups have equal variances or not ª to know whether to use the pooled or the unpooled version of the t test Thus we perform the F test for equality of variances first Obtain basic descriptive statistics in SPSS Analyze Reports Case Summaries... where the variable is paneled by the two groups... ª uncheck Display Cases and choose statistics 6 / 27
• 192. Review: F test and sample variance
H0: σ1²/σ2² = 1
H1: σ1²/σ2² ≠ 1
F = (s1²/σ1²) / (s2²/σ2²) = s1²/s2², with v1 = n1 − 1 and v2 = n2 − 1 degrees of freedom
where, for 1, 2, ..., n observations, the sample variance is
s² = Σ(xi − x̄)² / (n − 1)
7 / 27
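As a minimal sketch, the F ratio and its critical values can be computed in Python instead of the Excel calculator; the variances and group sizes below are placeholders, not the exam figures.

```python
from scipy import stats

s1_sq, s2_sq = 180.0, 120.0     # hypothetical sample variances of the two wage groups
n1, n2 = 380, 382               # hypothetical group sizes

F = s1_sq / s2_sq                                   # F statistic for H0: sigma1^2 / sigma2^2 = 1
v1, v2 = n1 - 1, n2 - 1
crit = stats.f.ppf([0.025, 0.975], v1, v2)          # two-sided critical values at alpha = .05
p_value = 2 * min(stats.f.cdf(F, v1, v2), stats.f.sf(F, v1, v2))
print(F, crit, p_value)                             # reject H0 if F falls outside the critical values
```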
• 193. Review: t test and sample mean
independent samples and H0: µ1 = µ2
pooled: t = [(x̄1 − x̄2) − (µ1 − µ2)] / sqrt( sp² (1/n1 + 1/n2) )
where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
unpooled: t = [(x̄1 − x̄2) − (µ1 − µ2)] / sqrt( s1²/n1 + s2²/n2 )
v = n1 + n2 − 2 when σ1² = σ2²
v = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ] when σ1² ≠ σ2²
where, for 1, 2, ..., n observations, the sample mean is x̄ = Σ xi / n
8 / 27
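Both versions of the t test are available in scipy through the `equal_var` switch; the two samples below are simulated stand-ins for the exam's wage data, so the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
wages_men = rng.normal(190, 14, 380)      # simulated stand-in samples
wages_women = rng.normal(180, 11, 382)

pooled = stats.ttest_ind(wages_men, wages_women, equal_var=True)    # pooled t test
welch = stats.ttest_ind(wages_men, wages_women, equal_var=False)    # unpooled (Welch) t test
print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```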
• 194. F test to wages by gender
After obtaining the F statistic, we check the critical values with the respective degrees of freedom and the standard alpha value
ª use the Excel calculator and/or the table for the F-distribution
In this case the F ratio falls within the critical region, which means that we reject H0 of equal variances, i.e. H0: F ratio = 1
ª the p-value indicates that the result is statistically significant
Both outcomes suggest that there is evidence to infer that the ratio of variances differs from 1
We now know that we can proceed with the analysis applying the unpooled t test
9 / 27
• 195. t test to wages by gender
Q1a) Do wages differ by gender?
Although in this part the calculations are written by hand, you can compare your results with the output from SPSS
t test in SPSS
Analyze ▸ Compare Means ▸ Independent-Samples T Test... where the test variable K is grouped by the two groups in B
ª in Define Groups... we enter the values 0 and 1 that characterize the gender variable
Confidence intervals are also given in the table of the t test for independent samples...
10 / 27
• 196. Comparing groups
Q1b) Find a 95% confidence interval for tenure by gender
• Implied variables: Tenure G, and gender B
• Groups to compare: Tenure for men and for women
In this case we use the pooled t test with its confidence interval estimator
11 / 27
• 197. Review: Confidence intervals for t test
pooled
Confidence interval estimator of µ1 − µ2 when σ1² = σ2²:
(x̄1 − x̄2) ± tα/2 · sqrt( sp² (1/n1 + 1/n2) )
for v = n1 + n2 − 2
12 / 27
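A short sketch of this pooled interval computed from summary statistics; the means, variances, and group sizes are placeholders rather than the exam's tenure figures.

```python
import math
from scipy import stats

xbar1, xbar2 = 9.5, 6.1        # placeholder sample means of tenure
s1_sq, s2_sq = 40.0, 30.0      # placeholder sample variances
n1, n2 = 380, 382

sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)   # pooled variance
t_mult = stats.t.ppf(0.975, n1 + n2 - 2)                        # t_{alpha/2} with v = n1 + n2 - 2
half_width = t_mult * math.sqrt(sp_sq * (1 / n1 + 1 / n2))
diff = xbar1 - xbar2
print(diff - half_width, diff + half_width)                     # 95% CI for mu1 - mu2
```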
• 198. Comparing groups
Q1c) Find a 95% CI by gender with 15 years of education
• Implied variables: Education I, and gender B
• Groups to compare: Men and women, 15 yrs. of educ.
In this case the difference is between population proportions
13 / 27
• 199. Review: Confidence Interval of p1 − p2
(p̂1 − p̂2) ± zα/2 · sqrt( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
for unequal proportions, and n1p̂1, n1(1 − p̂1), n2p̂2, and n2(1 − p̂2) ≥ 5
For the number of successes in the two samples, x1 and x2:
p̂1 = x1/n1 and p̂2 = x2/n2
14 / 27
• 200. Test of proportion p1 − p2
As when we compared means, the calculations for proportions should be made manually. However, to find x1, x2, and n1, n2 you can use SPSS
Proportion of successes in SPSS
Analyze ▸ Descriptive Statistics ▸ Crosstabs... where I is contrasted by B
- Alternatively, you can create a new indicator variable, say PL15 = 1 iff I ≥ 3, with Transform ▸ Recode into Different Variables, and in Old and New Values... recode the Range category 3 through 4 to 1, and 0 otherwise, after naming the new variable
Then get a report with Analyze ▸ Reports ▸ Case Summaries... where PL15 is the Variable grouped by B, specifying Number of Cases and Sum in Statistics
15 / 27
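For comparison only, the same recoding and report can be sketched with pandas; the column names and the cut point (categories 3 through 4 recoded to 1, following the slide) are assumptions about the data set, and the data frame is a toy stand-in.

```python
import pandas as pd

# Toy data frame standing in for the exam data (I = education category, B = gender)
df = pd.DataFrame({"I": [1, 3, 4, 2, 3, 4, 1, 2],
                   "B": [0, 0, 1, 1, 0, 1, 1, 0]})

df["PL15"] = (df["I"] >= 3).astype(int)                   # indicator: 1 if category 3 or 4
report = df.groupby("B")["PL15"].agg(n="count", x="sum")  # group size and number of successes
print(report)
```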
• 201. Test of proportion p1 − p2
By combining the two categorical variables, we obtain the sample estimates of the proportions for men and for women
ª and we are able to proceed with the arithmetic calculations
For a 95% confidence interval, the multiplier zα/2 is 1.96
ª the score comes from the z table in Keller for 1 − (.05/2)
16 / 27
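With the counts and group sizes in hand, the interval itself is a few lines of arithmetic; the counts below are placeholders for the numbers read off the SPSS report.

```python
import math

x1, n1 = 150, 380     # placeholder successes and group size for men
x2, n2 = 200, 382     # placeholder successes and group size for women

p1, p2 = x1 / n1, x2 / n2
z = 1.96                                                        # z_{alpha/2} for 95% confidence
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print((p1 - p2) - z * se, (p1 - p2) + z * se)                   # 95% CI for p1 - p2
```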
• 202. Comparing groups
Summary comment
Q1a. Men earn significantly more than women
Q1b. With 95% confidence, the difference lies in the interval of 2.8 to 4 years: men have more years of market experience than women
Q1c. With 95% confidence, the difference lies in the interval of 10 to 24 percentage points: men have a lower schooling level than women
• Relate the implied variables of wage, tenure, and education level for both groups
• Explain why the differences might occur in that way...
ª possibly using other variables from the data
17 / 27
• 203. Regression analysis
Q4a) Estimation and regression diagnostics for an additive log-linear regression model
Dependent variable: M, the natural logarithm of K, hourly wage
Independent variables:
B, gender (male = 1, female = 0)
C, education (years)
G, market experience or tenure (years)
18 / 27
• 204. Regression analysis
Q4. where log = ln
The regression equation represents a log-level model:
ln(Wage) = β0 + β1 Male + β2 Educ + β3 Tenure + ε

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     4.4353       0.0594     74.65     0.0000
Male            0.1268       0.0132      9.63     0.0000
Educ            0.0431       0.0031     13.79     0.0000
Tenure          0.0089       0.0019      4.69     0.0000
19 / 27
• 205. Model diagnostics
Multiple regression
After performing the linear regression analysis...
Check the assumption ε | x ∼ N(0, σ²) and evaluate multicollinearity by:
• looking at the correlations among the variables
• viewing the histogram of the standardized residuals for the model
• plotting the residuals against the predicted values
20 / 27
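A sketch of these checks with statsmodels, as an alternative to the SPSS dialogs; the simulated data frame and the column names (M, B, C, G) are assumptions standing in for the exam data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)                       # toy data standing in for the exam set
df = pd.DataFrame({"B": rng.integers(0, 2, 762),
                   "C": rng.normal(13, 2, 762),
                   "G": rng.normal(8, 5, 762)})
df["M"] = 4.4 + 0.13 * df.B + 0.04 * df.C + 0.01 * df.G + rng.normal(0, 0.3, 762)

X = sm.add_constant(df[["B", "C", "G"]])
fit = sm.OLS(df["M"], X).fit()

print(df[["B", "C", "G"]].corr())                    # correlations among the predictors
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(vifs)                                          # multicollinearity check
std_resid = fit.resid / np.sqrt(fit.mse_resid)       # standardized residuals for a histogram
print(std_resid.describe())                          # in practice, also plot against fit.fittedvalues
```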
• 206. Regression results
Q4b) Interpretation of the estimation results
The fitted model is ŷ = 4.435 + .127 · B + .043 · C + .009 · G
This means that men earn 12.7% more than women, and that wages rise by 4.3% and by almost 1% for an extra year of education and market experience, respectively
ª interpretation is ceteris paribus, or all other things being equal
Thus we interpret individual outcomes in the log-level model as the percentage change in y per unit change in xi
21 / 27
• 207. Regression results
Q4c. However, sometimes we need fitted values in the units of measure of the untransformed response, given a set of values of the IVs
ª e.g., What wage is expected for a man or a woman with 12 years of education and 15 years of experience?
In this case we apply the exponential function to both sides of the regression equation:
e^ln(ŷ) = e^(b0 + b1x1 + b2x2 + b3x3), where e ≈ 2.718282
For this model it means that we obtain the value of K rather than the value of ln(K)
22 / 27
• 208. Regression results
Q4c. The fitted value for a man with 12 years of education and 15 years of market experience is
ŷ = 4.435 + .127 · 1 + .043 · 12 + .009 · 15 = 5.213
and the expected hourly wage is e^5.213 = 183.64 DKK
On the other hand, for a woman with the same level of education and experience the fitted value is
ŷ = 4.435 + .127 · 0 + .043 · 12 + .009 · 15 = 5.086
and the expected hourly wage is e^5.086 = 161.69 DKK
23 / 27
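The back-transformation in Q4c can be verified with a few lines of Python, reusing the coefficients reported on the previous slides; small differences from the slide values are only rounding.

```python
import math

b0, b_male, b_educ, b_tenure = 4.435, 0.127, 0.043, 0.009   # coefficients from the fitted model

def expected_wage(male, educ, tenure):
    """Fitted ln(wage) turned back into hourly wage (DKK) via the exponential function."""
    log_wage = b0 + b_male * male + b_educ * educ + b_tenure * tenure
    return math.exp(log_wage)

print(expected_wage(1, 12, 15))   # about 183.6 DKK for a man
print(expected_wage(0, 12, 15))   # about 161.7 DKK for a woman
```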
• 209. Prediction interval
Manual regression analysis
Q6) Construct a 95% prediction interval for y given x = 30
The fitted line for n = 100 and R² = .755 is ŷ = 6.92 + .237x
Some descriptives for location and dispersion are:
• x̄ = 13 and s²x = 121
• ȳ = 10 and s²y = 9
And the ANOVA table shows:
• SSR = 672.61, df = 1
• SSE = 218.39, df = 98
24 / 27
• 210. Prediction interval
Q6. The prediction interval for ŷ* | x* = 30 is
ŷ*(x*=30) ± tα/2, n−k−1 · sqrt( MSE · [ 1 + 1/n + (x* − x̄)² / ((n − 1) · s²x) ] )
where MSE = s² = SSE / (n − k − 1)
ª In this case, the left-hand part of the prediction interval estimate, ŷ*, is the product of the regression coefficients with the values (1, x*), i.e. ŷ* = b0 + b1 x*, with k = 1
ª the multiplier can be obtained from the MS Excel calculator for the t distribution
25 / 27
• 211. Prediction interval
Q6. without the ANOVA table
It is also possible to calculate SSE from the sample variances and R²:
SSE = (n − 1) · [ s²y − s²xy / s²x ]
where the squared covariance s²xy = R² · s²x · s²y (this follows from R² = s²xy / (s²x · s²y))
Or alternatively:
SSE = SSy · (1 − R²), where SSy = (n − 1) · s²y
Thus there are various possibilities for the calculations...
26 / 27
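As a final cross-check, the whole Q6 prediction interval can be reproduced in a short script from the numbers given on the previous slides (n = 100, the fitted line, x̄ = 13, s²x = 121, s²y = 9, R² = .755), computing SSE without the ANOVA table.

```python
import math
from scipy import stats

n, k = 100, 1
x_bar, s2_x = 13.0, 121.0
s2_y, R2 = 9.0, 0.755
x_star = 30.0

y_hat = 6.92 + 0.237 * x_star                    # point prediction from the fitted line
SSE = (n - 1) * s2_y * (1 - R2)                  # about 218.3, matching the ANOVA table
MSE = SSE / (n - k - 1)
t_mult = stats.t.ppf(0.975, n - k - 1)           # t_{alpha/2, n-k-1}
half_width = t_mult * math.sqrt(MSE * (1 + 1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s2_x)))
print(y_hat - half_width, y_hat + half_width)    # roughly (11.0, 17.0)
```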