1. BUSINESS STATISTICS II
PART II: Lectures Weeks 11 – 19
Antonio Rivero Ostoic
School of Business and Social Sciences
March – May
AARHUS UNIVERSITY
2. BUSINESS STATISTICS II
Lecture – Week 11
Antonio Rivero Ostoic
School of Business and Social Sciences
March
4. Introduction
Galton (Darwin’s half-cousin) found in his observations that:
– For short fathers, on average the son will be taller than his father
– For tall fathers, on average the son will be shorter than his father
Then he characterized these results with the notion of the
“regression to the mean”
Pearson and Lee took Galton’s law about the relationship between
heights of children and parents, and came up with the regression
line:
son’s height = 33.73 + .516 × father’s height
ª This equation shows that for each additional inch of father’s height, the
son’s height increases on average by .516 inches
3 / 28
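As a quick numerical reading of the Pearson–Lee line, a minimal Python sketch; the 72-inch father is a made-up illustration, not a value from the slides:

```python
# Plug a father's height into the Pearson-Lee regression line.
def predict_son_height(father_height_in: float) -> float:
    """son's height = 33.73 + .516 * father's height (inches)."""
    return 33.73 + 0.516 * father_height_in

tall_father = 72.0
print(round(predict_son_height(tall_father), 2))  # -> 70.88
```

A 72-inch (taller-than-average) father is predicted a 70.88-inch son: regression to the mean in action.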
5. Regression Analysis
Regression analysis is used to predict one variable on the basis of
other variables
ª i.e., for forecasting
It relies on a model that describes the relationship between the
variable to be estimated and the variables that influence it
– Response variable is called dependent variable, y
– Explanatory variables are called independent variables,
x1, x2, . . . , xk
Correlation analysis serves to determine whether a relationship
exists or not between variables
Does regression imply causation?
4 / 28
6. Model
A model comprises mathematical equations that describe the
nature of the relationship between the DV and the IVs
Example for a deterministic model:
F = P(1 + i)ⁿ
where
F = future value of an investment
P = present value
i = interest rate per period
n = number of periods
ª In this case we determine F from the values on the equation’s right-hand side
5 / 28
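The deterministic model above can be evaluated directly; a short sketch with made-up principal, rate, and number of periods:

```python
# The deterministic model F = P(1 + i)^n: F is fully determined by the inputs.
def future_value(P: float, i: float, n: int) -> float:
    """Future value of P invested at rate i per period for n periods."""
    return P * (1 + i) ** n

# e.g. 1000 invested at 5% per period for 10 periods (illustrative values)
print(round(future_value(1000, 0.05, 10), 2))  # -> 1628.89
```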
7. Probabilistic model
However, deterministic models can sometimes be unrealistic,
since other variables that are unknown or unmeasurable
can influence the dependent variable
Such variables represent real-life uncertainty, and this
uncertainty should be included in the model
In this case we rather use a probabilistic model in order to
incorporate such randomness
A probabilistic model thus incorporates an unknown
term called the error variable
ª it accounts for all the measurable and immeasurable variables that are not
part of the model
6 / 28
8. Simple linear regression model
i.e. First Order model
y = β0 + β1x + ε
where
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (rise/run, i.e. ∆y/∆x)
ε = error variable
ª The coefficients β0 and β1 are population parameters, which need to be estimated
ª The assumption is that the errors are normally distributed
7 / 28
9. Expected values and variance for y
The expected value of y is a linear function of x, and y differs
from its expected value by a random amount
ª linear regression is a probabilistic model
For x∗ = a particular value of x:
E(y | x∗) = µy|x∗ (mean)
V(y | x∗) = σ²y|x∗ (variance)
8 / 28
10. Estimating the Coefficients
We estimate the coefficients as we estimated population
parameters
That is, we draw a random sample from the population and
calculate sample statistics
But here the coefficients are part of a straight line, and we need
to estimate the line that best represents the sample data points
Least squares line
ŷ = b0 + b1x
here b0 = y-intercept, b1 = slope, and ŷ is the fitted value of y
9 / 28
11. Least squares method
cf. chap. 4 in Keller
The least squares method is an objective procedure for obtaining a
straight line where the sum of squared deviations between the
points and the line is minimized:
Σi=1..n (yi − ŷi)²
The least squares line coefficients:
b1 = sxy / sx²
b0 = ȳ − b1x̄
10 / 28
12. Least squares line coefficients
For b1 and b0
sxy = Σi=1..n (xi − x̄)(yi − ȳ) / (n − 1)
sx² = Σi=1..n (xi − x̄)² / (n − 1)
x̄ = Σi=1..n xi / n
ȳ = Σi=1..n yi / n
11 / 28
13. Least squares line coefficients
This actually means that the values of ŷ on average come
closest to the observed values of y
There are shortcut formulas for b1 (check sample variance p. 110
and sample covariance p. 127)
b0 and b1 are unbiased estimators of β0 and β1
12 / 28
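The coefficient formulas above can be checked by hand. A minimal Python sketch, assuming the Example 16.1 points can be read off the later plots as years 1–6 with bonuses 6, 1, 9, 5, 17, 12 (an assumption reconstructed from the deck, not Keller's printed table):

```python
# Least squares coefficients by hand: b1 = s_xy / s_x^2, b0 = ybar - b1*xbar.
# Data values are an assumption reconstructed from the deck's plots.
x = [1, 2, 3, 4, 5, 6]    # years of experience
y = [6, 1, 9, 5, 17, 12]  # annual bonus

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

b1 = s_xy / s_x2       # slope
b0 = ybar - b1 * xbar  # y-intercept

print(round(b1, 3), round(b0, 3))  # -> 2.114 0.933
```

These match the SPSS Coefficients table shown later (B = .933 for the constant, 2.114 for Years).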
14. EXAMPLE 16.1
Annual Bonus and Years of Experience
Determine the straight-line relationship between annual bonus and
years of experience
13 / 28
15. Working with SPSS
In SPSS we distinguish two main working windows:
1) Data Editor, where the raw data and variables are
displayed
2) Statistics Viewer, where scripts and reports are provided
Both windows have: MENU SUBMENU ... COMMAND
Each command corresponds to a function that bears one or
several ARGUMENTS
14 / 28
16. Working with SPSS
Command-line like
It is also possible to work directly with the functions
Example of the script for a regression:
REGRESSION
/DEPENDENT dependent-variable
/ENTER List-of.independents.
SPSS distinguishes between COMMANDS, FILES, VARIABLES, and
TRANSFORMATION EXPRESSIONS
15 / 28
18. Report in SPSS
GET
  FILE='C:auspssxm16-01.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Bonus
  /METHOD=ENTER Years.
Regression – Notes
  Output Created: 06-MAR-2014 12:45:25
  Input Data: C:auspssxm16-01.sav
  Active Dataset: DataSet1
  Filter / Weight / Split File: none
  N of Rows in Working Data File: 6
  Missing Value Handling: user-defined missing values are treated as
  missing; statistics are based on cases with no missing values for
  any variable used
  Syntax: the REGRESSION command shown above
17 / 28
19. Regression Report in SPSS
Variables Entered/Removed (a)
  Model 1 | Variables Entered: Years (b) | Variables Removed: . | Method: Enter
  a. Dependent Variable: Bonus
  b. All requested variables entered.

Model Summary
  Model 1 | R = .701 (a) | R Square = .491 | Adjusted R Square = .364 |
  Std. Error of the Estimate = 4.503
  a. Predictors: (Constant), Years

ANOVA (a)
  Model 1      Sum of Squares   df   Mean Square   F       Sig.
  Regression        78.229       1      78.229     3.858   .121 (b)
  Residual          81.105       4      20.276
  Total            159.333       5
  a. Dependent Variable: Bonus
  b. Predictors: (Constant), Years

Coefficients (a)
  Model 1      B       Std. Error   Beta    t       Sig.
  (Constant)   .933      4.192              .223    .835
  Years       2.114      1.076      .701   1.964    .121
  a. Dependent Variable: Bonus
18 / 28
20. Regression Plot from SPSS
Graphs Legacy Dialogs Scatter/Dot... Simple Scatter
[Scatterplot of Bonus (y-axis, 0–20) against Years (x-axis, 1–6), with
the fitted line y = 0.93 + 2.11x and R² Linear = 0.491]
19 / 28
21. Calculation of Residuals
The deviations of the actual data points from the line are the
residuals, which represent observations of the error variable:
ei = yi − ŷi
In this case the sum of squares for error (SSE) represents the
minimized sum of squared deviations
ª basis for other statistics to assess how well the linear model fits the data
The standard error of the estimate is the square root of SSE
divided by its degrees of freedom
ª Remember that in SPSS the residual sum of squares is given in the Anova
table of the regression report
20 / 28
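The residuals, SSE, and standard error of estimate can be computed directly. A sketch on the reconstructed Example 16.1 data (years 1–6, bonuses 6, 1, 9, 5, 17, 12; an assumption read off the deck's plots):

```python
import math

# Residuals e_i = y_i - yhat_i, their sum of squares (SSE), and the
# standard error of estimate s = sqrt(SSE / (n - 2)).
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
     / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

print([round(e, 4) for e in residuals])  # the six residuals on the next slides
print(round(sse, 3), round(s, 3))        # -> 81.105 4.503
```

SSE (81.105) and s (4.503) match the ANOVA and Model Summary tables above.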
22. Annual bonus and years of experience
[Scatterplot of bonus (0–15) against years of experience (0–7) for the
six observations in Example 16.1]
21 / 28
23. Annual bonus and years of experience: Residuals
[The same scatterplot with the fitted regression line; the vertical
distances from the points to the line are the residuals:
2.9524, −4.1619, 1.7238, −4.3905, 5.4952, −1.619]
22 / 28
24. Regression examples
Finance/economy:
– The enterprise equity value and total sales
– Number of VP executives and total assets
– Quantity of new houses and amount of jobs created in a city
– Amount of bananas harvest and the density of banana trees per km2
Social/health:
– Number of violent crime and the poverty rate
– Amount of infectious diseases and population growth
– Amount of diseases from chronic illnesses and urbanization level
– Number of kids raised and the number of spouses
24 / 28
25. Regression examples
Miscellaneous:
– IQ score development and the average global temperature per year
– If a horse can run X mph, how fast will his offspring run?
– Number of cigarettes smoked and number of chats having with people
– Number of cigarettes smoked and time at the hospital
ª (more politically correct!)
That is, questions like:
– For any set of values on an independent variable, what is my predicted
value of a dependent variable?
– If an independent variable increases its value by one unit, how does the
dependent variable respond?
25 / 28
27. Generating Random Numbers in SPSS
Variable View:
Create two variables for integers
Data View:
Choose number of observations in each variable
Transform Compute Variable
Arguments:
Variable names in Target Variable, and Random Numbers
in Function group
Choose uniform rv and establish the range of the obs. values
27 / 28
28. Summary
Simple linear regression analysis is for the relationship
between two interval variables
The assumption is that the variables are linearly connected
The intercept and the slope of the regression line are the
coefficients to be estimated
The least squares method produces estimates of these
population parameters
28 / 28
29. BUSINESS STATISTICS II
Lecture – Week 12
Antonio Rivero Ostoic
School of Business and Social Sciences
March
30. Today’s Outline
Review simple linear regression analysis
Error variable in regression
Model Assessment
– standard error of estimate
– testing the slope
– coefficient of determination
– other measures
2 / 26
31. Review Simple Linear Regression Analysis
Simple regression analysis serves to predict the value of a
variable from the value of another variable
A linear regression model describes the variability of the data
around the regression line
The observations on a dependent variable y are a linear function
of the observations on an independent variable x
The population parameters are expressed in two coefficients,
the y-intercept and the slope of the line, which need to be
estimated, plus a stochastic part
ª y-intercept: the value of y when x equals 0
ª slope: the change in y for one-unit increase in x
3 / 26
32. The Error Variable
Remember that in probabilistic models we need to account for
unknown and unmeasurable variables that represent noise or error
The error variable is critical in estimating the regression coefficients
– to establish whether there is a relationship between the dependent
and independent variables via an inferential method
– to estimate and predict through a regression equation
The errors are independent of each other, and the error variable is
normally distributed with mean 0 and standard deviation σε
ª This is expressed as ε ∼ N(0, σε)
4 / 26
33. Expected values of y
The dependent variable can be considered as a random
variable normally distributed with expected values
E(y) = β0 + β1x (mean)
σy = σε (standard deviation)
Thus the mean of y depends on the value of the independent
variable, whereas its standard deviation does not
ª the shape of the distribution remains the same, but E(y) changes according to x
5 / 26
34. Experimental data and Observations
We have been typically working with examples based on observations
However it is also possible to perform a controlled trial where we
generate experimental data
Regression analysis works with both types of data, since the main
goal is to determine how the IV is related to the DV
For observational data both variables are random, and their joint
probability is characterized by the bivariate normal distribution
ª here the z dimension is a joint density function of the two variables
These types of normality conditions are assumptions for the
estimations in a simple linear regression model
6 / 26
35. Assessing the Model
We use the least squares method to produce the best straight line
But a straight line may not be the best representation of the data
We need to assess how well the linear model fits the data
Methods to assess the model:
– standard error of estimate
– the t-test of the slope
– the coefficient of determination
all based on the SSE
7 / 26
36. Standard error of estimate
Recall the error variable assumptions: ε ∼ N(0, σε)
The model is considered poor if σε is large, and it is considered
perfect when the value is 0
Unfortunately we do not know this parameter, and we need to
estimate σε from the sample data
The estimation is based on the sum of squares for error (SSE)
ª which is the minimized sum of squared deviations between the points and the
regression line
SSE = Σi=1..n (yi − ŷi)² = (n − 1) (sy² − sxy²/sx²)
8 / 26
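The shortcut form of SSE agrees with the direct sum of squared residuals; a check on the reconstructed Example 16.1 data (years 1–6, bonuses 6, 1, 9, 5, 17, 12; an assumption):

```python
# Compare SSE computed directly with the shortcut
# SSE = (n - 1) * (s_y^2 - s_xy^2 / s_x^2).
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
s_y2 = sum((b - ybar) ** 2 for b in y) / (n - 1)

b1 = s_xy / s_x2
b0 = ybar - b1 * xbar
sse_direct = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sse_shortcut = (n - 1) * (s_y2 - s_xy ** 2 / s_x2)

print(round(sse_direct, 3), round(sse_shortcut, 3))  # -> 81.105 81.105
```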
37. Standard error of estimate
The standard error of estimate is the approximation of the
conditional standard deviation of the dependent variable
ª that is, the square root of the residual sum of squares divided by the
number of degrees of freedom
s = √( SSE / (n − 2) )
This is the square root of s², which in fact is the MSE
ª the df is actually the number of cases − the number of unknown parameters
IN THE SPSS REPORT:
The value for s is given in the Model Summary table for a linear
regression analysis
9 / 26
38. Testing the slope
In this case we test whether or not the dependent
variable is linearly related to the independent
variable
ª if it is not, then no matter what value x takes, we would obtain
the same value for ŷ
In other words, the slope of the line represented by β1
equals zero, and this corresponds to a horizontal line
in the plot
10 / 26
40. Testing the slope
If our null hypothesis is that there is no linear relationship between
the dependent and independent variables, then we specify
H0 : β1 = 0
H1 : β1 ≠ 0 (two-tail test)
If we do not reject H0, we either committed a Type II error (wrongly
accepting the null hypothesis), or there is not much of a ‘linear’ relationship
between the independent variable and the dependent variable
However the relationship can be a quadratic, which corresponds to
a polynomial regression
ª In case we want to check for a positive (β1 > 0) or a negative (β1 < 0)
linear relationship between the IV and DV, we perform a one-tail test
12 / 26
42. Estimator and sampling distribution
For drawing inferences, b1 serves as an unbiased estimator of β1
E(b1) = β1
with an estimated standard error
sb1 = s / √( (n − 1) sx² )
that is based on the sample variance of x
14 / 26
43. Estimator and sampling distribution
If ε ∼ N(0, σε) with values independent of each other, then we
use the t sampling distribution
Test statistic for β1:
t = (b1 − β1) / sb1
Thus the t-statistic is the ratio of a coefficient to its standard error
IN THE SPSS REPORT:
The t-statistic values are given in the Coefficients table of the linear
regression analysis
15 / 26
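The slope's standard error and t statistic follow directly from the formulas above; a sketch on the reconstructed Example 16.1 data (an assumption, as before):

```python
import math

# Standard error of b1 and the t statistic for H0: beta1 = 0.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)) / s_x2
b0 = ybar - b1 * xbar

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s = math.sqrt(sse / (n - 2))           # standard error of estimate
s_b1 = s / math.sqrt((n - 1) * s_x2)   # standard error of the slope
t = (b1 - 0) / s_b1                    # test statistic under H0: beta1 = 0

print(round(s_b1, 3), round(t, 3))  # -> 1.076 1.964
```

Both values match the SPSS Coefficients table (Std. Error 1.076, t 1.964).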
44. Estimator and sampling distribution
Confidence interval estimator of β1:
b1 ± tα/2 sb1
Test statistics and confidence interval estimators are based on a
Student t distribution with ν = n − 2
IN SPSS:
Confidence intervals are set under the fit-line Properties in the graph Chart Editor
16 / 26
45. Coefficient of Determination
To measure the strength of the linear relationship we use the
coefficient of determination, R2
ª useful to compare different models
R² = sxy² / (sx² sy²)
This is equal to
R² = 1 − SSE / Σ(yi − ȳ)²
17 / 26
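Both expressions for R² give the same number; a check on the reconstructed Example 16.1 data (an assumption, as before):

```python
# R^2 two ways: s_xy^2 / (s_x^2 * s_y^2) and 1 - SSE / SS(Total).
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
s_y2 = sum((b - ybar) ** 2 for b in y) / (n - 1)

b1 = s_xy / s_x2
b0 = ybar - b1 * xbar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sst = sum((b - ybar) ** 2 for b in y)

r2_a = s_xy ** 2 / (s_x2 * s_y2)
r2_b = 1 - sse / sst

print(round(r2_a, 3), round(r2_b, 3))  # -> 0.491 0.491
```

This is the R Square = .491 reported in the SPSS Model Summary.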
47. Partitioning deviations in Example 16.1 i = 5
[Scatterplot with the regression line, highlighting observation i = 5:
xi = 5, yi = 17, ŷi = 11.504, with sample means x̄ = 3.5 and ȳ = 8.33]
19 / 26
48. Partitioning deviations in Example 16.1 i = 5
[The same plot with the deviations for i = 5 marked: the total deviation
yi − ȳ is split into ŷi − ȳ (explained) and yi − ŷi (unexplained),
against the horizontal deviation xi − x̄]
20 / 26
49. Partitioning deviations in Example 16.1 i = 2
[The corresponding plot for observation i = 2: xi = 2, yi = 1, ŷi = 5.162]
21 / 26
50. Partitioning the deviations
(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)
The difference between yi and ȳ is a measure of the variation in the
dependent variable, and it equals:
a) the difference between ŷi and ȳ, which is accounted for by the difference
between xi and x̄
ª the variation in the DV that is explained by the changes of the IV
b) plus the difference between yi and ŷi, which represents the unexplained
variation in y
If we square all parts of the equation, and sum over all sample points,
we end up with a statistic for the variation in y
total SS = explained SS + residual SS
ª i.e. sum of squares for regression (SSR) and the sum of squares for error (SSE)
22 / 26
51. Coefficient of Determination
R² = 1 − SSE / Σ(yi − ȳ)²
   = [Σ(yi − ȳ)² − SSE] / Σ(yi − ȳ)²
   = [SS(Total) − SSE] / SS(Total)
This is the proportion of variation explained by the regression model,
which is the proportion of variation in y explained by x
IN THE SPSS REPORT:
R² is given in the Model Summary table of the regression analysis
23 / 26
52. Other measures to assess the model
Correlation coefficient
r = sxy / (sx sy)
We use a t-test for H0 : ρ = 0
t = r √( (n − 2) / (1 − r²) )
which is t distributed with ν = n − 2, provided the variables are
bivariate normally distributed
Calculate r in SPSS
Analyze Correlate Bivariate (select variables and choose Pearson)
24 / 26
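The correlation coefficient and its t statistic can be checked on the reconstructed Example 16.1 data (an assumption); note that this t equals the t from the slope test:

```python
import math

# Correlation coefficient r and its t statistic for H0: rho = 0.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
s_x = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
s_y = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))

r = s_xy / (s_x * s_y)
t = r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(r, 3), round(t, 3))  # -> 0.701 1.964
```

r = .701 is the R of the Model Summary, r² = .491 is the R Square, and t = 1.964 is the same t as for the slope.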
53. Other measures to assess the model
F-test
F = MSR / MSE
where MSR = SSR/1 and MSE = SSE/(n − 2)
This statistic is to test H0 : β1 = 0
IN THE SPSS REPORT:
• F-statistic value is given in the Anova table
• The value of r is in the Model Summary table, whereas the t statistic is
given in the Coefficients table of the regression analysis
25 / 26
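In simple regression the F statistic equals the square of the slope's t statistic; a check on the reconstructed Example 16.1 data (an assumption):

```python
import math

# F = MSR / MSE, and the identity F = t^2 in simple linear regression.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)) / s_x2
b0 = ybar - b1 * xbar

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sst = sum((b - ybar) ** 2 for b in y)
ssr = sst - sse

msr = ssr / 1
mse = sse / (n - 2)
f = msr / mse

t = b1 / math.sqrt(mse / ((n - 1) * s_x2))  # slope t statistic
print(round(f, 3))  # -> 3.858
```

F = 3.858 matches the ANOVA table, and 1.964² = 3.858.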
54. Summary
The error variable corresponds to the probabilistic part of the
regression model
ª independent errors that are normally distributed with mean 0 and standard deviation σε
The standard error of estimate serves to evaluate the
regression model by assessing the conditional standard
deviation of the dependent variable
By testing the slope we can check whether there is a linear
relationship or not between the independent and the
dependent variables
The coefficient of determination measures the strength of the
linear relationship in the regression model
26 / 26
55. BUSINESS STATISTICS II
Lecture – Week 13
Antonio Rivero Ostoic
School of Business and Social Sciences
March
57. Regression Equation
The regression equation represents the model, where the dependent
variable is the response of an independent explanatory variable
ª the model stands for the entire population
After assessing the model, our next task is to estimate and predict
the values of the dependent variable
In this case we distinguish the average response of the dependent
variable from the prediction of an individual value of the dependent
variable for a new observation of the independent variable
3 / 31
58. Estimating a mean value and predicting an individual value
If a linear model such as
y = β0 + β1x
is considered satisfactory for the data, then
ŷ = b0 + b1x
will represent the sample equation for the estimation of the
model
ª (Here we predict the error term to be 0)
4 / 31
59. Estimating a mean value and predicting an individual value
For x∗ representing a specific value of the independent variable:
ŷ = b0 + b1x∗
– is the point prediction of an individual value of the dependent
variable when the value of the independent variable is x∗
– is the point estimate of the mean value of the dependent
variable when the value of the independent variable is x∗
5 / 31
60. Interval estimators
A small p-value for H0 : β1 = 0 suggests a nonzero slope in
the regression line
However, for a better judgment we need to see how closely
the predicted value matches the true value of y
There are two interval estimators:
a) Prediction interval that predicts y for a given value of x
b) Confidence interval estimator that estimates the mean
of y for a given value of x
6 / 31
61. Prediction interval
individual intervals
ª Used if we want to predict a one-time occurrence for a particular value of y when x
has a given value
For ŷ = b0 + b1xg the prediction interval is
ŷ ± tα/2,n−2 · s · √( 1 + 1/n + (xg − x̄)² / ((n − 1)sx²) )
where xg is the given value of the independent variable
Another way to express this interval is x∗ → ŷ∗, which implies that for x∗,
a new value of x (or a tested value of x), the prediction interval for ŷ∗ is
ŷ∗ ± tα/2,n−2 · √( MSE · ( 1 + 1/n + (x∗ − x̄)²/sxx ) )
7 / 31
62. Confidence interval estimator
the average prediction interval
For E(y) = β0 + β1x (i.e. for the mean of the dependent
variable) the confidence interval estimator is
ŷ ± tα/2,n−2 · s · √( 1/n + (xg − x̄)² / ((n − 1)sx²) )
That is, for x∗ → ŷ∗, the mean prediction interval for ŷ∗ is
ŷ∗ ± tα/2,n−2 · √( MSE · ( 1/n + (x∗ − x̄)²/sxx ) )
ª where MSE equals σ̂², whereas sxx is the unnormalized form of V(X)
8 / 31
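The two interval formulas differ only by the leading 1 under the square root, so the prediction interval is always the wider one. A sketch at xg = 5 on the reconstructed Example 16.1 data; the data, xg, and the table value t.025,4 = 2.776 are all assumptions for illustration:

```python
import math

# Prediction interval vs. confidence interval half-widths at x_g = 5.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
s_x2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)) / s_x2
b0 = ybar - b1 * xbar
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s = math.sqrt(sse / (n - 2))

t_crit = 2.776  # t_{alpha/2, n-2} for alpha = .05 and n - 2 = 4 df
xg = 5
yhat = b0 + b1 * xg

pi_half = t_crit * s * math.sqrt(1 + 1 / n + (xg - xbar) ** 2 / ((n - 1) * s_x2))
ci_half = t_crit * s * math.sqrt(1 / n + (xg - xbar) ** 2 / ((n - 1) * s_x2))

print(round(yhat, 3))                          # point estimate/prediction
print(round(pi_half, 2), round(ci_half, 2))    # PI is wider than CI
```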
63. EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Data generation in SPSS
• Choose your DV and IV, and number of observations. Then generate
uniform random numbers:
Transform Compute Variable...
• Variable names in Target Variable , and Random Numbers in Function group
• Select Rv.Uniform in Functions and Special Variables , and then establish the
range of the observation values in Numeric Expression
9 / 31
64. EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Confidence intervals of the regression model in SPSS
• We perform the linear regression analysis
Analyze Regression Linear
• Individual confidence intervals are given in this command, where under the
Save button we select in Prediction Intervals
– the Individual option for the prediction interval
– the Mean option for the confidence interval estimator
Both at the usual 95% value
10 / 31
65. EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Confidence intervals of the regression model in SPSS (2)
• Since we have chosen Save , the confidence interval values are saved in
the Data Editor
ª here LMCI [UMCI] and LICI [UICI] stand respectively for Lower
[Upper] Mean and Individual Confidence Interval
The Variable View in the Data Editor gives the labels of the new variables
11 / 31
66. EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Visualizing confidence intervals in SPSS
• The visualization of both types of confidence intervals is possible after we
have plotted the variables
Graphs Legacy Dialogs Scatter/Dot... Simple Scatter
• From Elements Fit Line at Total of the graph Chart Editor, we look in the
tab Fit Line (Properties) for the options Mean and Individual in the
Confidence Intervals section for the two CI estimators
12 / 31
67. Confidence bands from SPSS
Example 16.2 in Keller
[Scatterplot of Price (13.5–16.5) against Odometer (10.0–50.0) with the
fitted line y = 17.25 − 0.07x, R² Linear = 0.648, and both confidence bands]
13 / 31
68. EXAMPLE-DO-IT-YOUR-SELF
[A predicted variable and a predictor variable]
Predict new observations in SPSS
• To forecast new observations, first we need to enter the new value of
the independent variable in the Data Editor
• Then we choose a linear regression analysis
Analyze Regression Linear
• And, after we press the Save button, we select the Unstandardized
option in Predicted Values
14 / 31
69. Regression Diagnostics
Here we are concerned with evaluating the prediction model, which
includes some error or noise
ei = yi − ŷi
thus the residual equals each observation minus its estimated
value
Recall that in regression analysis there are some assumptions made
for the error variable
ª the errors are independent of each other, normally distributed, and
have a constant variance
15 / 31
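A basic sanity check before any residual diagnostics: with an intercept in the model, least squares residuals always sum to (numerically) zero. Verified on the reconstructed Example 16.1 data (an assumption, as elsewhere in these notes):

```python
# Least squares residuals sum to zero when the model has an intercept.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
     / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
print(sum(residuals))  # ~ 0 up to floating-point error
```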
70. Regression Diagnostics
A regression diagnostics checks for two things:
a) whether or not the conditions for the error variable are fulfilled
b) for unusual observations (those that fall far from the
regression line), to determine whether or not these
values result from a fault in the sampling
we look at several diagnostic methods for unwanted conditions
16 / 31
71. Residual analysis
Residual analysis focuses on the differences between the
observations and the predictions made by the linear model
Residual Analysis in SPSS
Residual analysis is based on standardized and unstandardized residuals
• After choosing linear regression analysis
Analyze Regression Linear
• When we press the Save button, we select the Standardized and
Unstandardized options in Residuals
ª Recall that these values are recorded in the Data View of the Data Editor
17 / 31
72. Nonnormality
The nonnormality check of the error variable is made by
visualizing the distribution of the residuals
ª we use the histogram for this
Nonnormality in SPSS
The histogram of residuals is obtained from
Graphs Legacy Dialogs Histogram...
• And we choose RES (which corresponds to the unstandardized
residuals) for the Variable option
18 / 31
73. Nonnormality
Nonnormality in SPSS (2)
It is also possible to obtain the distribution shape in the histogram
• In the Chart Editor we go to
Elements Show Distribution
and choose Normal
19 / 31
74. Heteroscedasticity
Heteroscedasticity (or heteroskedasticity) is the term used when
the assumption of equal variance of the error variable is violated
ª homoscedasticity has the opposite implication, meaning ‘homogeneity of
variance’
To test the heterogeneity of variance in the error variable we can
plot the residuals against the predicted values of the DV
ª then we look for the spreading of the points; if the variation in ei = yi − ˆyi
increases as yi increases, the errors are called heteroscedastic
This type of graph is sometimes called the ei − ˆyi plot
20 / 31
75. Heteroscedasticity
Heteroscedasticity in SPSS
The heteroskedasticity condition is evaluated by the ei − ŷi plot
Graphs Legacy Dialogs Scatter/Dot... Simple Scatter
• And choosing RES (the unstandardized residuals) for the Y-axis,
and PRE (the predicted values) for the X-axis
• For the mean line of the residuals in the plot we go to the Chart Editor (by
double-clicking the graph in the report) and in
Options Y Axis Reference Line
• Select the Mean option in the Reference Line tab of Properties
21 / 31
76. Nonindependence of the Error variable
The nonindependence of the errors means that the residuals are
autocorrelated, i.e. correlated over time
To detect autocorrelation we can plot the residuals in a time period
and look for alternating or increment patterns
ª If no clear pattern appears in the plot, then there is an indication that the
residuals are independent of each other
Alternatively, to detect lack of independence between errors
without a time lapse, we can perform the Durbin-Watson test
ª where the null hypothesis is that no correlation exists, whereas the alternative
hypothesis is that a correlation exists; i.e. H0 : ρ = 0, and H1 : ρ ≠ 0
we look at this test in multiple regression analysis...
22 / 31
77. Nonindependence of the Error variable
Nonindependence of the error variable in SPSS
We now create a time variable in the EXAMPLE-DO-IT-YOUR-SELF,
and then index the observations with a vector sequence
Transform Compute Variable...
• Index (time) variable in Target Variable , and the Miscellaneous
option in Function group
• Select $Casenum in Functions and Special Variables
23 / 31
78. Nonindependence of the Error variable
Nonindependence of the error variable in SPSS (2)
After obtaining the unstandardized residuals, we plot these values...
Graphs Legacy Dialogs Line... Simple
• We select the Mean of the unstandardized residuals in the
Line Represents option, and the time variable in Category Axis
If we go to the Chart Editor we obtain the expected mean in
Options Y Axis Reference Line
24 / 31
79. Outliers
Outliers are unusual (small or large) observations in the sample,
which lie far away from the regression line
These points may suggest: an error in the sampling, a recording
mistake, or an unusual observation
ª we should disregard the observation in case of one of the first two possibilities
To detect outliers:
– we use scatter diagrams of the IV and DV with the
regression line
– we check the standardized residuals, where absolute values
larger than 2 may suggest an outlier
25 / 31
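The |standardized residual| > 2 rule of thumb can be applied by hand. A rough sketch dividing each residual by s on the reconstructed Example 16.1 data (an assumption; SPSS's saved ZRE values use a slightly different standardization):

```python
import math

# Rough standardized residuals e_i / s; |value| > 2 flags a potential outlier.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) \
     / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
z = [e / s for e in residuals]

print(max(abs(zi) for zi in z))  # largest magnitude, well below 2
```

None of the six points is flagged: the largest standardized residual is about 1.22.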
80. Outliers
Detection of outliers in SPSS
First we get the standardized residuals when choosing linear
regression analysis
Analyze Regression Linear
Under the Save button we select Standardized in Residuals
Then we obtain the absolute values of this variable
• ZRE 1 in Target Variable , and choose Arithmetic in Function group
• Select Abs in Functions and Special Variables and put this variable code in
the parentheses
26 / 31
81. Influential Observations
We also use scatter diagrams of the IV and DV with the regression
line to evaluate the impact of influential observations
ª we produce two plots, one with and another without the supposed influential obs.
Optionally, to detect influential observation we can use different
measures as well:
Leverage describes the influence each observed value has
on the fitted value for this observation
ª where Mahalanobis distance is a measure of leverage of the observation
Cook’s D (distance) detects dominant observations, either
outliers or observations with high leverage
ª an Influence plot is made of the Studentized Residuals (ei/SE) against
the leverages of the observations (called ‘hat’ values)
27 / 31
82. Cook’s Distance
Example 16.2 in Keller
[Index plot of Cook's distance (0.00–0.12) against observation number
(0–100); observations 19, 74, and 86 stand out]
28 / 31
84. Other aspects in Regression Diagnostics
• In the validation of linear model assumptions, we can also evaluate
the skewness, kurtosis in the distribution shape of the residuals...
• The prediction capability of the model can be assessed by looking
at the predicted SSE as well
(in multiple regression we also look at the collinearity among IVs)
30 / 31
85. Summary
For a given explanatory variable, we differentiate the individual
value of the response variable from its mean value
The prediction interval covers an individual value of the DV, whereas
the confidence interval estimator approximates the mean of the
response variable
Regression diagnostics is concerned with evaluating the prediction
model and the assumptions of the error variable
We look at the dominant points inducing the regression line for
assessing the prediction model, whereas much of the diagnostics
concentrates on the characteristics of the residuals
31 / 31
86. BUSINESS STATISTICS II
Lecture – Week 14
Antonio Rivero Ostoic
School of Business and Social Sciences
1 April 2014
87. Today’s Outline
Scaling and transformations
Standard error of estimates and standardized values
Step-by-step example with simple linear regression analysis
2 / 24
88. Scaling and transformations
Sometimes data transformation is needed in order to obtain,
e.g., a normal distribution
Transformations are mathematical adjustments applied to
scores in an attempt to make the distribution of the
outcomes fit requirements
Scaling (and re-scaling) is a linear transformation based on
proportions where the scores are enlarged or reduced
3 / 24
89. Data transformation
In a simple linear regression analysis we can perform a transformation
of both the explanatory and the response variables
For example in linear regression we may need to transform the data:
– when the residuals have a skewed distribution or they show
heteroscedasticity
– to linearize the relationship between the IV and the DV
– but also when theory suggests a transformed expression
– or to simplify the model in a multiple regression model
4 / 24
90. Scaling and transformations
Examples of transformations of the variable x are:
– Square root: √x
– Reciprocal: 1/x
– Natural log: ln(x) or log(x)
– Log 10: log10(x)
In linear regression we use least squares fitting
ª which treats the sum of squared residuals as a continuous,
differentiable quantity to be minimized
5 / 24
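As a sketch of why such transformations help: for hypothetical data generated from y = 2·e^(0.3x) (both parameter values made up), the relationship is not linear in x, but log(y) is, and least squares on (x, log y) recovers both parameters:

```python
import math

# Linearize an exponential relationship by a log transformation of y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * math.exp(0.3 * xi) for xi in x]  # hypothetical generated data

ly = [math.log(yi) for yi in y]  # log(y) = log(2) + 0.3 * x is exactly linear
n = len(x)
xbar = sum(x) / n
lybar = sum(ly) / n

b1 = sum((xi - xbar) * (li - lybar) for xi, li in zip(x, ly)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = lybar - b1 * xbar

print(round(b1, 6), round(math.exp(b0), 6))  # slope 0.3 and scale 2 recovered
```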
91. Logarithmic transformations
linear regression analysis
Model        Transformation               Regression equation
Linear       none                         y = β0 + β1x
Linear-log   x′ = log(x)                  y = β0 + β1 log(x)
Log-linear   y′ = log(y)                  log(y) = β0 + β1x
Log-log      x′ = log(x), y′ = log(y)     log(y) = β0 + β1 log(x)
ª log denotes the natural logarithm, with base e ≈ 2.72
ª The term ‘level’ is also used instead of ‘linear’ in logarithmic transformations
6 / 24
92. Logarithmic transformations
linear regression analysis
Model        Interpretation
Linear       A one-unit increase in x would lead to a β1 increase/decrease in y
Linear-log   A one-percent increase in x would lead to a β1/100 increase/decrease in y
Log-linear   A one-unit increase in x would lead to a β1 × 100% increase/decrease in y
Log-log      A one-percent increase in x would lead to a β1% increase/decrease in y
ª In econometrics, log-log relationships are referred to as “elastic” and the
coefficient of log(x) as the elasticity
7 / 24
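As a numeric check of the log-log interpretation, a sketch with hypothetical demand-style data where the true elasticity is −1.5 (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical log-log relationship: y = 100 * x**(-1.5), so elasticity = -1.5
x = np.linspace(1, 10, 50)
y = 100 * x ** (-1.5)

# Fit log(y) = b0 + b1*log(x); the slope b1 is the elasticity
X = np.column_stack([np.ones_like(x), np.log(x)])
b0, b1 = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
# b1 recovers -1.5: a one percent increase in x goes with about a 1.5% decrease in y
```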
93. Standard Error of Estimates
SE = square root of the sum of squared differences between the
criterion’s predicted and observed values, divided by the df
The squared differences between criterion’s predicted and observed
values corresponds to the Residual SS (SSE in Anova)
ª it represents the unexplained variation in the model (or model deviance)
The df equals number of cases − number of predictors in the model − 1
ª in a simple linear regression model there is only one predictor, and df equals n − 2
Thus most of the calculation for the SE of estimates corresponds to
the Residual SS
8 / 24
94. SE and Residual SS
SSE in SPSS
After having the data, to obtain the SSE we need first the predicted
values of our model
Analyze Regression Linear
• And in Save choose the Unstandardized option in Predicted
Values
9 / 24
95. SE and Residual SS
SSE in SPSS (2)
Then we calculate by hand the residuals (yi − ˆyi) in a new variable
created in the Variable View. We name this variable as RESID
• Then we go to
Transform Compute Variable...
and place RESID in Target Variable , and make the subtraction
operation with the expression:
DV − PRE_1
10 / 24
96. SE and Residual SS
SSE in SPSS (3)
The next step is to obtain the square of the residuals, and we use the
recently created variable (named RESID) for this.
The transformation of the residual values to their squares is
obtained after we place RESID in Target Variable and type in the
Numeric Expression field the square of the values:
RESID ** 2
11 / 24
97. SE and Residual SS
SSE in SPSS (4)
The sum of squares of the residuals, which is the numerator of the
SE, is obtained when we sum the values of this last variable
Analyze Reports Report Summaries in Columns...
and choose RESID for the Data Columns and select Display
grand total in Options . The Residual SS or SSE is given in the
Report of the Statistics Viewer as Grand Total.
ª in SPSS the SE of estimates is given in Model Summary, and the SSE and
df values are in the ANOVA table
12 / 24
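The SPSS steps above (save predicted values, build RESID, square it, take the grand total) can be reproduced directly in a few lines. A sketch with hypothetical data (numpy assumed; variable names chosen to mirror the SPSS procedure):

```python
import numpy as np

# Hypothetical data for a simple linear regression
x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
pre = X @ b                  # PRE_1: unstandardized predicted values
resid = y - pre              # RESID: residuals, DV - PRE_1
sse = np.sum(resid ** 2)     # Residual SS (the "Grand Total" in the report)
n = x.size
se = np.sqrt(sse / (n - 2))  # SE of estimates; df = n - 2 with one predictor
```

In SPSS the last value is what Model Summary reports as the standard error of the estimate.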
98. Standardized values
Standardized values have been transformed into a customary scale
Standardized Coefficient
In linear regression the standardized coefficient is the product
of the regression coefficient and the ratio of the standard
deviations of the IV and the DV
That is Beta (in SPSS) equals B ∗ (s(x)/s(y))
The standardized coefficient represents the change in the
mean of the dependent variable, in y standard deviations, for a
one standard deviation increase in the independent variable
13 / 24
99. Standardized values
Standardized Residuals
In SPSS we have various types of residuals:
– RES 1 stands for unstandardized residuals
– SRE 1 stands for Studentized residuals
– ZRE 1 stands for standardized residuals
And Keller (pp 653) tells us about the standardization of
variables in general and of the residuals in particular
ª subtract the mean and divide by the standard deviation
14 / 24
100. Standardized residuals
We get the Excel output table with the standardized residuals
for Example 16.2 (Keller, pp 653)
Now let us look at the SPSS results for this data...
? Hmmmmmmmmmmmmmm.... ?
15 / 24
101. Standardized residuals
The term ‘standardized residual’ is not a standardized term
In Keller “Standardized” residuals are residuals divided by the
standard error of the estimate (residual) (cf. pp 653)
However in SPSS these values (cf. Excel output pp 653)
correspond to the “Studentized” residuals
ª (even though the definition is for the Studentized deleted residuals)
In SPSS a standardized residual is the residual divided by the
standard deviation of data
ª Studentized residuals (another form for standardization) have a constant variance,
and combine the magnitude of the residual and the measure of influence
16 / 24
102. Standardized residuals
speaking the same language
Residuals (unstandardized) are the difference between
observations and expected values:
ε̂ = y − ŷ
In the case of a regression model standardized residuals are
normalized to a unit variance
The standard deviation or the square root of the variance of the
residuals corresponds to the sqrt of MSE (cf. lec. week 12)
ª this is also known as the root-mean-square deviation
Standardized residual = residual / √MSE
17 / 24
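The formula above is easy to verify numerically. A minimal sketch on hypothetical data, dividing each residual by the square root of the MSE:

```python
import numpy as np

# Hypothetical simple-regression data
x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8, 6.1])

X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b                          # unstandardized residuals
mse = np.sum(resid ** 2) / (x.size - 2)    # df = n - 2
std_resid = resid / np.sqrt(mse)           # standardized residuals
# the standardized residuals keep the sign and relative size of the raw ones
```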
103. Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
Be aware that in this case the model is chosen in advance, and
we adopt a linear relationship between two variables
18 / 24
104. Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
1. Determine the response and the explanatory variables
2. Visualize the data through a scatter plot
3. Perform basic descriptive statistics
19 / 24
105. Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
4. Estimate the coefficients (intercept and slope)
5. Compute the fitted values and the residuals
6. Obtain the sum of squares for errors (Residual SS)
20 / 24
106. Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
7. Estimate the coefficients (intercept and slope)
a) standard error of estimate
b) test of the slope
c) coefficient of determination
21 / 24
107. Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
8. Perform the regression diagnostics
a) confidence regions for individual prediction intervals
b) confidence regions for the average prediction interval
9. Make a residual analysis
a) nonnormality, heteroskedasticity, nonindependence errors
22 / 24
108. Step-by-step simple linear regression analysis
EXAMPLE
[Population and avg. Household Size in Global Cities]
10. Detect outliers and influential observations
11. Interpret the results
12. Draw the conclusions
23 / 24
109. BUSINESS STATISTICS II
Lecture – Week 15
Antonio Rivero Ostoic
School of Business and Social Sciences
April
AARHUS
UNIVERSITYAU
110. Today’s Outline
Multiple regression model
• coefficients • estimation • conditions • testing • diagnostics
Working example
(SE estimates, and fitting the model with logarithmic transformations)
2 / 17
111. Multiple regression model
While a simple regression analysis has a single independent variable,
in a multiple regression analysis we have several explanatory
variables for the response variable
A multiple regression model is represented by the equation
y = β0 + β1x1 + β2x2 + · · · + βkxk + ε
where y is the dependent variable, x1, x2, . . . , xk are independent variables,
and ε is the error variable
ª note that independent variables may be the product of transformations of other
variables (which are independent or not)
In this case parameters β1, β2, . . . , βk are the regression coefficients,
whereas β0 represents the intercept
3 / 17
112. Multiple regression model
It is important to note that the introduced multiple regression
equation represents in this case an additive model
Thus the effect of each independent variable on the response is
assumed to be the same for all values of the other predictors
ª certainly we need to assess whether the additive assumption is realistic or not
Q. Are we still considering a linear relationship in the multiple
regression model?
A. Yes, whenever the model has linear coefficients
4 / 17
113. Graphical representation
Multiple regression models are graphically represented by a
hyperplane with k dimensions for IVs
– for k = 2 the relationship between the IVs and the DV is
represented by a regression plane within a 3D space
– for k > 2 the model is represented by a regression or
response surface, a hyperplane that is not
possible for us to visualize
5 / 17
114. Interpreting Coefficients
In the multiple regression model β0 stands for the intersection of
the regression hyperplane, and represents the mean of y when
x’s equal 0
ª it only makes sense if the range of the data includes zero
βi, i = 1, . . . , k represent the change in the DV when xi changes
one unit while keeping the other IVs constant
When it is possible, interpret the regression coefficients as the
ceteris paribus effect of their variation on the dependent variable
ª i.e. “other things being equal” interpretation
6 / 17
115. Estimation
The estimation of the coefficients is given by the least squares
equation
ˆy = b0 + b1x1 + b2x2 + · · · + bkxk
for k independent variables
And the error variable is estimated as
ei = yi − ŷi
7 / 17
116. Required conditions
The required conditions of the error variable assumed in
a simple linear regression model remain for multiple
regression analysis
ª that is errors are independent, normally distributed with mean 0
and a constant σ
The standard error of the estimate has less df than in the
simple regression analysis
ª we want SE close to zero
8 / 17
117. Testing the regression model
We test the validity of the model with the following hypotheses
H0 : β1 = β2 = · · · = βk = 0
H1 : βi ≠ 0 for at least one i
ª The model is invalid in case we fail to reject the null hypothesis, whereas
whenever the alternative hypothesis is accepted then the model has some validity
Since in multiple regression models we have several competing
explanatory variables for a response variable, the assessment of
the model is central in the analysis
9 / 17
118. Testing the regression model
The test of significance of the model is based on the F statistic,
which means that we focus on the variation of the outcomes
The F-test is the ratio of the Mean Squares of Regression
and Residual
F = (SSR/k) / (SSE/(n − k − 1)) = MSR / MSE
Recall that SSR represents the explained variation in the model,
whereas SSE is the unexplained variation
ª we want a high value for SSR and a low value of SSE, since this indicates
that most of the variation in the response variable is explained by the model
10 / 17
119. Testing the regression model
For the F-test the rejection of H0 applies when
F > Fα, k, n−k−1
ª hence for a given α level we infer a difference in the regression coefficients
in case that the F statistic value falls within the rejection region
Another way to assess the model is through the coefficient of
determination or R2, whose interpretation is similar to the simple
regression analysis
ª we want R2 close to one
11 / 17
120. Test of individual coefficients
Based on the test of significance of the multiple regression model
we can perform individual t tests for each regression coefficient
H0 : βi = 0
H1 : βi ≠ 0 (two-tail test)
The test statistic is
t = (bi − βi) / sbi
12 / 17
121. Test of individual coefficients
And the confidence intervals are
bi ± tα/2, n−k−1 · sbi
for i = 1, . . . , k
We reject the null hypothesis iff
|t| > tα/2, n−k−1
(for a two-tailed test)
13 / 17
122. Adjusted R-squared
When we add explanatory variables to the multiple regression model
we cannot decrease the value of the coefficient of determination
ª but it is possible to get a very high R2 even when the true model is not linear
Thus the adjusted R-squared is often used to summarize the multiple
fit as it takes into account the number of variables in the model
ª it is the coefficient of determination adjusted for df
Adjusted R2 = 1 − MSE / MS Total
where MSE = SSE/(n − k − 1), and MS Total is the sample variance of y
Adjusted R2 ≤ R2
14 / 17
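The adjusted R2 formula above can be checked numerically. A sketch on hypothetical data with two predictors (numpy assumed; the data are invented for illustration):

```python
import numpy as np

# Hypothetical data: y depends on two predictors plus noise
rng = np.random.default_rng(1)
n, k = 40, 2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(0, 0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sse = np.sum(resid ** 2)
ss_total = np.sum((y - y.mean()) ** 2)

r2 = 1 - sse / ss_total
# Adjusted R2 = 1 - MSE / MS Total, with MSE = SSE/(n-k-1) and MS Total = SS Total/(n-1)
adj_r2 = 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))
```

With the df adjustment the penalty grows with k, which is why Adjusted R2 ≤ R2 always holds.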
123. Regression diagnostics: multicollinearity
In addition to nonnormality and heteroskedasticity, the regression
diagnostics for a multiple model checks also for multicollinearity
Multicollinearity occurs when two or more independent variables
are highly correlated with one another
ª hence it is very difficult to separate their particular effects and influences on y
It causes inflated standard errors for estimates of regression
parameters and very large regression coefficients
Some consequences of this inflation are:
– a large variability of the samples, which causes that the sample
coefficients may be far from the population parameters, and hence
with wide confidence intervals
– small t statistics that suggest no linear relationship between involved
variables and the response variable, and such inference may be wrong
15 / 17
124. Multicollinearity
Multicollinearity can be avoided if one anticipates the problem
from theory or past experiences
ª multiple correlation scores can serve as a guide
Beware that two independent variables can be highly correlated
with each other (or with another predictor) but uncorrelated with
the dependent variable
ª they may be non-redundant suppressor variables
A stepwise regression (backward and forward) can serve to
minimize multicollinearity in the modelling
ª these methods are based on improving the model’s fit
16 / 17
125. Multiple regression analysis
WORKING EXAMPLE
[Prediction of avg. Household Size in Global Cities]
Multiple regression analysis using globalcity-multiple.sav
17 / 17
126. BUSINESS STATISTICS II
Lecture – Week 17
Antonio Rivero Ostoic
School of Business and Social Sciences
April
AARHUS
UNIVERSITYAU
127. Today’s Outline
Model building in multiple linear regression
– predictors
Comparing regression models
Stepwise regression
Working example
– model building
– model comparison
Further issues (...)
2 / 16
128. Model building in multiple linear regression
The main goal in model building is to fit a model that explains
variation of the dependent variable with a small set of predictors
ª i.e. a model that efficiently forecasts the response variable of interest
When dealing with multiple independent variables, each subset of
x’s represents a potential model of explanation
ª for k predictors in the data set there are 2^k − 1 subsets of independent variables
Thus we want to establish a linear equation that predicts ‘best’ the
values of y by using more than one explanatory variable
Recall that to obtain a good model we need an R2 score closer to 1, a
small value for SE, and a large F statistic (which implies a small SSE)
3 / 16
129. Predictors
There are two types of independent variables to consider, and they
correspond to the numeric and the categorical variables
– Factors characterize qualitative data
– Covariates represent quantitative data
Predictors = Factors + Covariates
Sometimes an abstraction made on a numeric variable is called a factor that
explains the theory in the regression model, and covariate is simply a control
variable
4 / 16
130. Comparing Regression Models
cf. F-general in Note 2
To test whether a model fits significantly better than a simpler model
In this case a restricted or reduced model is nested within an
unrestricted or complete model
ª i.e. one model is contained in another model
The test statistics can be based on the SSE or on the R2 values for
both models
Fchange = ((R2U − R2R) / df1) / ((1 − R2U) / df2)
where df1 = q = kU − kR (i.e. number of variable restrictions), and
df2 = n − kU − 1
5 / 16
131. Comparing Regression Models
F-general with sum of squares
On the other hand, by considering the sum of squares of the
residuals, the F statistics becomes
Fchange = ((SSER − SSEU) / df1) / (SSEU / df2)
with the same df’s as before, and we take the absolute value
SPSS
We need to combine in Analyze Regression Linear the two models
with a different variable selection Method (Enter and Remove in
Blocks 1 and 2), and check R squared change in Statistics...
6 / 16
132. Comparing Regression Models
nested models
SPSS
The syntax procedure for comparing two nested models is..:
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA CHANGE
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT y
/METHOD=ENTER x1 x2
/METHOD=REMOVE x2.
7 / 16
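The same nested-model comparison that the SPSS syntax above produces can be sketched directly from the two residual sums of squares. Hypothetical data; the unrestricted model uses x1 and x2, the restricted model drops x2 (mirroring the ENTER / REMOVE blocks):

```python
import numpy as np

# Hypothetical data where x2 genuinely contributes to y
rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 1.0, size=n)

def sse(X, y):
    """Residual sum of squares of a least squares fit."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum(resid ** 2)

X_u = np.column_stack([np.ones(n), x1, x2])  # unrestricted, k_U = 2
X_r = np.column_stack([np.ones(n), x1])      # restricted,   k_R = 1
k_u, k_r = 2, 1
df1 = k_u - k_r            # number of variable restrictions
df2 = n - k_u - 1
f_change = ((sse(X_r, y) - sse(X_u, y)) / df1) / (sse(X_u, y) / df2)
# a large F_change favours keeping x2 in the model
```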
133. Comparing Regression Models
...that for the data in Note 2 produces this outcome for both models:
Model Summary

Model  R     R Square  Adj. R Square  Std. Error of  R Square  F Change  df1  df2  Sig. F
                                      the Estimate   Change                          Change
1      ,55a  ,304      ,297           67,45215       ,304      48,426    3    333   ,000
2      ,41b  ,167      ,164           73,57910       -,137     32,811    2    333   ,000

a. Predictors: (Constant), years potential experience, years of education, years with current employer
b. Predictors: (Constant), years of education
ª the Fchange for Model 2 is for kU = 3 and kR = 1
this statistic is also equivalent to the F score in the analysis of
variance of both models
8 / 16
134. Stepwise regression
Variable selection
A sequential procedure to perform multiple regressions is
found in the stepwise method
It combines forward selection of predictors and backward
elimination of the independent variables
These are bottom-up and top-down processes based on
F scores and predefined p values
ª defaults in SPSS are 5% for IN, and 10% for OUT
9 / 16
138. Avg. household size in global cities
models 4 and 5
The F change in the two nested models is given in:
Model Summary

Model  R      R Square  Adj. R Square  Std. Error of  R Square  F Change  df1  df2  Sig. F
                                       the Estimate   Change                          Change
1      ,805a  ,648      ,641           1,01542        ,648      84,113    5    228   ,000
2      ,798b  ,637      ,631           1,02944        -,011     7,367     1    228   ,007

a. Predictors: (Constant), Percent Woman Head of Households, Informal Employment, Average Income Q3
Person, Overall Child Mortality, Household Connection to Water
b. Predictors: (Constant), Informal Employment, Average Income Q3 Person, Overall Child Mortality,
Household Connection to Water
13 / 16
139. Avg. household size in global cities
the final model?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4130 0.3705 14.61 0.0000
x10 −0.0191 0.0031 −6.13 0.0000
x3 −0.0001 0.0000 −3.95 0.0001
x5 0.0790 0.0157 5.04 0.0000
x9 0.0131 0.0041 3.18 0.0017
x6 −0.0104 0.0038 −2.71 0.0072
And what about this other one..?
y = x4 + x5 + x6 + x8 + x9 + x10
14 / 16
141. Summary Conclusions
Find a parsimonious model that effectively explains y
Model comparison combines evaluation of the fits and the
significance of regression coefficients
ª available automated procedures
To compare nested models we use the F statistics
ª working example, and data in note 2
WORKING EXAMPLE:
“It seems that the inclusion of the ratio of woman head of households
improves the model, but does it contribute to explain the change in the
average of the household size in the global cities?”
16 / 16
142. BUSINESS STATISTICS II
Lecture – Week 18
Antonio Rivero Ostoic
School of Business and Social Sciences
April
AARHUS
UNIVERSITYAU
144. Polynomial regression
Polynomial regression is a particular case of a regression model that
produces curvilinear relationship between response and predictor
Recall that simple regression equations represent first-order models
y = β0 + β1x +
Here the order of the equation p equals 1 and the relation between
the predictor and the response is depicted by a regression line
ª the model has a ‘degree 1 polynomial’
We can have regression equations with several independent
variables that are polynomial models and still having just one
predictor variable
Remember that when the parameters in the equation are linearly related,
then the polynomial regression model is considered as linear
3 / 20
145. First order and polynomial regression models
• First order model with two predictors: x1 and x2
y = β0 + β1x1 + β2x2 +
• First order model with k predictors: x1, . . . , xk
y = β0 + β1x1 + β2x2 + · · · + βkxk +
• Polynomial model with one predictor variable x and order p
y = β0 + β1x + β2x² + · · · + βpx^p + ε
ª thus a predictor variable can have various orders or powers
4 / 20
146. Second-order models
• A second-order (polynomial) model with a single predictor variable
has p = 2 and the equation represents a quadratic response
function depicted by a parabola
ª a ‘degree 2 polynomial’ or quadratic polynomial
y = β0 + β1x + β2x² + ε
β1 controls the translation of the parabola, and β2 its
curvature rate
5 / 20
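Fitting a second-order model is still ordinary least squares on an augmented design matrix with an x² column. A sketch on hypothetical, noise-free data so the coefficients are recovered exactly:

```python
import numpy as np

# Hypothetical exact parabola: y = 1 + 0.5*x + 2*x^2
x = np.linspace(-3, 3, 30)
y = 1 + 0.5 * x + 2 * x ** 2

# Second-order (quadratic) model fitted by least squares
X = np.column_stack([np.ones_like(x), x, x ** 2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
# b2 > 0 gives a convex (upward-opening) parabola
```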
147. Quadratic effect of the regression coefficient
second-order model with β2x²
[Plots of y against x: β2 = 1 gives a convex parabola; β2 = −1 gives a concave one]
6 / 20
148. Third-order models
• A third-order (polynomial) model with a single predictor variable
has p = 3 and the equation represents a cubic response function
and depicted as a sigmoid curve
ª a ‘degree 3 polynomial’
y = β0 + β1x + β2x² + β3x³ + ε
there are three regression coefficients that control for two
curvatures
7 / 20
149. Cubic effect of the regression coefficients
third-order model (β1 and β2 fixed)
[Plots of y against x: one sigmoid curve for β3 > 0 and one for β3 < 0]
8 / 20
150. Higher-order models and several predictor variables
Models with order > 3 are seldom used in regression analysis
ª typically because of the overfitting in the model and the poor prediction power
However, so far we have seen multiple regression equations
involving several predictors that are related in an additive model
ª that is, the effect of each IV was not influenced by the other variables
As illustration, consider a monomial model with two predictors (from
the WORKING EXAMPLE)
y = 5.47 − .03 x10 + .02 x9
(avg. household size as a function of access to water and informal employment)
for x9 = 1 then ˆy = 5.49 − .03 x10
for x9 = 50 then ˆy = 6.47 − .03 x10
for x9 = 99 then ˆy = 7.45 − .03 x10
9 / 20
151. Additive model with 2 predictors
[Scatterplot of y against x (0–100) with fitted line ŷ = 5.49 − 0.03x]
10 / 20
152. Additive model with 2 predictors
[Scatterplot of y against x (0–100) with parallel fitted lines
ŷ = 5.49 − 0.03x and ŷ = 6.47 − 0.03x]
11 / 20
153. Additive model with 2 predictors
[Scatterplot of y against x (0–100) with parallel fitted lines
ŷ = 5.49 − 0.03x, ŷ = 6.47 − 0.03x, and ŷ = 7.45 − 0.03x]
12 / 20
154. Comparing models
Note 3
Four models: (1) first order; (2) second order; (3) linear-log; (4) log-linear
a) The t test is used to compare models (1) and (2)
ª since (1) is the reduced version of (2) we can use the Fchange score for nested
models, where t = √F
b) Models (1) and (3) are not nested; we choose the one with the better fit
c) Models (2) and (3) are not nested either, and we rely on R2 since they have a
different number of predictors (performances are almost identical here...)
d) Comparing a log-linear model with an untransformed response requires
another approach and it is out of the scope...
d) Comparing a log-linear model with an untransformed response requires
another approach and it is out of the scope...
13 / 20
155. Regression models with interaction
Many times the effect of a certain explanatory variable on the
response is affected by the value of another predictor of the model
In such cases there is an interaction between the two predictors,
and the influence of these variables on y does not operate in a
simple additive pattern
A first order model with interaction:
y = β0 + β1 x1 + β2 x2 + β3 x1 x2 +
where the effect of x1 on the response is influenced by x2 and vice-versa
An interaction exists in the regression model when a regression
coefficient varies with a different value of another coefficient
ª not easy to interpret
14 / 20
156. Example
A model with the two predictors and interaction from the
WORKING EXAMPLE
y = 6.58 − .04 x10 + .00 x9 + .00 x10 x9
produces no interaction because in the model b3 equals zero
ª this may be explained by the high correlation between y and x9
15 / 20
157. Estimating multiple regression with interaction
An important concern with multiple regression is that lower order
variables are highly correlated with their interactions
Centering and standardization of predictors correct this problem
ª Centering implies re-scaling the predictors by subtracting the mean from each
observation, and by dividing the centering scores with the standard deviation of the
variable we standardize the predictors
Model with interaction from the WORKING EXAMPLE with standardized
values
y = 1.11 − .50 x10 + .35 x9 + .16 x10 x9
for x9 = 1 then ˆy = 1.46 − .34 x10
for x9 = 2 then ˆy = 1.81 − .18 x10
which means that the fitted lines are not parallel as with the additive model
16 / 20
158. Higher order models with interaction
Higher order models with interaction produce quadratic, cubic
(W, M or other shape) relationships between the response and
each of the predictors
Model with a quadratic relationship and interaction
y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
will produce parabolas with crossing trajectories...
17 / 20
159. Regression with dummy variables
Until now we have been doing regression analysis using interval
scales of the data only
However in many cases we may have qualitative data that are
represented by a nominal scale, and treating this type of data as
interval brings misleading results
We can perform regression analysis by using dummy or indicator
variables, which are artificial variables that encode the belonging or
not of an observation to a certain group or category
ª code 1 for belonging, and code 0 otherwise
Indicator or dummy variables are just for classification purposes and
the magnitude used is not applicable in this context
18 / 20
160. Regression with dummy variables
For 3 categories we use 2
indicator variables
I 1 I 2
Category 1 1 0
Category 2 0 1
Category 3 0 0
For 4 categories we use 3
indicator variables...
I 1 I 2 I 3
Category 1 1 0 0
Category 2 0 1 0
Category 3 0 0 1
Category 4 0 0 0
How many dummies are required for a variable having two categories?
19 / 20
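The coding tables above (m categories, m − 1 indicator variables) can be sketched in a few lines of pure Python; the category names and the helper function here are hypothetical, chosen only for the illustration:

```python
# Dummy coding a 3-category variable with 2 indicator variables;
# the last level acts as the omitted reference category
categories = ["cat1", "cat2", "cat3", "cat1", "cat3"]

def dummies(values, levels):
    """Return one 0/1 indicator column per non-reference level."""
    # levels[:-1] drops the reference category, giving m - 1 indicators
    return {lev: [1 if v == lev else 0 for v in values] for lev in levels[:-1]}

ind = dummies(categories, ["cat1", "cat2", "cat3"])
# ind["cat1"] -> [1, 0, 0, 1, 0]; ind["cat2"] -> [0, 1, 0, 0, 0]
# "cat3" observations are the rows where both indicators equal 0
```

For a variable with two categories this yields a single indicator, which answers the question on the slide.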
161. Dummies with command-line
We need to create a number of dummy variables according to
the existing number of categories.
Syntax in SPSS:
RECODE varlist_1 (oldvalue=newvalue) ... (oldvalue=newvalue)
[INTO varlist_2].
[/varlist_n].
EXECUTE.
20 / 20
162. BUSINESS STATISTICS II
Lecture – Week 19
Antonio Rivero Ostoic
School of Business and Social Sciences
May
AARHUS
UNIVERSITYAU
164. Qualitative independent variables
The effects of qualitative information on a response variable may
be an important result, and we need ways to include this type of
data in a regression model
Qualitative information corresponds to a nominal scale that may
require a pre-coding of the data into artificial variables known as
dummies or indicator variables
Recall that a nominal scale includes different categories or
groups that serve to classify the observations, and qualitative
predictors are factors
A dichotomous factor has two categories (e.g. gender), whereas
a polytomous factor has more categories (e.g. seasons)
3 / 24
165. Indicator variables (dummies)
Indicator variables have only two values, typically 1 and 0, and for
m categories in the variable, we require m − 1 indicator variables
ª this means that there is an omitted category in the representation to avoid
redundancy
Ii = 1 if obs. belongs to a category ci, and 0 otherwise
The omitted category represents the baseline or ‘reference’
category to which we compare the other groups
ª the decision to choose the omitted category is arbitrary, and it leads to the
same conclusion
If we do not omit one category and include indicator variables for
all categories in the regression model, then there is a perfect
multicollinearity among these independent variables
ª a phenomenon known as the dummy variable trap
4 / 24
166. Dataset for Notes 3 and 4
training data337.sav
Dependent variable: Wage, average hourly earnings (DKK)
Independent variables: Educ, education (years)
Tenur, current employment (years)
Exper, potential experience (years)
Female, gender (0: male, 1: female)
(Male, gender (0: female, 1: male))
5 / 24
167. Simple regression with an indicator variable
(dichotomous factor)
“The gender wage gap”
Are women paid less than men according to the data?
Wage = β0 + β1 Female +
Estimate Std. Error t value Pr(>|t|)
(Intercept) 161.9242 5.5013 29.43 0.0000
Female -62.8700 8.1117 -7.75 0.0000
Women earn 62.87 DKK per hour less than men
6 / 24
168. Simple regression with an indicator variable II
(dichotomous factor)
For a variable Male = 1 − Female, and the model:
Wage = β0 + β1 Male +
we get the following results:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.0542 5.9612 16.62 0.0000
Male 62.8700 8.1117 7.75 0.0000
Likewise men earn 62.87 DKK per hour more than women
7 / 24
169. The dummy variable trap
What about this model?:
Wage = β0 + β1Female + β2Male +
In this case there is a duplicated category and the independent
variables are perfectly multicollinear
Male is an exact linear function of Female and of the intercept
ª Male = 1 − Female implies that Male + Female = 1
8 / 24
170. Multiple Regression with a dichotomous indicator variable
(factor and covariates)
An additive dummy-regression model:
Wage = β0 + β1Female + β2Educ + β3Tenure +
• (We already know that the model fit or R2 never decreases when we
add to the model new independent variables)
• The model now assumes that – besides gender – there is an effect
of education and tenure on the wage levels
• Since the model is additive the predictors are independent of each
other, and the regression equation fits identical slopes for all the
categories in gender and for the other predictors as well
ª which implies parallel regression lines in the scatterplot
9 / 24
171. Testing partial coefficients
For model:
Wage = β0 + β1Female + β2Educ + β3Tenure +
Test the partial effect of gender:
H0 : β1 = 0
H1 : β1 ≠ 0
Test the partial effect of education:
H0 : β2 = 0
H1 : β2 ≠ 0
Test the partial effect of tenure:
H0 : β3 = 0
H1 : β3 ≠ 0
10 / 24
172. Testing partial coefficients
The t-test is the coefficient divided by the SE of the estimate
ti = (bi − βi) / sbi
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.2529 20.5869 -2.39 0.0173
Female -46.7547 7.1544 -6.54 0.0000
Education 13.9233 1.4564 9.56 0.0000
Tenure 3.2485 0.4729 6.87 0.0000
11 / 24
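The t values in a table like the one above are just each estimate divided by its standard error. A sketch on hypothetical data, using the usual least squares formulas (numpy assumed; the variables do not correspond to the wage data set):

```python
import numpy as np

# Hypothetical data: y depends on x1, while x2 has no true effect
rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 1.0 * x1 + 0.0 * x2 + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
mse = np.sum(resid ** 2) / (n - X.shape[1])            # df = n - k - 1
se_b = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))  # SE of each coefficient
t = b / se_b                                           # tests H0: beta_i = 0
# the t for x1 should be large; the t for x2 should be modest
```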
173. Fitted values by gender: Additive model
Wage = β0 + β1Female + β2Educ + β3Tenure +
[Linear Regression scatterplot: Unstandardized Predicted Value against years of
education, groups Female and Male, fit line for Total; R2 Linear = 0,435]
12 / 24
174. Fitted values by gender: Additive model
Wage = β0 + β1Female + β2Educ + β3Tenure +
[Linear Regression scatterplot: Unstandardized Predicted Value against years of
education, with parallel fit lines by group;
Male: R2 Linear = 0,575; Female: R2 Linear = 0,715]
13 / 24
175. Multiple regression with interaction: factor and covariate
(indicator variable and continuous variable)
• Many times the additive models are unrealistic, and theory suggests
different slopes for different categories
• To capture such difference in slopes we assume statistical interaction
among independent variables
Wage = β0 + β1Female + β2Educ + β3(Female × Educ) +
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.1088 26.1498 -0.69 0.4891
Female -23.9223 41.7171 -0.57 0.5667
Educ 13.7154 1.9550 7.02 0.0000
Female × Educ -2.6485 3.1844 -0.83 0.4062
ª The effect of gender on wage is influenced by education and vice-versa (no sig.)
14 / 24
176. Fitted values by gender: Interaction model
Wage = β0 + β1Female + β2Educ + β3(Female × Educ) +
[Linear Regression scatterplot: Unstandardized Predicted Value against years of
education, with non-parallel fit lines by group;
Male: R2 Linear = 1; Female: R2 Linear = 1]
15 / 24
177. Testing interaction
We can test for interaction in the model
Wage = β0 + β1Female + β2Educ + β3(Female × Educ) +
The null hypothesis is that there is no interaction in the model, i.e.
H0 : β3 = 0
H1 : β3 ≠ 0
We apply now the F-general (or F incremental) statistics...
Fchange = ((R2U − R2R) / df1) / ((1 − R2U) / df2)
where df1 = q = kU − kR (i.e. number of variable restrictions), and
df2 = n − kU − 1
ª In this case the complete or unrestricted model has the statistical
interaction term whereas the reduced model does not have this term
16 / 24
178. Testing interaction
In an additive dummy-regression model it is possible to test for effect
of categorical variable on the response controlling for a quantitative
predictor, and vice-versa ( i.e. test for effect of a covariate controlling
for factor)
e.g. test gender on wage controlling for education, and test
education controlling for gender
In such cases the null hypothesis is that the coefficient of the variable
to be tested equals zero
17 / 24
179. Multiple Regression with a polytomous indicator variable
Data from Keller xm16-02.sav
A polytomous indicator variable has more than two categories:
Price = β0 + β1Odometer + β2I1 + β3I2 + ε
I1 = 1 if colour is white, 0 otherwise
I2 = 1 if colour is silver, 0 otherwise
• The reference category is ‘all other colours’, which is represented
whenever I1 = I2 = 0
18 / 24
180. Multiple Regression with a polytomous indicator variable
• In a multiple regression with a polytomous indicator variable we obtain
coefficients for each group except for the reference category
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.8372 0.1971 85.42 0.0000
Odometer -0.0591 0.0051 -11.67 0.0000
White 0.0911 0.0729 1.25 0.2143
Silver 0.3304 0.0816 4.05 0.0001
• The t-test is adequate for the covariate (i.e. odometer), but for colour we
prefer to test the two indicator variables simultaneously, because
the choice of the reference category is arbitrary
ª the F test allows us to do this
• Part of the interpretation of the results assumes that one or more
indicator variables equal 0
19 / 24
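The dummy coding behind the Price model above can be sketched in a few lines of Python; the function name and column order are illustrative, not from the slides:

```python
# Dummy coding for a polytomous indicator (colour) with 'all other
# colours' as the reference category, as in the Price model above.
def dummy_code(colours, levels=("white", "silver")):
    """Map each colour to (I1, I2); the reference category gets (0, 0)."""
    return [tuple(1 if c == lvl else 0 for lvl in levels) for c in colours]

rows = dummy_code(["white", "silver", "blue", "white"])
# -> [(1, 0), (0, 1), (0, 0), (1, 0)]
```

A category with c levels always yields c − 1 indicator columns; 'blue' falls in the reference group and gets zeros in both columns.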
182. Interpreting Results
Recall that the interpretation in regression analysis is on average, it
considers the units of measure of the involved variables, and in additive
models is by holding constant the values of the other variables
(including the error)
In regression with indicator variables the coefficients corresponding to
these variables represent a variation on the response with respect to
the other groups in the model
The statistical significance of the regression coefficients should be
reported together with the interpretation of their effects on the
response, not on its own
The conclusions should account for the values of the regression
coefficients and the statistical significance of these outcomes
21 / 24
183. Interpreting logarithmic transformations
log is the natural logarithm, with base e
In Note 3 models (3) and (4) have logarithmic transformations on
variables, and we will see how to interpret the results in these models
Model (3), level-log
Wage = β0 + β1Educ + β2 log(Tenure) +
In this level-log model, a one percent increase in years of experience
(tenure) leads to a b2/100 unit change in wage:
unit ΔWage / % ΔTenure = b2/100
ª Since b2 = 31.32, then –holding education constant– a one percent change in
tenure is associated with 0.3132 DKK increase hourly in wage on average
22 / 24
184. Interpreting logarithmic transformations
Model (4), log-level
log(Wage) = β0 + β1Educ + β2Tenure +
This is a log-linear model where a one unit increase in the predictor
leads to a bi × 100 % change in wage:
% ΔWage / unit Δxi = bi × 100
• Holding education constant, a one year increase with current employer is
associated with 2.5% increase in wage per hour on average
• Holding tenure constant, 1 year more of education is associated with 10.4%
increase in wage per hour on average
23 / 24
185. Interpreting logarithmic transformations
log-log models are interpreted as elasticities
i.e. the ratio of the percent change in one variable to the percent
change in another variable
% Δy / % Δxi = bi
• One percent change in xi is associated with bi% change in y
(ceteris paribus)
partial elasticity when we hold the other variables constant
24 / 24
186. BUSINESS STATISTICS II
Lecture – Week 20
Antonio Rivero Ostoic
School of Business and Social Sciences
May
AARHUS UNIVERSITY
189. Exam 2013
The exam 2013 had 8 questions, and some were based on a
single data set
The data set contained 13 labor market related variables
(though one transformed) among 762 observations from men
and women
ª however not all variables were needed to answer the questions
After reading the instructions carefully, check the data with the
software, and label the variables with the provided
descriptions and units of measure (if specified)
4 / 27
190. Comparing groups
Q1a) Do wages differ by gender?
• Implied variables: Wage K, and gender B
• Groups to compare: Wages for men and wages for women
Plot data in SPSS
ª Plot histogram for wage grouped by gender (B)
Graphs Legacy Plots Histogram where the variable is paneled by the two
groups (optional normal curve)...
5 / 27
191. Comparing groups
Q1a) Do wages differ by gender?
We compare the means of these two groups through the t-test
However, we first need to determine whether these groups have equal
variances
ª to know whether to use the pooled or the unpooled version of the t test
Thus we perform the F test for equality of variances first
Obtain basic descriptive statistics in SPSS
Analyze Reports Case Summaries... where the variable is paneled by the
two groups...
ª uncheck Display Cases and choose statistics
6 / 27
192. Review: F test and sample variance
H0 : σ1²/σ2² = 1
H1 : σ1²/σ2² ≠ 1
F = (s1²/σ1²) / (s2²/σ2²) = s1²/s2²
for v1 = n1 − 1 and v2 = n2 − 1
where for 1, 2, ..., n observations:
variance
s² = Σ (xi − x̄)² / (n − 1)
7 / 27
193. Review: t test and sample mean
independent samples and H0 : µ1 = µ2
pooled
t = [(x̄1 − x̄2) − (µ1 − µ2)] / √[ sp² (1/n1 + 1/n2) ]
where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
unpooled
t = [(x̄1 − x̄2) − (µ1 − µ2)] / √( s1²/n1 + s2²/n2 )
v = n1 + n2 − 2 when σ1² = σ2²
v = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ] when σ1² ≠ σ2²
where for 1, 2, ..., n observations:
mean
x̄ = Σ xi / n
8 / 27
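The pooled and unpooled statistics above need only summary statistics; a minimal Python sketch (function names are illustrative, not from the slides):

```python
import math

# t statistics from summary statistics, following the pooled and
# unpooled formulas above, under H0: mu1 - mu2 = 0.
def pooled_t(x1bar, x2bar, s1sq, s2sq, n1, n2):
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    t = (x1bar - x2bar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # statistic and degrees of freedom

def welch_t(x1bar, x2bar, s1sq, s2sq, n1, n2):
    se2 = s1sq / n1 + s2sq / n2
    t = (x1bar - x2bar) / math.sqrt(se2)
    v = se2 ** 2 / ((s1sq / n1) ** 2 / (n1 - 1)
                    + (s2sq / n2) ** 2 / (n2 - 1))
    return t, v  # statistic and Welch-Satterthwaite df
```

With equal sample variances and sizes the two versions coincide, and the Welch df reduces to n1 + n2 − 2.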
194. F test to wages by gender
After obtaining the F statistic, we check the critical values with
the respective degrees of freedom and the standard alpha value
ª use the Excel calculator or/and the table for the F-distribution
In this case the F ratio falls within the critical region, which means
that we reject H0 of equal variances, i.e. that the F ratio equals 1
ª the p-value indicates that the result is statistically significant
Both outcomes suggest that there is evidence to infer that the ratio
of the variances differs from 1
We now know that we can proceed with the analysis applying the
unpooled t test
9 / 27
195. t test to wages by gender
Q1a) Do wages differ by gender?
Although in this part the calculations are done by hand, you can
compare your results with the output from SPSS
t test in SPSS
Analyze Compare Means Independent-Samples T Test... and the test variable
K is paneled by the two groups in B
ª We Define Groups... by putting 0 and 1 that characterize the gender variable
Confidence intervals are also given in the table of the t test for
independent samples...
10 / 27
196. Comparing groups
Q1b) Find a 95% confidence interval for tenure by gender
• Implied variables: Tenure G, and gender B
• Groups to compare: Tenure for men and for women
In this case we use the pooled t test with the confidence interval
estimator
11 / 27
197. Review: Confidence intervals for t test
pooled
Confidence interval estimator of µ1 − µ2 when σ1² = σ2²
(x̄1 − x̄2) ± tα/2 √[ sp² (1/n1 + 1/n2) ]
for v = n1 + n2 − 2
12 / 27
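The pooled interval estimator above is again direct arithmetic on summary statistics. A sketch in Python; the t multiplier must be looked up for v = n1 + n2 − 2, and the inputs below are hypothetical:

```python
import math

# Pooled-variance confidence interval for mu1 - mu2, following the
# estimator above; t_mult is the table value t_{alpha/2, n1+n2-2}.
def pooled_ci(x1bar, x2bar, s1sq, s2sq, n1, n2, t_mult):
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
    half = t_mult * math.sqrt(sp2 * (1 / n1 + 1 / n2))
    d = x1bar - x2bar
    return d - half, d + half

# Hypothetical inputs; 1.984 is roughly t_{.025} for v = 98
lo, hi = pooled_ci(10, 8, 4, 4, 50, 50, 1.984)
```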
198. Comparing groups
Q1c) Find a 95% CI by gender with 15 years of education
• Implied variables: Education I, and gender B
• Groups to compare: Men and women, 15 yrs. of educ.
In this case the difference is between population proportions
13 / 27
199. Review: Confidence Interval of p1 − p2
(p̂1 − p̂2) ± zα/2 √[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ]
for unequal proportions, and n1p̂1, n1(1 − p̂1), n2p̂2, and n2(1 − p̂2) ≥ 5
For the number of successes in the two populations, x1 and x2
p̂1 = x1/n1 and p̂2 = x2/n2
14 / 27
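The interval for p1 − p2 follows directly from the success counts and sample sizes; a minimal sketch with hypothetical counts:

```python
import math

# Confidence interval for p1 - p2 from success counts x1, x2 and
# sample sizes n1, n2, following the formula above (z = 1.96 for 95%).
def prop_ci(x1, n1, x2, n2, z=1.96):
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

# Hypothetical counts: 40/100 successes vs 25/100 successes
lo, hi = prop_ci(40, 100, 25, 100)
```

Before relying on the normal approximation, check the sample-size condition n·p̂ and n·(1 − p̂) ≥ 5 stated above.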
200. Test of proportion p1 − p2
As when we compared means, the calculations for proportions should be
made manually. However to find x1, x2, and n1, n2 you can use SPSS
Proportion success in SPSS
Analyze Descriptive Statistics Crosstabs... where I is contrasted by B
- Alternatively, you can create a new indicator variable, say PL15 = 1 iff I 3
Transform Recode into Different Variables , and
in Old and New Values... recode to 1 the Range category 3 through 4, and 0
otherwise, after naming the new variable
Then get a report
Analyze Reports Case Summaries... where PL15 is the Variable that is
grouped by B, and specifying Number of Cases and Sum in Statistics
15 / 27
201. Test of proportion p1 − p2
By combining the two categorical variables, we obtain the
sample proportion estimates for both groups, men and women
ª and we can then proceed with the arithmetic calculations
For 95% confidence interval, the multiplier zα/2 is 1.96
ª the score comes from the z table in Keller for 1 − (.05/2)
16 / 27
202. Comparing groups
Summary comment
Q1a. Men earn significantly more than women
Q1b. With 95% confidence, the difference lies in the interval 2.8 to 4
years: men have more years of market experience than women
Q1c. With 95% confidence, the difference lies in the interval 10 to 24
percentage points: men have a lower schooling level than women
• Relate the implied variables of wage, tenure, and education
level for both groups
• Explain why the differences might occur in that way...
ª eventually using other variables from the data
17 / 27
203. Regression analysis
Q4a) Estimation and regression diagnostics
for an additive log linear regression model
Dependent variable: M, the natural logarithm of K, wage (hourly)
Independent variables: B, gender (male = 1, female = 0)
C, education (years)
G, market experience or tenure (years)
18 / 27
204. Regression analysis
Q4. where log = ln
The regression equation represents a log-level model:
ln(Wage) = β0 + β1 Male + β2 Educ + β3 Tenure + ε
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4353 0.0594 74.65 0.0000
Male 0.1268 0.0132 9.63 0.0000
Educ 0.0431 0.0031 13.79 0.0000
Tenure 0.0089 0.0019 4.69 0.0000
19 / 27
205. Model diagnostics
Multiple regression
After performing linear regression analysis...
Check the assumptions
ε | x ∼ N(0, σ²)
and evaluate multicollinearity by:
• looking at the correlation among the variables
• viewing the histogram of the standardized residuals for the model
• plotting the residuals against predicted values
20 / 27
206. Regression results
Q4b) Interpretation of the estimation results
The fitted model is
ˆy = 4.435 + .127 · B + .043 · C + .009 · G
This means that men earn 12.7% more than women, and that wages
rise by 4.3% for an extra year of education and by almost 1% for an
extra year of market experience
ª interpretation as ceteris paribus or all things being equal
Then we interpret individual coefficients in the log-level model as the
percentage change in y for a unit change in xi
21 / 27
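The slides use the approximation b × 100% for the percent effect; as an aside, the exact effect of a unit change in a log-level model is (e^b − 1) × 100%, which is close for small coefficients. A sketch using the fitted male coefficient:

```python
import math

# Percent effect of a one-unit change in a log-level model.
# The slides use the approximation b * 100; the exact value is
# (e^b - 1) * 100, which matters more for large coefficients.
def pct_effect(b, exact=False):
    return (math.exp(b) - 1) * 100 if exact else b * 100

approx = pct_effect(0.127)              # approximation: 12.7%
exact = pct_effect(0.127, exact=True)   # exact: about 13.5%
```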
207. Regression results
Q4c.
However sometimes we need fitted values in the units of measure
of the untransformed response given a set of values in the IVs
ª e.g., How much wage is expected for a man or a woman having 12 years of
education and 15 years of experience?
In this case we apply the exponential function to both sides of the
regression equation
e^ln(ŷ) = e^(b0 + b1x1 + b2x2 + b3x3)
where e ≈ 2.718282
It means for the model that we obtain the value of K rather than the
value of ln(K)
22 / 27
208. Regression results
Q4c.
The fitted value for a man with 12 years of education, and 15 years of
market experience is
ˆy = 4.435 + .127 · 1 + .043 · 12 + .009 · 15 = 5.213
and the expected return on wage is
e^5.213 = 183.64 hourly (in DKK)
On the other hand, for a woman with similar level of education and
experience the fitted value is
ˆy = 4.435 + .127 · 0 + .043 · 12 + .009 · 15 = 5.086
and the expected return on wage is
e^5.086 = 161.69 hourly (in DKK)
23 / 27
209. Prediction interval
Manual regression analysis
Q6) Construct a 95% prediction interval for y given x = 30
Where the fitted line for n = 100 and R2 = .755 is
ˆy = 6.92 + .237x
Some descriptives for location and dispersion are:
• x̄ = 13 and sx² = 121
• ȳ = 10 and sy² = 9
And the Anova table shows:
• SSR = 672.61, df = 1
• SSE = 218.39, df = 98
24 / 27
210. Prediction interval
Q6.
The prediction interval for ŷ* | x* = 30
ŷ* ± tα/2, n−k−1 · √[ MSE · (1 + 1/n + (x* − x̄)² / ((n − 1) sx²)) ]
where
MSE = s² = SSE / (n − k − 1)
ª In this case, the fitted value ŷ* on the left of the interval estimator is
obtained from the fitted line at x* = 30, with k = 1
ª the multiplier can be obtained from the MS Excel calculator for the t distribution
25 / 27
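The whole Q6 calculation can be checked with the summary numbers given; the t multiplier 1.984 (α/2 = .025, df = 98) is assumed from a t table:

```python
import math

# 95% prediction interval at x* = 30 from the Q6 summary statistics.
n, k = 100, 1
sse, xbar, sx2 = 218.39, 13, 121
yhat = 6.92 + 0.237 * 30  # fitted value at x* = 30
mse = sse / (n - k - 1)   # MSE = SSE / (n - k - 1)
half = 1.984 * math.sqrt(mse * (1 + 1 / n
                                + (30 - xbar) ** 2 / ((n - 1) * sx2)))
interval = (yhat - half, yhat + half)  # roughly (11.0, 17.0)
```

Note how far x* = 30 is from x̄ = 13; the (x* − x̄)² term widens the interval for predictions away from the centre of the data.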
211. Prediction interval
Q6. without Anova table
It is also possible to calculate SSE from the sample variances and R²
SSE = (n − 1) · (sy² − sxy²/sx²)
where the squared covariance sxy² = R² · sx² · sy² (this is from
R² = sxy² / (sx² sy²))
Or alternatively:
SSE = SSy · (1 − R²)
where SSy = (n − 1) · sy²
Thus there are various possibilities for the calculations...
26 / 27
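Both routes to SSE above can be verified numerically with the Q6 values; they agree with the Anova value 218.39 up to the rounding of R²:

```python
# Recovering SSE without the Anova table, using the two identities above.
n, r2, sx2, sy2 = 100, 0.755, 121, 9
sxy2 = r2 * sx2 * sy2                 # squared covariance from R²
sse_a = (n - 1) * (sy2 - sxy2 / sx2)  # via the covariance route
sse_b = (n - 1) * sy2 * (1 - r2)      # via SSy * (1 - R²)
# both are about 218.3, versus SSE = 218.39 from the Anova table
```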