Moving Beyond Linearity [ISLR.2013.Ch7-7]
Theodore Grammatikopoulos∗
Tue 6th Jan, 2015
Abstract
Linear models are relatively simple to describe and implement, and have advantages
over other approaches in terms of interpretation and inference. However, standard
linear regression can have significant limitations in terms of predictive power. This is
because the linearity assumption is almost always an approximation, and sometimes a
poor one. In Article 6 we saw that we can improve on least squares estimation by
using Ridge Regression, the Lasso, Principal Components Regression (PCR), Partial Least
Squares (PLS), and other techniques. In that setting, the improvement is obtained by
reducing the complexity of the linear model, and hence the variance of the estimates.
But we are still using a linear model, which can only be improved so far. Here,
we relax the linearity assumption while still attempting to maintain as much
interpretability as possible. We do this by examining very simple extensions of linear
models like Polynomial Regression and Piece-wise Step Functions, as well as more
sophisticated approaches such as Splines, Local Regression, and Generalized Additive
Models (GAMs).
## OTN License Agreement: Oracle Technology Network -
Developer
## Oracle Distribution of R version 3.0.1 (--) Good Sport
## Copyright (C) The R Foundation for Statistical Computing
## Platform: x86_64-unknown-linux-gnu (64-bit)
∗ e-mail: tgrammat@gmail.com
1 Non-Linear Modeling
In this lab we re-analyze the Wage data set considered in the examples throughout Chapter
7 of “ISLR.2013” [James et al., 2013]. We will see that many of the complex non-linear
fitting procedures discussed in this Chapter can be easily implemented in R.
library(ISLR)
attach(Wage)
1.1 Polynomial Regression
As a first attempt to describe employees’ wages in terms of their age, we will try a polynomial regression fit. We will determine the most appropriate polynomial degree and afterwards examine to what extent such a fit is successful.
The reason to search for a non-linear fit of the wage ∼ age dependence is almost apparent from the corresponding plot of the two variables, shown in Figure 1 below. First, the employee data set can easily be separated into two groups, a “High Earners” group and a “Low Earners” one. Secondly, the wage ∼ age dependence of the “Low Earners” group is certainly non-linear and most probably of degree higher than two.
Next, we should decide on the exact degree of the polynomial to use. In the article “Linear Model Selection and Regularization (ISLR.2013.Ch6-6)”, we studied two different ways to do so: either by applying some subset variable selection method or by using cross-validation. Here, we discuss an alternative approach, the so-called hypothesis testing.
More specifically, we can fit models ranging from linear to degree-5 polynomial and seek to
determine the simplest model which is sufficient to explain the wage ∼ age relationship.
To do so we perform analysis of variance (ANOVA, F-test), by using the anova() function,
in order to test the null hypothesis that a model M1 is sufficient to explain the data against
the alternative hypothesis that a more complex model M2 is required. In order to use the
anova() function, M1 and M2 must be nested models: the predictors in M1 must be a
subset of the predictors in M2. In this case, we fit five different models and sequentially
compare the simpler model to the more complex model.
lm.Poly1.fit <- lm(wage ~ age, data = Wage)
lm.Poly2.fit <- lm(wage ~ poly(age, 2), data = Wage)
lm.Poly3.fit <- lm(wage ~ poly(age, 3), data = Wage)
lm.Poly4.fit <- lm(wage ~ poly(age, 4), data = Wage)
lm.Poly5.fit <- lm(wage ~ poly(age, 5), data = Wage)
Figure 1: Degree-4 polynomial fit of employees’ wages versus their age. The probability of an employee being a high earner as a function of age is also depicted.
anova(lm.Poly1.fit, lm.Poly2.fit, lm.Poly3.fit, lm.Poly4.fit,
lm.Poly5.fit)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero (< 10^-15), indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (∼ 0.0017), so the quadratic fit is also insufficient. The p-value comparing the cubic to the degree-4 polynomial, Model 3 and Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary because its p-value is ∼ 0.37 with a small F-statistic. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.
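As a cross-check, we could also have selected the degree by cross-validation, mentioned at the start of this section as an alternative. A minimal sketch using 10-fold CV via cv.glm() from the boot package (an addition to the original lab; the seed and number of folds are arbitrary choices):

library(boot)
set.seed(42)
cv.errors <- rep(NA, 5)
for (d in 1:5) {
    # glm() with its default gaussian family fits the same model as lm(),
    # but returns an object that cv.glm() can work with
    glm.fit.d <- glm(wage ~ poly(age, d), data = Wage)
    cv.errors[d] <- cv.glm(Wage, glm.fit.d, K = 10)$delta[1]
}
cv.errors

The degree with the smallest estimated test error should, in general, agree with the choice suggested by the ANOVA table above.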
Here, we choose to describe the wage ∼ age dependence by a quartic polynomial, i.e.:
wage ∼ Intercept + β0 ∗ age + β1 ∗ age^2 + β2 ∗ age^3 + β3 ∗ age^4 .
The estimated coefficients calculated by the method can be retrieved by the following call
lm.Poly4.fit.Wage <- lm.Poly4.fit
coef(summary(lm.Poly4.fit.Wage))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
Next, we create a grid of values for age at which we want predictions, call the generic predict() function, and calculate the standard errors
ageMinMax <- range(Wage$age)
age.grid <- seq(from = ageMinMax[1], to = ageMinMax[2])
preds.poly <- predict(lm.Poly4.fit.Wage, newdata = list(age = age.grid),
    se.fit = TRUE)
se.bands <- cbind(preds.poly$fit + 2 * preds.poly$se.fit,
    preds.poly$fit - 2 * preds.poly$se.fit)
Finally, we plot the data and add the degree-4 Polynomial fit
par(mfrow = c(1, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(Wage$age, Wage$wage, xlim = ageMinMax, cex = 0.5, col = "darkgrey",
    xlab = "Age", ylab = "Wage")
title("Wage vs Age Fit \n Degree-4 Polynomial [Wage]", outer = TRUE)
lines(age.grid, preds.poly$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 2, col = "blue", lty = 3)
As shown in Figure 1, the employee data set can easily be separated into two groups, a “High Earners” group and a “Low Earners” one. To estimate the probability that an employee earns more than $250,000 per year, we create the appropriate binary response via the indicator I(wage > 250)
glm.Poly4.binomial.fit.Wage <- glm(I(wage > 250) ~ poly(age, 4),
    data = Wage, family = binomial)
and make the predictions as before.
glm.Poly4.binomial.preds.Wage <- predict(glm.Poly4.binomial.fit.Wage,
    newdata = list(age = age.grid), se.fit = TRUE)
However, calculating the probability P(Wage > 250 | Age) and its corresponding confidence intervals is slightly more involved than in the linear regression case. The default prediction type for a glm() model is type = "link", which is what we use here. This means we get predictions for the logit, i.e. we have fit a model of the form

log[ P(Y = 1 | X) / (1 − P(Y = 1 | X)) ] = Xβ ,    (1)

so the predictions, as well as their standard errors, are on the Xβ (logit) scale. Therefore, if we want to plot P(Wage > 250 | Age) as a function of the employee’s age, we have to transform the resulting fit back accordingly, that is

P(Y = 1 | X) = exp(Xβ) / (1 + exp(Xβ)) ,    (2)
or in R code
preds <- glm.Poly4.binomial.preds.Wage
pfit <- exp(preds$fit)/(1 + exp(preds$fit))
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit,
    preds$fit - 2 * preds$se.fit)
se.bands <- exp(se.bands.logit)/(1 + exp(se.bands.logit))
and plot the result which is shown in the left panel of Figure 1.
plot(age, I(wage > 250), xlim = ageMinMax, type = "n", ylim = c(0, 0.2),
    xlab = "Age", ylab = "P(Wage>250|Age)")
points(jitter(age), I(wage > 250)/5, cex = 0.5, pch = "|",
    col = "darkgrey")
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 2, lty = 3, col = "blue")
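As a sanity check (this snippet is an addition to the lab), one can instead request type = "response" from predict(), which returns the probabilities directly and should agree with the manual transformation above. The link-scale construction is still preferable for the bands, because standard-error bands computed directly on the response scale could extend outside [0, 1].

# Probabilities obtained directly on the response scale
pfit.resp <- predict(glm.Poly4.binomial.fit.Wage,
    newdata = list(age = age.grid), type = "response")
max(abs(pfit - pfit.resp))  # essentially zero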
It is interesting to note here that the call
coef(summary(lm.Poly4.fit.Wage))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
returns a matrix whose columns are a basis of orthogonal polynomials, which essentially means that each column is a linear combination of the variables age, age^2, age^3 and age^4. However, we can also obtain a direct fit in the {age, age^2, age^3, age^4} basis by setting raw = TRUE in the previous code, as shown below. This does not affect the model in a meaningful way: the choice of basis clearly affects the coefficient estimates, but it does not affect the fitted values obtained.
# Direct Fit in {age,age^2,age^3,age^4} Basis
lm.Poly4.fit.Wage2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(lm.Poly4.fit.Wage2))
## Estimate Std. Error
## (Intercept) -1.841542e+02 6.004038e+01
## poly(age, 4, raw = TRUE)1 2.124552e+01 5.886748e+00
## poly(age, 4, raw = TRUE)2 -5.638593e-01 2.061083e-01
## poly(age, 4, raw = TRUE)3 6.810688e-03 3.065931e-03
## poly(age, 4, raw = TRUE)4 -3.203830e-05 1.641359e-05
## t value Pr(>|t|)
## (Intercept) -3.067172 0.0021802539
## poly(age, 4, raw = TRUE)1 3.609042 0.0003123618
## poly(age, 4, raw = TRUE)2 -2.735743 0.0062606446
## poly(age, 4, raw = TRUE)3 2.221409 0.0263977518
## poly(age, 4, raw = TRUE)4 -1.951938 0.0510386498
Two other equivalent ways of calculating the same fit, while protecting the power terms of age (with the I() wrapper or with cbind()), are the following:
# Direct Fit in {age,age^2,age^3,age^4} Basis
lm.Poly4.fit.Wage3 <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4),
    data = Wage)
coef(summary(lm.Poly4.fit.Wage3))
# Direct Fit in {age,age^2,age^3,age^4} Basis
lm.Poly4.fit.Wage4 <- lm(wage ~ cbind(age, age^2, age^3, age^4),
    data = Wage)
coef(summary(lm.Poly4.fit.Wage4))
Comparing the fitted values obtained in either case, we find them essentially identical, as expected.
preds.raw <- predict(lm.Poly4.fit.Wage2, newdata = list(age = age.grid),
    se.fit = TRUE)
max(abs(preds.poly$fit - preds.raw$fit))
## [1] 8.739676e-12
Note:
The ANOVA method also works in more general cases, that is, when terms other than orthogonal polynomials are also included. For example, we can use anova() to compare the following four models
fit.1.Wage <- lm(wage ~ education + age, data = Wage)
fit.2.Wage <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3.Wage <- lm(wage ~ education + poly(age, 3), data = Wage)
fit.4.Wage <- lm(wage ~ education + poly(age, 4), data = Wage)
anova(fit.1.Wage, fit.2.Wage, fit.3.Wage, fit.4.Wage)
## Analysis of Variance Table
##
## Model 1: wage ~ education + age
## Model 2: wage ~ education + poly(age, 2)
## Model 3: wage ~ education + poly(age, 3)
## Model 4: wage ~ education + poly(age, 4)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2994 3867992
## 2 2993 3725395 1 142597 114.6595 < 2e-16 ***
## 3 2992 3719809 1 5587 4.4921 0.03413 *
## 4 2991 3719777 1 32 0.0255 0.87308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
giving an outcome which actually supports the third model, i.e.:
Model 3 : wage ∼ education + poly(age, 3) .
Now, comparing this new model with the one we examined before, i.e.
wage ∼ Intercept + β0 ∗ age + β1 ∗ age^2 + β2 ∗ age^3 + β3 ∗ age^4 ,
we obtain the following results
# Split the data set into a Train and a Test part
set.seed(356)
train <- sample(c(TRUE, FALSE), nrow(Wage), replace = TRUE)
test <- (!train)
# wage ~ poly(age,4) Model
lm.Poly4.fit <- lm(wage ~ poly(age, 4), data = Wage[train, ])
preds.polyNew <- predict(lm.Poly4.fit, newdata = Wage[test, ],
    se.fit = TRUE)
# wage ~ education + poly(age,3) Model
fit.3.Wage <- lm(wage ~ education + poly(age, 3), data = Wage[train, ])
preds.fit.3 <- predict(fit.3.Wage, newdata = Wage[test, ],
    se.fit = TRUE)
# MSEs calculation
mse.polyNew <- mean((Wage$wage[test] - preds.polyNew$fit)^2)
mse.polyNew
## [1] 1622
mse.fit.3 <- mean((Wage$wage[test] - preds.fit.3$fit)^2)
mse.fit.3
## [1] 1278.3
which suggests that the new model,
Model 3 : wage ∼ education + poly(age, 3) ,
is in fact a better fit for predicting the employee’s wage.
1.2 Piece-wise constant functions
Here we try to fit a piece-wise constant function to describe employees’ wages in terms of their age. To do so, we use the cut() function as shown below
stepfunction.lm.fit.Wage <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(stepfunction.lm.fit.Wage))
## Estimate Std. Error t value
## (Intercept) 94.158 1.476 63.790
## cut(age, 4)(33.5,49] 24.053 1.829 13.148
## cut(age, 4)(49,64.5] 23.665 2.068 11.443
## cut(age, 4)(64.5,80.1] 7.641 4.987 1.532
## Pr(>|t|)
## (Intercept) 0.000e+00
## cut(age, 4)(33.5,49] 1.982e-38
## cut(age, 4)(49,64.5] 1.041e-29
## cut(age, 4)(64.5,80.1] 1.256e-01
The function cut() returns an ordered categorical variable; the lm() function then creates a set of dummy variables for use in the regression. The age < 33.5 category is left out, so the intercept coefficient of $94,158 can be interpreted as the average salary for those under 33.5 years of age, and the other coefficients can be interpreted as the average additional salary for those in the other age groups. Of course, we can produce predictions
and plots just as we did in the case of the polynomial fit.
Finally, note that the cut() function automatically picked the cut-points of the age variable. However, one can also impose cut-points of one’s own choice by using the breaks option of the function, as in the sketch below.
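A minimal sketch with hand-picked cut-points (the break values below are arbitrary illustrative choices, not taken from the original analysis; the breaks must cover the observed range of age):

# Custom cut-points via the breaks argument
stepfunction.lm.fit.Wage2 <- lm(wage ~ cut(age, breaks = c(17, 30, 45, 60, 81)),
    data = Wage)
coef(summary(stepfunction.lm.fit.Wage2))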
2 Splines
2.1 Regression Splines
Regression spline fits can be produced in R by loading the splines library. First, we construct an appropriate matrix of basis functions for a specified set of knots by calling the bs{splines}() function.
library(splines)
basis.fxknots <- bs(Wage$age, knots = c(25, 40, 60))
dim(basis.fxknots)
## [1] 3000 6
Alternatively, we can let the library determine the knots by specifying only the required degrees of freedom, df. For a degree-d spline with K knots, one needs (d + 1)(K + 1) − d K = d + K + 1 “dofs”, or d + K “dofs” if there is no intercept in the model. In particular, for a cubic spline basis without an intercept (the default) and with 6 “dofs”, the model is constrained to use exactly 3 knots, which are placed at uniform quantiles of the age variable.
basis.fxdf1 <- bs(Wage$age, df = 6, intercept = FALSE)
dim(basis.fxdf1)
## [1] 3000 6
attr(basis.fxdf1, "knots")
## 25% 50% 75%
## 33.75 42.00 51.00
If we demand that this model also have an intercept:
basis.fxdf2 <- bs(Wage$age, df = 6, intercept = TRUE)
dim(basis.fxdf2)
## [1] 3000 6
attr(basis.fxdf2, "knots")
## 33.33% 66.67%
## 37 48
whereas for a polynomial of one degree higher, i.e. a quartic spline:
basis.fxdf3 <- bs(Wage$age, df = 6, degree = 4, intercept = TRUE)
dim(basis.fxdf3)
## [1] 3000 6
attr(basis.fxdf3, "knots")
## 50%
## 42
The first case of the cubic splines referenced above seems more promising. To produce a
prediction fit
# Produce an age Grid
ageMinMax <- range(Wage$age)
age.grid <- seq(from = ageMinMax[1], to = ageMinMax[2])
# Produce a prediction fit
splines.bs1.fit <- lm(wage ~ bs(age, df = 6, intercept = FALSE),
    data = Wage)
splines.bs1.pred <- predict(splines.bs1.fit, newdata = list(age = age.grid),
    se.fit = TRUE)
se.bands <- cbind(splines.bs1.pred$fit + 2 * splines.bs1.pred$se.fit,
    splines.bs1.pred$fit - 2 * splines.bs1.pred$se.fit)
and a corresponding plot of the Wage ∼ Age dependence
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(Wage$age, Wage$wage, col = "grey", xlab = "Age", ylab = "Wage",
    cex = 0.5, pch = 8)
lines(age.grid, splines.bs1.pred$fit, col = "blue", lwd = 2)
matlines(age.grid, se.bands, lty = "dashed", col = "blue")
title(main = "Wage vs employee Age \nRegression and Natural Splines Fit [Wage{ISLR}]",
    outer = TRUE)
In order to fit a natural spline instead, that is, a regression spline with linearity constraints at the boundaries, we make use of the ns() function. All these results, the Wage ∼ Age data points as well as the two prediction fits, that of the cubic spline (blue line) and that of the natural cubic spline (red line), are shown in Figure 2.
basis.ns <- ns(Wage$age, df = 6, intercept = TRUE)
splines.ns.fit <- lm(wage ~ ns(age, df = 6), data = Wage)
splines.ns.pred <- predict(splines.ns.fit, newdata = list(age = age.grid),
    se.fit = TRUE)
se.ns.bands <- cbind(splines.ns.pred$fit - 2 * splines.ns.pred$se.fit,
    splines.ns.pred$fit + 2 * splines.ns.pred$se.fit)
Figure 2: Cubic spline fit for Employees Wage vs their Age with 3 knots and 6 dofs (blue
lines). A natural spline fit with 6 dofs and an intercept is also depicted (red lines).
lines(age.grid, splines.ns.pred$fit, col = "red", lwd = 2)
matlines(age.grid, se.ns.bands, lty = "dashed", col = "red")
legend("topright", inset = 0.05,
    legend = c("Cubic Spline", "Natural Cubic Spline"),
    col = c("blue", "red"), lty = 1, lwd = c(2, 2))
2.2 Smoothing Splines
Here, we make a smoothing spline fit for the wage ∼ age dependence of the employees’
Wage data set. To do so we utilize the smooth.spline{stats}() function as shown
below
# Smooth Spline with 16 effective dofs
sspline.fit <- smooth.spline(Wage$age, Wage$wage, df = 16)
# Smoothing spline where the software chooses the effective dofs;
# with cv = FALSE this is done by generalized cross-validation (GCV)
sspline.fit2 <- smooth.spline(Wage$age, Wage$wage, cv = FALSE,
    df.offset = 1)
# effective dofs
sspline.fit2$df
## [1] 6.467555
which can be plotted by running the code below (Figure 3).
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(age, wage, xlim = ageMinMax, cex = 0.5, col = "darkgrey",
    xlab = "Age", ylab = "Wage", pch = 8)
lines(sspline.fit, col = "red", lwd = 2)
lines(sspline.fit2, col = "blue", lwd = 1)
title(main = "Wage vs employee Age \n Smoothing Spline Fit [Wage{ISLR}]",
    outer = TRUE)
legend("topright", inset = 0.05, legend = c("16 dofs", "6.47 dofs"),
    col = c("red", "blue"), lty = 1, lwd = c(2, 1), cex = 0.8)
Figure 3: Smoothing spline fits for employees’ Wage vs their Age. One with 16 pre-configured effective dofs (red line) and the other with 6.47 effective dofs as determined by generalized cross-validation (blue line).
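Note that smooth.spline() uses GCV when cv = FALSE; ordinary leave-one-out cross-validation can be requested with cv = TRUE. A minimal sketch (an addition to the lab; a warning about non-unique age values is to be expected for this data set):

# Ordinary leave-one-out CV instead of GCV
sspline.fit3 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
sspline.fit3$df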
3 Local Regression
Here, as an alternative way to produce a non-linear fit, we perform local regression by making use of the loess{stats}() function.
# Local Regression with each neighborhood spanning 20% of the
# observations
loess.fit <- loess(wage ~ age, span = 0.2, data = Wage)
# Local Regression with each neighborhood spanning 50% of
# the observations
loess.fit2 <- loess(wage ~ age, span = 0.5, data = Wage)
and produce the corresponding plot by executing the code below (Figure 4).
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(age, wage, xlim = ageMinMax, cex = 0.5, col = "darkgrey",
    pch = 8, xlab = "Age", ylab = "Wage")
lines(age.grid, predict(loess.fit, data.frame(age = age.grid)),
    col = "blue", lwd = 1)
lines(age.grid, predict(loess.fit2, data.frame(age = age.grid)),
    col = "red", lwd = 1)
title(main = "Wage vs employee Age \n Local Regression Fit [Wage{ISLR}]",
    outer = TRUE)
legend("topright", inset = 0.05, legend = c("Span 20%", "Span 50%"),
    col = c("blue", "red"), lty = 1, lwd = c(1, 1), cex = 0.8)
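Besides the span, loess() also lets one lower the degree of the locally fitted polynomial (the default is degree = 2). A minimal sketch, not part of the original lab:

# Local linear regression (degree = 1) with a 50% span, for comparison
loess.fit3 <- loess(wage ~ age, span = 0.5, degree = 1, data = Wage)
head(predict(loess.fit3, data.frame(age = age.grid)))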
4 Generalized Additive Models (GAMs)
As a generalization of the previously studied models, we now discuss additive models that allow a more flexible choice of fitting method for each of the variables we are going to use as predictors. This class of models is the so-called Generalized Additive Models (GAMs), which have the general form
GAMs : yi = β0 + Σ_{j=1}^{p} fj(xij) + εi . (3)
As a first example we examine the fit
Figure 4: Local regression fits for employees’ Wage vs their Age. One with each neighborhood pre-configured to span 20% of the observations (blue line) and the other with a 50% span (red line).
library(splines)
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education,
    data = Wage, subset = train)
However, if we want to use smoothing splines or other components that cannot be expressed in terms of basis functions, we have to use a more general GAM fitting machinery, even if the model is additive. To do so we use the mgcv library, which was introduced in [Wood, 2006] and is provided here by the Oracle R distribution∗.
The s() function, which is part of the mgcv library, is used to call smoothing spline fits.
∗ Alternatively, one can use Trevor Hastie’s original library for that purpose, gam [Hastie and Tibshirani, 1990]. However, we find mgcv much more complete for building GAM models and have chosen this package for our calculations.
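For comparison, the same kind of additive fit can be expressed with the gam package mentioned in the footnote, where s(x, df) requests a smoothing spline with a fixed number of degrees of freedom. A sketch, not run in this document (note that loading gam and mgcv in the same session makes each mask the other’s gam() and s() functions):

# Requires the 'gam' package; s() here is gam::s, not mgcv::s
library(gam)
gam.hastie <- gam(wage ~ s(year, 4) + s(age, 5) + education,
    data = Wage, subset = train)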
To repeat the previous fit, but now with smoothing spline terms in mgcv, we execute the following R code.
library(mgcv)
gam.m3 <- gam(wage ~ s(year, k = 5) + s(age, k = 6) + education,
family = gaussian(), data = Wage, subset = train)
Here, we specify basis dimensions k = 5 and k = 6, which allow the smooth functions of year and age at most 4 and 5 effective degrees of freedom, respectively; the penalized fit then chooses the effective dofs within these limits. Since education is a categorical variable, we leave it as is, and it is converted by the function into four dummy variables. The fitted model can be plotted as shown below.
par(mfrow = c(1, 3), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(gam.m3, se = TRUE, col = "blue")
plot(education[train], gam.m3$y)
title("Smoothing Splines Fit [mgcv]\nWage ~ s(year, k = 5) + s(age, k = 6) + education",
    outer = TRUE)
Note that in the first panel of Figure 5 the function of year looks rather linear. We can perform a series of ANOVA tests in order to determine which of these three models is best: a GAM that excludes year (M1), a GAM that uses a linear function of year (M2), or a GAM that uses a spline function of year (M3), like the one built above. Note that in all these models we include the education variable, which seems to be a good choice according to the short discussion at the end of Section 1.1.
gam.m1 <- gam(wage ~ s(age, k = 6) + education, family = gaussian(),
    data = Wage, subset = train)
gam.m2 <- gam(wage ~ year + s(age, k = 6) + education, family = gaussian(),
    data = Wage, subset = train)
Figure 5: GAM fitted model using smoothing splines through mgcv library.
anova(gam.m1, gam.m2, gam.m3, test = "F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, k = 6) + education
## Model 2: wage ~ year + s(age, k = 6) + education
## Model 3: wage ~ s(year, k = 5) + s(age, k = 6) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 1489.0 1812973
## 2 1488.0 1804686 1.00039 8287.4 6.8395 0.008999 **
## 3 1487.2 1801351 0.77868 3334.8 3.5358 0.069726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that there is compelling evidence that a GAM with a linear function of year is
better than a GAM that does not include year at all (F= 6.8395, p-value=0.00899).
However, there is no strong evidence that a non-linear function of year is actually re-
quired (F=3.5358, p-value=0.069726). So, based on the results of this ANOVA test,
the M2 model is preferred. Indeed, a closer look at the summary of the last fitted model, gam.m3,
summary(gam.m3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(year, k = 5) + s(age, k = 6) + education
##
## Parametric coefficients:
## Estimate Std. Error t value
## (Intercept) 86.395 2.845 30.373
## education2. HS Grad 9.289 3.266 2.844
## education3. Some College 23.054 3.450 6.682
## education4. College Grad 37.938 3.404 11.144
## education5. Advanced Degree 61.579 3.738 16.473
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## education2. HS Grad 0.00452 **
## education3. Some College 3.31e-11 ***
## education4. College Grad < 2e-16 ***
## education5. Advanced Degree < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(year) 1.760 2.171 3.908 0.0179 *
## s(age) 3.018 3.658 29.208 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.288 Deviance explained = 29.2%
## GCV score = 1219.2 Scale est. = 1211.2 n = 1497
reveals that the smooth term in year has a small estimated effective dof (edf ≈ 1.76) and a borderline F-statistic (F = 3.908, p-value = 0.0179), so a roughly linear function in year appears adequate, whereas for the age variable a clearly non-linear function is required (edf ≈ 3.02, F = 29.208, p-value < 2e-16). Note that in mgcv these p-values correspond to the null hypothesis that the particular smooth term is zero (i.e. that it can be dropped from the model), rather than to a direct test of a linear versus a non-linear relationship.
Of course, we can make predictions as before using a test part of the Wage data set and draw a safer conclusion by comparing the mean squared errors (MSEs) of the two models gam.m2 and gam.m3.
gam.m2.pred <- predict(gam.m2, newdata = Wage[test, ])
gam.m3.pred <- predict(gam.m3, newdata = Wage[test, ])
mean((Wage[test, ]$wage - gam.m2.pred)^2)
## [1] 1275.063
mean((Wage[test, ]$wage - gam.m3.pred)^2)
## [1] 1276.13
Again, the gam.m2 model is found to be a (marginally) better fit.
References
[Hastie and Tibshirani, 1990] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall/CRC.
[James et al., 2013] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An
Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics).
Springer, 1st ed. 2013. corr. 4th printing 2014 edition.
[Wood, 2006] Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chap-
man and Hall/CRC.