Linear Regression: Comparison of Gradient Descent and Normal Equations
Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
The following study explores linear regression on
two data sources, comparing the Gradient Descent (GD) and
Normal Equations techniques and adjusting the parameters to
avoid overfitting. The first dataset, called FRIED,
was proposed in Friedman [1] and Breiman [2]; it com-
prises 40,768 cases and 10 attributes (0 nominal, 10 continu-
ous). The second dataset, called ABALONE, comprises
4,177 cases and 8 attributes (1 categorical, 7 continuous).
2. Activities
There is a long-standing discussion of the difference in perfor-
mance between batch Gradient Descent and on-line GD in
Wilson and Martinez [3]. To overcome some of its dis-
advantages, such as slow convergence near the minimum,
one of the state-of-the-art approaches found in the literature for
solving linear regression problems is mini-batch Gradient
Descent. Another advanced algorithm is stochastic GD; one
discussion is available in Gardner [4]. Last but not least,
least mean squares is a class of adaptive filter described in
the book by Widrow and Stearns [5].
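The mini-batch variant mentioned above can be sketched as follows. This is an illustrative NumPy implementation with our own function and parameter names, not code from the cited works or from this study (which used Octave):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, epochs=100, batch_size=32, seed=0):
    """Mini-batch gradient descent for linear regression (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                 # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]    # indices of the current mini-batch
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= alpha * grad
    return theta
```

Each update uses only a small batch of examples, which is the compromise between batch GD (one update per full pass) and on-line GD (one update per example) discussed above.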
3. Proposed Solutions
Two algorithms were implemented to deal with the prob-
lem. Both were developed in the Octave language,
applying vectorized implementations whenever pos-
sible to achieve the best performance. As the first
approach, Gradient Descent with regularization (to avoid
overfitting) was implemented, using the following formula to
measure the cost function:
J(θ) = (1/2m) [ Σ_{i=1}^{m} ((X[i]·θ) − y[i])² + λ Σ_{j=1}^{n} θ[j]² ]    (1)
∗ Is with the Institute of Computing, University of Campinas (Unicamp). Contact: paulo.faria@gmail.com
† Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br

The gradient for θ[0] was updated without the regularization term, while the other θ indexes were updated according to the formula below:
θ[j] := θ[j] − (α/m) [ Σ_{i=1}^{m} ((X[i]·θ) − y[i]) · X[i][j] + λ·θ[j] ]    (2)
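The cost and update rule above can be sketched in NumPy as follows. This is our own transcription (the study's code is in Octave), assuming the standard 1/m scaling and leaving the intercept θ[0] unregularized, as the text describes:

```python
import numpy as np

def gd_regularized(X, y, alpha, lam, iters):
    """Batch gradient descent with L2 regularization; theta[0] is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        err = X @ theta - y                # residuals X[i]*theta - y[i]
        grad = X.T @ err / m               # unregularized gradient
        grad[1:] += (lam / m) * theta[1:]  # penalty term skips the intercept
        theta -= alpha * grad
    return theta

def cost(X, y, theta, lam):
    """Regularized cost J(theta) as in Eq. (1)."""
    m = X.shape[0]
    err = X @ theta - y
    return (err @ err + lam * theta[1:] @ theta[1:]) / (2 * m)
```

The first column of X is assumed to be the all-ones intercept column, which is why the penalty is applied only from index 1 onward.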
As a second approach, the Normal Equations algorithm
was implemented to compute the closed-form solution for
linear regression with the following code:
function [theta] = normalEqn(X, y)
  % Closed-form least-squares solution via the pseudo-inverse
  theta = pinv(X' * X) * X' * y;
end
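An equivalent computation in NumPy (a sketch mirroring the Octave function above, for readers without Octave):

```python
import numpy as np

def normal_eqn(X, y):
    """Closed-form least-squares solution: theta = pinv(X'X) X'y."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```

On noiseless data generated from a linear model, this recovers the true coefficients exactly (up to floating-point error), which is the "exactness" property discussed later when comparing against Gradient Descent.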
In order to select the best model, some code was developed
to plot the learning curve (error versus number of instances
used by the algorithm) and the validation curve (error over λ).
With these graphs, we can select the best GD parameters,
such as the learning rate (α) and λ (the regularization parameter).
The lambda values tested were: 0, 0.001, 0.003, 0.01, 0.03,
0.1, 0.3, 1, 3, 10.
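The λ grid search behind the validation curve can be sketched as follows. The helper names are our own, and for brevity the fit step below uses a regularized normal equation rather than the study's Gradient Descent:

```python
import numpy as np

LAMBDAS = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

def fit_ridge(X, y, lam):
    """Regularized normal equation; the intercept column (index 0) is not penalized."""
    n = X.shape[1]
    L = lam * np.eye(n)
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + L, X.T @ y)

def mse(X, y, theta):
    err = X @ theta - y
    return err @ err / (2 * len(y))

def best_lambda(X_tr, y_tr, X_cv, y_cv):
    """Pick the lambda with the lowest cross-validation error (validation curve)."""
    errors = [mse(X_cv, y_cv, fit_ridge(X_tr, y_tr, lam)) for lam in LAMBDAS]
    return LAMBDAS[int(np.argmin(errors))]
```

Plotting `errors` against `LAMBDAS` for both training and cross-validation sets gives the validation curve described above.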
If the algorithm presented a biased behaviour, a simple
Octave script was developed to add several polynomials of
the input data (x², x³, and so on), plot a graph to inspect
the residuals, and add the multiplication of each pair of
variables (nchoosek gives the combinations of n elements
taken k at a time), as the snippet below:
for z = 1:m
  for i = 1:k                    % k input features
    for j = 1:p                  % p polynomial degrees
      column = ((i - 1) * p) + j;
      X_poly(z, column) = X(z, i) ^ j;
    end
  end
end
c = nchoosek(1:k, 2);            % pairwise feature combinations
for z = 1:m
  for j = 1:size(c, 1)
    column = (k * p) + j;
    X_poly(z, column) = X(z, c(j, 1)) * X(z, c(j, 2));
  end
end
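The same feature expansion can be sketched in vectorized NumPy (our own transcription of the loops above):

```python
import numpy as np
from itertools import combinations

def expand_features(X, p):
    """Degrees 1..p of each column, then pairwise products, as in the Octave snippet."""
    m, k = X.shape
    powers = [X[:, i] ** j for i in range(k) for j in range(1, p + 1)]
    products = [X[:, a] * X[:, b] for a, b in combinations(range(k), 2)]
    return np.column_stack(powers + products)
```

For k features and degree p, the result has k*p power columns followed by k-choose-2 product columns, matching the column layout of the Octave code.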
4. Experiments and Discussion
We will describe each dataset experiment in a different
sub-section.
4.1. FRIED dataset
Firstly, the FRIED dataset provided has two pieces: a training
set (87.5 percent, or 35,000 cases) and a test set (12.5
percent, or 5,000 instances). Running GD with a sim-
ple polynomial (degree 1 for all input variables) found the
following θ:
        intercept  x1       x2       x3      x4       x5       x6     x7     x8    x9     x10
Theta   12964.21   1412.12  1472.49  -19.51  2193.17  1081.09  -9.42  36.58  7.3   12.13  -43.92
Table 1. Theta found for Gradient Descent in FRIED
The cost was monitored for two scenarios below:
Cost                    Parameters                                      J
Without Regularization  alpha=0.03, nbr. iterations=400                 3.45
With Regularization     alpha=0.03, nbr. iterations=400, lambda=0.03    115.94
Table 2. Cost with and without regularization for FRIED
As can be seen from the table of costs, regulariza-
tion did not improve the result of Gradient Descent for
FRIED; in fact, it was considerably worse. The follow-
ing λ values were tried: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3,
and 10.
The cost function vs. number of iterations for GD without
regularization is displayed below:
Figure 1. GD cost function without regularization for FRIED.
Looking at the error, it can be noticed that it was very
large even for a reasonable number of iterations. So, we used
the R lm function to inspect the residuals and the R-squared
coefficient, to find linear correlations between the variables
and the output. The following table summarizes the findings:
lm formula                           Resid. std error  R-squared
y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10   5533              0.2507
Table 3. R lm result for FRIED
4.1.1 Adding multiplication and higher-degree polynomials
As the error was still high, we tried to increase model com-
plexity, in this case by adding new polynomial degrees and
pairwise multiplications of variables. The training error
is plotted in the following graph:
Figure 2. Increasing model complexity for FRIED.
The "best" model found was a 5th-degree poly-
nomial. The training-set error was J=1.39 with
lambda=0, and the cross-validation error was 18.49. But this leads to an
overfitted model, since the cross-validation error increases, so we can
stay with the model of degree 1 or 3. After that, some lambda
values were plotted for degree 5. The best value found
was lambda=0 (although all values are very similar for train-
ing and cross-validation error, so lambda does not seem to
have much influence).
4.2. ABALONE dataset
The ABALONE dataset provided has two pieces: a training set (84
percent, or 3,500 cases) and a test set (16 percent, or 677
instances). As linear regression requires continuous input
variables, the attribute sex, which is categorical, was
converted into 3 different columns (one for each possible
value: "M", "F", and "I"), each valued 1.0 if the cat-
egory matches and 0.0 otherwise. By this procedure, the
variable s was split into 3 numerical features called s_M,
s_F, and s_I, corresponding to the 3 possible values of s.
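The conversion just described can be sketched in Python (an illustrative transcription; the study's own code is in Octave):

```python
import numpy as np

def one_hot_sex(sex):
    """Map each 'M'/'F'/'I' label to three 0/1 indicator columns (s_M, s_F, s_I)."""
    categories = ["M", "F", "I"]
    return np.array([[1.0 if s == c else 0.0 for c in categories] for s in sex])
```

The three indicator columns replace the single categorical column in the design matrix before running Gradient Descent or the Normal Equations.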
        intercept  s_M   s_F   s_I    l     d     h     ww    shw   vw    sw
Theta   10.02      0.28  0.46  -0.69  0.07  0.28  0.35  0.38  0.09  0.28  0.44
Table 4. Theta found for Gradient Descent in ABALONE
Again, the associated cost was calculated as shown be-
low:
Cost                    Parameters                                      J
Without Regularization  alpha=0.03, nbr. iterations=400                 4.1
With Regularization     alpha=0.03, nbr. iterations=400, lambda=0.03    55.75
Table 5. Cost with and without regularization for ABALONE
Running GD without regularization gave the following
cost vs. number of iterations graph:
Figure 3. GD function cost without regularization for ABALONE.
Using the R lm function to inspect the residuals and the
R-squared coefficient, to find linear correlations between the
variables and the output, gave the following result:
lm formula                           Resid. std error  R-squared
y ~ s_M+s_F+s_I+l+d+h+ww+shw+vw+sw   2865              0.2641
Table 6. R lm result for ABALONE
The last graph suggests that the best lambda value found
for this regression was 0.03.
The learning curve (error versus number of training
examples) is given below:
Figure 4. Learning curve (error versus nbr. training examples) for
ABALONE.
The curve indicates that with 100 examples the error
already had an acceptable variance (because J(train)
and J(test) are very close to each other).
4.2.1 Adding multiplication and higher-degree polynomials
To increase model complexity, the feature-generation code
was used to create new variables, that is, new polynomial
degrees and pairwise multiplications of variables. The training
error is plotted in the following graph:
Figure 5. Increasing model complexity for ABALONE.
The "best" model found was a 4th-degree polyno-
mial. The error found was J=3.64 with lambda=0. After
that, some lambda values were plotted for degree 4. The
best value found was lambda=0.001 (although all values are
very similar).
Figure 6. Lambda values for polynomial degree=4 for ABALONE.
4.3. Gradient Descent (GD) vs. Normal Equations (NE)
The machine used to run the experiments has a 2-core Intel
processor at 2.26 GHz with 4 GB of RAM. The next table summarizes
performance on the ABALONE and FRIED datasets for both the Gra-
dient Descent and Normal Equations algorithms.
The θ found for the FRIED and ABALONE datasets are
shown below:
Algorithm         Parameters                                      nbr training instances  numerical attributes  Datasource  System CPU time (seconds)
Gradient Descent  alpha=0.03, nbr. iterations=400, lambda=0.03    35000                   10                    FRIED       20.95
Normal Equations  -                                               35000                   10                    FRIED       100.34
Gradient Descent  alpha=0.03, nbr. iterations=400, lambda=0.03    3500                    10                    ABALONE     1.23
Normal Equations  -                                               3500                    10                    ABALONE     1.12
        intercept  x1    x2    x3     x4    x5    x6     x7    x8    x9    x10
Theta   4101.59    4.54  4.72  -0.06  7.03  3.46  -0.03  0.12  0.02  0.03  -0.14
Table 7. Theta found for Normal Equations in FRIED
        intercept  s_M  s_F    s_I    l      d     h     ww    shw   vw    sw
Theta   0          0    78.51  73.55  27.08  0.02  0.05  0.16  0.03  0.05  0.18
Table 8. Theta found for Normal Equations in ABALONE
Theoretically, Normal Equations runs in O(n³), so it is
suitable for cases where n is not huge. For these two ex-
amples, Gradient Descent ran much faster on FRIED (at
least 5 times), while both algorithms took almost the same
system time on ABALONE.
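A rough way to reproduce such a comparison is sketched below, using synthetic data; the function is our own and the resulting timings depend entirely on the machine and data sizes used:

```python
import time
import numpy as np

def time_solvers(m, n, iters=400, alpha=0.01, seed=0):
    """Compare wall-clock time of batch GD vs. the normal equation on random data."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
    y = X @ rng.normal(size=n)

    t0 = time.perf_counter()
    theta = np.zeros(n)
    for _ in range(iters):                      # fixed-iteration gradient descent
        theta -= alpha * X.T @ (X @ theta - y) / m
    t_gd = time.perf_counter() - t0

    t0 = time.perf_counter()
    theta_ne = np.linalg.pinv(X.T @ X) @ X.T @ y  # closed-form solution
    t_ne = time.perf_counter() - t0
    return t_gd, t_ne
```

Varying `m` and `n` in such a harness makes it possible to see where the closed-form solution stops being competitive.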
5. Conclusions and Future Work
Concerning the FRIED dataset, very small values of alpha
(0.001 or 0.003) did not work, probably because the resulting
gradient steps made the algorithm stall far from the global
minimum. The values provided in the study were the ones
that kept the algorithm from getting stuck and found a
reasonable cost. FRIED's "best"-cost model was found with
a polynomial of degree 5 (cost=1.39, lambda=0, alpha=0.03),
but it overfits, since the test set gave an error of 18.49.
Hence, the recommended model is degree 1 (the minimum
error on the test set, giving an error of 12.02).
Regarding the ABALONE dataset, it was easier to study
from different perspectives, and these tools were of great
help in assuring quality. Firstly, cost over number of iterations
provided a way to see that this dataset took fewer iterations to
converge. Secondly, learning curves (error versus number
of instances) made it easy to see that the error was not so
high, hence avoiding bias, and that the convergence guarantees
the variance is not so high, avoiding overfitting. Finally,
validation curves (training and cross-validation/test error
versus lambda) gave a pragmatic way to select the "best"
lambda. The "best" model found from a cost perspective was
a polynomial of degree 4, also using variable multiplication,
with lambda=0.001 and alpha=0.03 (training cost of 3.64).
Strangely, in both cases regularization did not improve
the result; this seems to be a bug in the implementation, as
the behaviour has no grounded theoretical explanation, at least for
ABALONE. We will further study what went wrong with the Octave
implementation. For FRIED there could be some explanation
based on the fact that the dataset is generated artificially,
with a scattered set of points in the dimensional space.
As the datasets are not huge, from a performance per-
spective Normal Equations gave the response in a reason-
able time, justifying its application due to the exactness of the
results. It is easier to apply than Gradient Descent because it
does not require tuning several parameters or making use of
cross-validation to avoid bias or overfitting.
However, as theory predicts that Normal Equations
runs in O(n³), it starts to degrade very quickly: as we can
see, a 10-fold increase in dataset size led to roughly a
100-fold increase in system CPU time.
That is why, as future work, we recommend using a
larger dataset (at least 10-100 times bigger) to compare the
performance results and effectively demonstrate the usefulness
of Gradient Descent in such large-dataset cases.
References
[1] J. Friedman. Multivariate adaptive regression splines.
Annals of Statistics, 19:1–141, 1991.
[2] L. Breiman. Bagging predictors. Machine Learning,
24:123–140, 1996.
[3] D. R. Wilson and T. R. Martinez. The general ineffi-
ciency of batch training for gradient descent learning. Neural
Networks, 16:1429–1451, 2003.
[4] W. A. Gardner. Learning characteristics of stochastic-gradient-
descent algorithms: A general study, analysis, and critique.
Signal Processing, 6(2):113–133, 1984.
[5] B. Widrow and S. D. Stearns. Adaptive Signal Process-
ing. Prentice Hall, 1985.