Linear Regression: comparison of Gradient Descent and Normal Equations
Paulo Renato de Faria (Institute of Computing, University of Campinas (Unicamp); contact: paulo.faria@gmail.com)
Anderson Rocha (Institute of Computing, University of Campinas (Unicamp); contact: anderson.rocha@ic.unicamp.br)
1. Introduction
This study explores linear regression on two data sources, comparing the Gradient Descent (GD) and Normal Equations techniques and adjusting their parameters to avoid overfitting. The first dataset, called FRIED, was proposed in Friedman [1] and Breiman [2]; it comprises 40,768 cases and 10 attributes (0 nominal, 10 continuous). The second dataset, called ABALONE, comprises 4,177 cases and 8 attributes (1 categorical, 7 continuous).
2. Activities
There is a large discussion about the performance difference between batch Gradient Descent and on-line GD in Wilson and Martinez [3]. To overcome some of its disadvantages, such as slow convergence near the minimum, one of the state-of-the-art approaches found in the literature for solving linear regression problems is mini-batch Gradient Descent. Another advanced algorithm is stochastic GD; a discussion is available in Gardner [4]. Last but not least, least mean squares (LMS) is a class of adaptive filter described in the book by Widrow and Stearns [5].
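As an illustration only (this variant was not part of our implementation), a minimal Octave sketch of mini-batch GD for linear regression is given below; the batch size b and the assumption that X already carries a leading column of ones are ours:

% Illustrative mini-batch gradient descent for linear regression (not used in the experiments).
% X: m x (n+1) design matrix with a leading column of ones; b: mini-batch size.
function theta = minibatchGD(X, y, alpha, b, num_epochs)
  [m, n] = size(X);
  theta = zeros(n, 1);
  for epoch = 1:num_epochs
    idx = randperm(m);                          % shuffle the examples once per epoch
    for first = 1:b:m
      batch = idx(first:min(first + b - 1, m));
      Xb = X(batch, :);
      yb = y(batch);
      grad = (Xb' * (Xb * theta - yb)) / numel(batch);
      theta = theta - alpha * grad;             % one update per mini-batch
    end
  end
end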
3. Proposed Solutions
Two algorithms were implemented to deal with the problem. Both were developed in the Octave language, applying vectorized implementations whenever possible to achieve the best performance. In the first approach, Gradient Descent with regularization (to avoid overfitting) was implemented, using the following formula to measure the cost function:
J(θ) = (1/(2m)) [ Σ_{i=1}^{m} ((X[i] · θ) − y[i])² + λ Σ_{j=1}^{n} θ[j]² ]    (1)
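A minimal vectorized Octave sketch of this cost, under the assumption that X carries a leading column of ones and that theta(1) is the intercept (and therefore excluded from the penalty):

% Regularized linear-regression cost, as in Eq. (1).
% X: m x (n+1) design matrix with a leading column of ones; y: m x 1 targets.
function J = computeCostReg(X, y, theta, lambda)
  m = length(y);
  errors = X * theta - y;                          % (X[i] . theta) - y[i] for every example
  J = (errors' * errors) / (2 * m) ...             % squared-error term
      + (lambda / (2 * m)) * sum(theta(2:end).^2); % penalty, intercept excluded
end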
The gradient for θ[0] (the intercept) was updated without the regularization term, while the other θ indexes were updated according to the formula below:
θ[j] := θ[j] − α [ (1/m) Σ_{i=1}^{m} ((X[i] · θ) − y[i]) · X[i][j] + (λ/m) θ[j] ]    (2)
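A sketch of one such update step in vectorized Octave, again assuming theta(1) is the intercept; this is an illustration consistent with Eq. (2), not a transcript of our script:

% One regularized gradient-descent step for all theta entries at once.
function theta = gradientStepReg(X, y, theta, alpha, lambda)
  m = length(y);
  grad = (X' * (X * theta - y)) / m;   % unregularized gradient, all j at once
  reg = (lambda / m) * theta;
  reg(1) = 0;                          % theta(1), the intercept, is not regularized
  theta = theta - alpha * (grad + reg);
end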
As a second approach, the Normal Equations algorithm
was implemented to compute the closed-form solution for
linear regression with the following code:
function [theta] = normalEqn(X, y)
  theta = zeros(size(X, 2), 1);
  theta = pinv(X' * X) * X' * y;
end
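Note that pinv is used instead of inv, so the solution stays defined even when X'X is singular (for instance, with linearly dependent features). Assuming the same design matrix with a leading column of ones used for GD, a call would look like theta = normalEqn([ones(m, 1) X_features], y), where X_features is a hypothetical name for the raw feature matrix.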
In order to select the best model, code was developed to plot the learning curve (error versus the number of instances used by the algorithm) and the validation curve (error versus λ). With these graphs, we can select the best GD parameters, such as the learning rate (α) and the regularization parameter (λ). The λ values tested were: 0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10.
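A sketch of how the validation curve can be produced; trainGD is a hypothetical name for our GD training routine (arguments: alpha, λ, number of iterations), computeCostReg is the cost sketch above evaluated with λ = 0 so the reported error excludes the penalty, and X_train/y_train and X_val/y_val stand for the training and cross-validation splits:

% Validation curve: training and cross-validation error for each lambda.
lambdas = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10];
err_train = zeros(size(lambdas));
err_val = zeros(size(lambdas));
for i = 1:numel(lambdas)
  theta = trainGD(X_train, y_train, 0.03, lambdas(i), 400);   % alpha = 0.03, 400 iterations
  err_train(i) = computeCostReg(X_train, y_train, theta, 0);
  err_val(i) = computeCostReg(X_val, y_val, theta, 0);
end
plot(lambdas, err_train, lambdas, err_val);
legend('train', 'cross validation'); xlabel('lambda'); ylabel('error');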
If the model presented a biased behaviour, a simple script was developed to add higher-degree terms of the input variables (x², x³, and so on) and the products of pairs of different variables (nchoosek gives the combinations of the k features taken 2 at a time), with the residuals then plotted in R, as in the snippet below:
% Add polynomial terms x^j (j = 1..p) for each of the k input features.
for z = 1:m
  for i = 1:k
    for j = 1:p                      % polynomial degree
      column = ((i - 1) * p) + j;    % p polynomial columns per feature
      X_poly(z, column) = X(z, i)^j;
    end
  end
end

c = nchoosek(1:k, 2);                % multiplication: all pairs of different features
for z = 1:m
  for j = 1:size(c, 1)
    column = (k * p) + j;            % pair columns start after the k*p polynomial columns
    X_poly(z, column) = X(z, c(j, 1)) * X(z, c(j, 2));
  end
end
4. Experiments and Discussion
Each dataset experiment is described in a separate sub-section.
4.1. FRIED dataset
Firstly, the FRIED dataset was provided in two pieces: a training set (87.5 percent, or 35,000 cases) and a test set (12.5 percent, or 5,000 instances). GD was first run with a simple polynomial (degree 1 for all input variables); the θ found was the following:
         intercept   x1       x2       x3      x4       x5       x6     x7     x8    x9     x10
Theta    12964.21    1412.12  1472.49  -19.51  2193.17  1081.09  -9.42  36.58  7.3   12.13  -43.92
Table 1. Theta found for Gradient Descent in FRIED
The cost was monitored for two scenarios below:
Cost                      Parameters                                           J
Without regularization    alpha = 0.03, nbr. iterations = 400                  3.45
With regularization       alpha = 0.03, nbr. iterations = 400, lambda = 0.03   115.94
Table 2. Cost with and without regularization for FRIED
As can be seen from the cost table, regularization did not improve the Gradient Descent result for FRIED; in fact, it was considerably worse. The following λ values were tried: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10.
The cost function versus the number of iterations for GD without regularization is shown below:
Figure 1. GD function cost without regularization for FRIED.
Looking at the error, it can be noticed that it remained very large even for a reasonable number of iterations. We therefore used the R lm function to inspect the residuals and the R-squared coefficient, looking for linear correlations between the variables and the output. The following table summarizes the findings:
lm formula                               Resid. std error   R-squared
y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10       5533               0.2507
Table 3. R lm result for FRIED
4.1.1 Adding multiplication and higher-degree polynomials
As the error was still high, we tried to increase model complexity, in this case by adding new polynomial degrees and products of variables taken two-by-two. The training error was plotted in the following graph:
Figure 2. Increasing model complexity for FRIED.
The "best" model found used a 5th-degree polynomial. The training-set error was J = 1.39 with lambda = 0, while the cross-validation error was 18.49. But this leads to an overfitted model, since the cross-validation error increases, so we can stay with the model of degree 1 or 3. After that, several lambda values were plotted for degree 5. The best value found was lambda = 0 (although all values give very similar training and cross-validation errors, so lambda does not seem to have much influence).
4.2. ABALONE dataset
The ABALONE dataset was provided in two pieces: a training set (84 percent, or 3,500 cases) and a test set (16 percent, or 677 instances). As linear regression requires continuous input variables, the categorical attribute sex was converted into 3 different columns (one for each possible value: "M", "F", and "I"), each set to 1.0 if the category matches and 0.0 otherwise. With this procedure, the variable s was split into 3 numerical features called s_M, s_F, and s_I, one for each of the 3 possible values of s.
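A minimal Octave sketch of this encoding, assuming the raw sex column was read into a cell array of strings named sex and that X_numeric (a name of ours) holds the 7 continuous attributes:

% One-hot encode the categorical attribute sex into s_M, s_F and s_I.
s_M = double(strcmp(sex, 'M'));     % 1.0 where sex is 'M', 0.0 otherwise
s_F = double(strcmp(sex, 'F'));
s_I = double(strcmp(sex, 'I'));
X = [s_M s_F s_I X_numeric];        % 3 indicator columns plus the 7 continuous attributes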
         intercept  s_M   s_F   s_I    l     d     h     ww    shw   vw    sw
Theta    10.02      0.28  0.46  -0.69  0.07  0.28  0.35  0.38  0.09  0.28  0.44
Table 4. Theta found for Gradient Descent in ABALONE
Again, the associated cost was calculated as shown below:
Cost                      Parameters                                           J
Without regularization    alpha = 0.03, nbr. iterations = 400                  4.1
With regularization       alpha = 0.03, nbr. iterations = 400, lambda = 0.03   55.75
Table 5. Cost with and without regularization for ABALONE
Running GD without regularization gives the following cost versus number of iterations graph:
Figure 3. GD function cost without regularization for ABALONE.
Using the R lm function to inspect the residuals and the R-squared coefficient, looking for linear correlations between the variables and the output, gave the following result:
lm formula                               Resid. std error   R-squared
y ~ s_M+s_F+s_I+l+d+h+ww+shw+vw+sw       2865               0.2641
Table 6. R lm result for ABALONE
The validation curve suggests that the best lambda value for this regression was 0.03.
The learning curve (error versus the number of training examples) is given below:
Figure 4. Learning curve (error versus nbr. training examples) for ABALONE.
The curve indicates that with 100 examples the variance was already acceptable, because J(train) and J(test) are very close to each other.
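A sketch of how such a learning curve can be generated by re-training on growing subsets of the training data (trainGD and computeCostReg as in the earlier sketches; the step of 100 examples and the lambda value are our choices):

% Learning curve: error as a function of the number of training examples.
lambda = 0.03;                               % regularization used during training
sizes = 100:100:size(X_train, 1);            % growing training-set sizes
err_train = zeros(size(sizes));
err_test = zeros(size(sizes));
for i = 1:numel(sizes)
  s = sizes(i);
  theta = trainGD(X_train(1:s, :), y_train(1:s), 0.03, lambda, 400);
  err_train(i) = computeCostReg(X_train(1:s, :), y_train(1:s), theta, 0);
  err_test(i) = computeCostReg(X_test, y_test, theta, 0);
end
plot(sizes, err_train, sizes, err_test);
legend('J(train)', 'J(test)'); xlabel('nbr. training examples'); ylabel('error');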
4.2.1 Adding multiplication and higher-degree polynomials
To increase model complexity, the code above was used to generate new variables, that is, new polynomial degrees and products of variables taken two-by-two. The training error was plotted in the following graph:
Figure 5. Increasing model complexity for ABALONE.
The "best" model found used a 4th-degree polynomial. The error found was J = 3.64 with lambda = 0. After that, several lambda values were plotted for degree 4. The best value found was lambda = 0.001 (although all values are very similar).
Figure 6. Lambda values for polynomial degree=4 for ABALONE.
4.3. Gradient Descent (GD) vs. Normal Equations (NE)
The machine used to run the experiments has a 2-core Intel processor at 2.26 GHz and 4 GB of RAM. The next table summarizes performance on the ABALONE and FRIED datasets for both the Gradient Descent and Normal Equations algorithms; the θ values found for FRIED and ABALONE are shown after it:
Algorithm           Parameters                                           Nbr. training instances   Numerical attributes   Datasource   System CPU time (seconds)
Gradient Descent    alpha = 0.03, nbr. iterations = 400, lambda = 0.03   35000                     10                     FRIED        20.95
Normal Equations    --                                                   35000                     10                     FRIED        100.34
Gradient Descent    alpha = 0.03, nbr. iterations = 400, lambda = 0.03   3500                      10                     ABALONE      1.23
Normal Equations    --                                                   3500                      10                     ABALONE      1.12
         intercept  x1    x2    x3     x4    x5    x6     x7    x8    x9    x10
Theta    4101.59    4.54  4.72  -0.06  7.03  3.46  -0.03  0.12  0.02  0.03  -0.14
Table 7. Theta found for Normal Equations in FRIED
         intercept  s_M  s_F    s_I    l      d     h     ww    shw   vw    sw
Theta    0          0    78.51  73.55  27.08  0.02  0.05  0.16  0.03  0.05  0.18
Table 8. Theta found for Normal Equations in ABALONE
Theoretically, the Normal Equations solution runs in O(n³) (for n features), so it is suitable for cases where n is not huge. For these two examples, Gradient Descent runs much faster on FRIED (at least 5 times faster), while both algorithms took almost the same system time on the ABALONE data source.
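The CPU times above can be collected in Octave as in the sketch below; the cputime calls and the trainGD wrapper are our assumptions about how the measurement can be taken, not a transcript of the benchmark script:

% Measure system CPU time for both solvers on the same training matrix.
t0 = cputime;
theta_gd = trainGD(X_train, y_train, 0.03, 0.03, 400);   % alpha = 0.03, lambda = 0.03, 400 iterations
t_gd = cputime - t0;

t0 = cputime;
theta_ne = normalEqn(X_train, y_train);
t_ne = cputime - t0;

printf('Gradient Descent: %.2f s, Normal Equations: %.2f s\n', t_gd, t_ne);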
5. Conclusions and Future Work
Concerning FRIED dataset, very small values of alpha
0.001 or 0.003 made it did not work, probably due to a gra-
dient that made the algorithm to stop in some local minima
or to miss the global minimum. The values provided in the
study were the ones that made the algorithm not to be stuck
and found a reasonable cost. FRIED ”best” cost model was
found with a polynomial of degree 5 (cost=1.39), lambda=,
alpha=0.03, but it becames biased because the test set gave
an error of 18.49. Hence, the recommended model was de-
gree 1 (the minimum error on test set giving an error of
12.02).
Regarding the ABALONE dataset, it was easier to study it from different perspectives, and these tools were of great help in assuring quality. Firstly, the cost over the number of iterations showed that this dataset took fewer iterations to converge. Secondly, the learning curves (error versus number of instances) made it easier to see that the error was not so high, hence avoiding bias, and that the convergence of J(train) and J(test) indicates the variance is not so high, avoiding overfitting. Finally, the validation curves (training and cross-validation/test error versus lambda) gave a pragmatic way to select the "best" lambda. The "best" model found from a cost perspective was a polynomial of degree 4, also using variable multiplication, with lambda = 0.001 and alpha = 0.03 (training cost of 3.64).
Strangely, in both cases regularization did not improve the result. This suggests a bug in the implementation, since the behaviour has no theoretical grounding, at least for ABALONE; we will further study what went wrong with the Octave implementation. For FRIED there may be some explanation in the fact that the dataset is generated artificially, with points scattered over the feature space.
As the datasets are not huge, from a performance perspective Normal Equations gave the answer in a reasonable time, which justifies its use given the exactness of the results. It is also easier to apply than Gradient Descent, because it does not require tuning several parameters or making use of cross-validation to avoid bias or overfitting.
However, as theory predicts that Normal Equations runs in O(n³), it starts to degrade very quickly: in our experiments, a 10-fold increase in dataset size led to an increase of close to 100 times in system CPU time.
That is why, as future work, we recommend using a larger dataset (at least 10 to 100 times larger) to compare the performance results and effectively demonstrate the advantage of Gradient Descent in these large-dataset cases.
References
[1] J. Friedman. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–141, 1991.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16:1429–1451, 2003.
[4] W. A. Gardner. Learning characteristics of stochastic-gradient-descent algorithms: A general study, analysis, and critique. Signal Processing, 6(2):113–133, 1984.
[5] B. Widrow and S. D. Stearns. Adaptive Signal Processing. Prentice Hall, 1985.
