Generalized Linear Regression with
Regularization
Zoya Byliskii
March 3, 2015
1 BASIC REGRESSION PROBLEM
Note: In the following notes I will make explicit what is a vector and what is a scalar using vector notation, to avoid confusion between variables. I will occasionally expand out the vector notation to make the linear algebra operations more explicit.
You are given n data points: d-dimensional feature vectors and corresponding real-valued labels $\{x^{(t)}, y^{(t)}\}$, $t = \{1,..,n\}$. Your goal is to find a linear combination of the features (feature vector elements/coordinates) that is best able to predict each label. In other words, you want to discover the parameter vector θ that allows you to make the most accurate predictions:

$$x^{(t)} \cdot \theta \approx y^{(t)}, \quad \forall t = \{1,..,n\} \quad (1.1)$$
Note: we are ignoring the offset for now, and solving a problem without a constant offset $\theta_0$.
In other words, we want to approximate the labels we have for our n data points by finding
the best-fitting θ. Expanding out the vector notation in eq. 1.1, we want:
$$x^{(1)}_1 \theta_1 + x^{(1)}_2 \theta_2 + \cdots + x^{(1)}_d \theta_d \approx y^{(1)}$$
$$x^{(2)}_1 \theta_1 + x^{(2)}_2 \theta_2 + \cdots + x^{(2)}_d \theta_d \approx y^{(2)}$$
$$\vdots$$
$$x^{(n)}_1 \theta_1 + x^{(n)}_2 \theta_2 + \cdots + x^{(n)}_d \theta_d \approx y^{(n)}$$
This is just a linear system of n equations in d unknowns. So, we can write this in matrix form:

$$\begin{bmatrix} \text{---}\ x^{(1)}\ \text{---} \\ \text{---}\ x^{(2)}\ \text{---} \\ \vdots \\ \text{---}\ x^{(n)}\ \text{---} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \approx \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \quad (1.2)$$
Or more simply as:
$$X\theta \approx y \quad (1.3)$$
Where X is our data matrix.
Note: the horizontal lines in the matrix help make explicit which way the vectors are stacked in the matrix. In this case, our feature vectors $x^{(t)}$ make up the rows of our matrix, and the individual features/coordinates are the columns.
Consider the dimensions of our system:
What happens if n = d?
X is a square matrix, and as long as all the data points (rows) of X are linearly independent, then X is invertible and we can solve for θ exactly:

$$\theta = X^{-1} y \quad (1.4)$$
Of course a θ that exactly fits all the training data is not likely to generalize to a test set (a novel set of data: new feature vectors $x^{(t)}$). Thus we might want to impose some regularization to avoid overfitting (will be discussed later in this document), or to subsample the data or perform some cross-validation (leave out part of the training data when solving for the parameter θ).
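As a quick illustration of this exact-fit case (a minimal sketch on synthetic data, not part of the original derivation), NumPy's solver can be used directly instead of forming $X^{-1}$:

    import numpy as np

    rng = np.random.default_rng(0)
    n = d = 5                          # as many data points as features
    X = rng.normal(size=(n, d))        # square data matrix; rows are feature vectors
    y = rng.normal(size=n)             # labels

    # With linearly independent rows, X is invertible and theta = X^{-1} y (eq. 1.4).
    theta = np.linalg.solve(X, y)      # solves X theta = y without explicitly inverting X
    print(np.allclose(X @ theta, y))   # True: the training data is fit exactly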
What happens if n < d?
In this case, we have fewer data points than feature dimensions - fewer equations than unknowns. Think: how many degree-3 polynomials can pass through 2 points? An infinite number. Similarly, in this case, we can have an infinite number of solutions. Our problem is underdetermined in this case. What can we do? Well, we can collect more data, we can replicate our existing data while adding some noise to it, or we can apply some methods of feature selection (a research area in machine learning) to select the features (columns of our data matrix) that carry the most weight for our problem. We have to regularize even more severely in this case, because there are infinitely many ways to overfit the training data.
What happens if n > d?
Now we have more data than features, so the more data we have, the less likely we are to overfit the training set. In this case, we will generally not be able to find an exact solution: a θ that satisfies eq. 1.1. Instead, we will look for a θ that is best in the least-squares sense. It turns out, as we will see later in this document, that this θ can be obtained by solving the modified system of equations:
$$X^T X \theta = X^T y \quad (1.5)$$
These are called the normal equations. Notice that $X^T X$ is a square $d \times d$, invertible matrix¹.
We can now solve for θ as follows:
$$\theta = (X^T X)^{-1} X^T y \quad (1.6)$$

Where $(X^T X)^{-1} X^T$ is often denoted as $X^+$ and is called the pseudoinverse of X.
¹An easy-to-follow proof is provided here: <https://www.khanacademy.org/math/linear-algebra/matrix_transformations/matrix_transpose/v/lin-alg-showing-that-a-transpose-x-a-is-invertible>
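As a rough numerical sketch of eq. 1.5 and eq. 1.6 (synthetic data, not from the original notes), the normal equations and the pseudoinverse give the same θ:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 5                      # n > d: more equations than unknowns
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    # Solve the normal equations X^T X theta = X^T y (eq. 1.5).
    theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent route via the pseudoinverse X^+ = (X^T X)^{-1} X^T (eq. 1.6).
    theta_pinv = np.linalg.pinv(X) @ y

    print(np.allclose(theta_normal, theta_pinv))   # True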
Let us now arrive at this solution by direct optimization of the least squares objective:
$$J(\theta) = \frac{1}{n}\sum_{t=1}^{n} \frac{1}{2}\left(y^{(t)} - \theta \cdot x^{(t)}\right)^2 \quad (1.7)$$
Notice that this is just a way of summarizing the constraints in eq. 1.1. We want $\theta \cdot x^{(t)}$ to come as close as possible to $y^{(t)}$ for all t, so we minimize the sum of squared errors.
Note: the $\frac{1}{2}$ term is only for notational convenience (for taking derivatives); it is just a constant scaling of the objective and does not change the minimizing θ. Also, the $\frac{1}{n}$ term is not important for this form of regression without regularization, as it will cancel out. In the form with regularization, it just rescales the regularization parameter by the number of data points n (we'll see this later in this document). In any case, you might see formulations of regression with or without this term, but this will not make a big difference to the general form of the problem.
We want to minimize J(θ), and so we set the gradient of this function to zero: $\nabla_\theta J(\theta) = 0$.
This is equivalent to solving the following system of linear equations:

$$\begin{bmatrix} \frac{\partial}{\partial\theta_1} J(\theta) \\ \frac{\partial}{\partial\theta_2} J(\theta) \\ \vdots \\ \frac{\partial}{\partial\theta_d} J(\theta) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (1.8)$$
Let's consider a single one of these equations: i.e. solve for a particular partial $\frac{\partial}{\partial\theta_i} J(\theta)$. For clarity, consider expanding the vector notation in eq. 1.7:

$$J(\theta) = \frac{1}{n}\sum_{t=1}^{n} \frac{1}{2}\left(y^{(t)} - \left(x^{(t)}_1\theta_1 + x^{(t)}_2\theta_2 + \cdots + x^{(t)}_d\theta_d\right)\right)^2 \quad (1.9)$$
Thus, by the chain rule:

$$\frac{\partial}{\partial\theta_i} J(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left(y^{(t)} - \left(x^{(t)}_1\theta_1 + x^{(t)}_2\theta_2 + \cdots + x^{(t)}_d\theta_d\right)\right)\left(-x^{(t)}_i\right) \quad (1.10)$$
$$= \frac{1}{n}\sum_{t=1}^{n}\left(y^{(t)} - {x^{(t)}}^T\theta\right)\left(-x^{(t)}_i\right) \quad (1.11)$$
Note: we just rewrote the dot product $\theta \cdot x^{(t)}$ in equivalent matrix form: ${x^{(t)}}^T\theta$, because we will be putting our whole system of equations in matrix form in the following calculations.
Since we want to set all the partials to 0, it follows that:

$$\frac{1}{n}\sum_{t=1}^{n}\left(y^{(t)} - {x^{(t)}}^T\theta\right)\left(-x^{(t)}_i\right) = 0 \quad (1.12)$$
$$\frac{1}{n}\sum_{t=1}^{n} x^{(t)}_i\, {x^{(t)}}^T\theta = \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_i\, y^{(t)} \quad (1.13)$$
So we can rewrite the system in eq. 1.8 as:

$$\begin{bmatrix} \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_1 {x^{(t)}}^T\theta \\ \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_2 {x^{(t)}}^T\theta \\ \vdots \\ \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_d {x^{(t)}}^T\theta \end{bmatrix} = \begin{bmatrix} \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_1 y^{(t)} \\ \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_2 y^{(t)} \\ \vdots \\ \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_d y^{(t)} \end{bmatrix} \quad (1.14)$$
Which is equivalent to the following condensed form:

$$\frac{1}{n}\sum_{t=1}^{n} x^{(t)} {x^{(t)}}^T\theta = \frac{1}{n}\sum_{t=1}^{n} x^{(t)} y^{(t)} \quad (1.15)$$
Dropping the $\frac{1}{n}$ term:

$$\sum_{t=1}^{n} x^{(t)} {x^{(t)}}^T\theta = \sum_{t=1}^{n} x^{(t)} y^{(t)} \quad (1.16)$$
In other words, this equation is what we obtain by setting $\nabla_\theta J(\theta) = 0$.
Let us write this equation out explicitly to see how we can rewrite it further in matrix form.
First, consider writing out the left-hand side of eq. 1.16:

$$\begin{bmatrix} x^{(1)}_1 \\ x^{(1)}_2 \\ \vdots \\ x^{(1)}_d \end{bmatrix}\begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_d \end{bmatrix}\theta + \begin{bmatrix} x^{(2)}_1 \\ x^{(2)}_2 \\ \vdots \\ x^{(2)}_d \end{bmatrix}\begin{bmatrix} x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_d \end{bmatrix}\theta + \cdots + \begin{bmatrix} x^{(n)}_1 \\ x^{(n)}_2 \\ \vdots \\ x^{(n)}_d \end{bmatrix}\begin{bmatrix} x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_d \end{bmatrix}\theta \quad (1.17)$$
Convince yourself that this is just:

$$\begin{bmatrix} \vert & \vert & & \vert \\ x^{(1)} & x^{(2)} & \cdots & x^{(n)} \\ \vert & \vert & & \vert \end{bmatrix}\begin{bmatrix} \text{---}\ x^{(1)}\ \text{---} \\ \text{---}\ x^{(2)}\ \text{---} \\ \vdots \\ \text{---}\ x^{(n)}\ \text{---} \end{bmatrix}\theta \quad (1.18)$$
These matrices should be familiar from before. We can rewrite eq. 1.18 as $X^T X\theta$.
Next, consider writing out the right-hand side of eq. 1.16:

$$\begin{bmatrix} x^{(1)}_1 \\ x^{(1)}_2 \\ \vdots \\ x^{(1)}_d \end{bmatrix} y^{(1)} + \begin{bmatrix} x^{(2)}_1 \\ x^{(2)}_2 \\ \vdots \\ x^{(2)}_d \end{bmatrix} y^{(2)} + \cdots + \begin{bmatrix} x^{(n)}_1 \\ x^{(n)}_2 \\ \vdots \\ x^{(n)}_d \end{bmatrix} y^{(n)} \quad (1.19)$$
Convince yourself that this is just:

$$\begin{bmatrix} \vert & \vert & & \vert \\ x^{(1)} & x^{(2)} & \cdots & x^{(n)} \\ \vert & \vert & & \vert \end{bmatrix} y \quad (1.20)$$

Which is nothing more than $X^T y$.
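A quick numerical check of these two rewritings (a sketch with random data, not in the original notes): the sums over data points in eq. 1.16 are exactly $X^T X$ and $X^T y$:

    import numpy as np

    rng = np.random.default_rng(5)
    n, d = 50, 3
    X = rng.normal(size=(n, d))        # rows are the feature vectors x^(t)
    y = rng.normal(size=n)

    # Matrix on the left-hand side of eq. 1.16: sum_t x^(t) x^(t)^T  (d x d).
    lhs = sum(np.outer(X[t], X[t]) for t in range(n))
    # Right-hand side of eq. 1.16: sum_t x^(t) y^(t)  (a d-vector).
    rhs = sum(X[t] * y[t] for t in range(n))

    print(np.allclose(lhs, X.T @ X))   # True (eq. 1.18)
    print(np.allclose(rhs, X.T @ y))   # True (eq. 1.20)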
Putting together eq. 1.18 and eq. 1.20, we get exactly eq. 1.5, and so:

$$\theta = (X^T X)^{-1} X^T y$$

as the least-squares solution of our regression problem (no offset, no regularization).
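As a sanity check of the derivation (a sketch on synthetic data, not part of the original notes), the closed-form θ makes the gradient from eq. 1.15 vanish and matches NumPy's built-in least-squares solver:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 200, 4
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)

    theta = np.linalg.solve(X.T @ X, X.T @ y)           # eq. 1.6

    # Gradient of J(theta) from eq. 1.15: (1/n) (X^T X theta - X^T y).
    grad = (X.T @ X @ theta - X.T @ y) / n
    print(np.allclose(grad, 0))                         # True (up to round-off)

    # Agrees with NumPy's least-squares routine.
    theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(theta, theta_lstsq))              # True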
2 REGRESSION WITH REGULARIZATION
If we overfit our training data, our predictions may not generalize very well to novel test data. For instance, if our training data is somewhat noisy (real measurements are always somewhat noisy!), we don't want to fit our training data perfectly - otherwise we might generate very bad predictions for the real signal. It is worse to overshoot than undershoot predictions when errors are measured by squared error. Being more guarded in how you predict, i.e., keeping θ small, helps reduce generalization error. We can encourage θ to stay small by adding a regularization term to eq. 1.7:
$$J(\theta) = \frac{1}{n}\sum_{t=1}^{n} \frac{1}{2}\left(y^{(t)} - \theta \cdot x^{(t)}\right)^2 + \frac{\lambda}{2}\|\theta\|^2 \quad (2.1)$$
Since $\frac{\lambda}{2}\|\theta\|^2 = \frac{\lambda}{2}\theta_1^2 + \frac{\lambda}{2}\theta_2^2 + \cdots + \frac{\lambda}{2}\theta_d^2$, this contributes a single additional term to eq. 1.11:

$$\frac{\partial}{\partial\theta_i} J(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left(y^{(t)} - {x^{(t)}}^T\theta\right)\left(-x^{(t)}_i\right) + \lambda\theta_i \quad (2.2)$$
So similarly to eq. 1.13, setting the partial to zero:

$$\frac{1}{n}\sum_{t=1}^{n} x^{(t)}_i\, {x^{(t)}}^T\theta + \lambda\theta_i = \frac{1}{n}\sum_{t=1}^{n} x^{(t)}_i\, y^{(t)} \quad (2.3)$$
Thus, collapsing the equations for all partials:

$$\frac{1}{n}\sum_{t=1}^{n} x^{(t)} {x^{(t)}}^T\theta + \lambda\theta = \frac{1}{n}\sum_{t=1}^{n} x^{(t)} y^{(t)} \quad (2.4)$$
And in matrix form:

$$\frac{1}{n} X^T X\theta + \lambda\theta = \frac{1}{n} X^T y \quad (2.5)$$
$$\left(\frac{1}{n} X^T X + \lambda I\right)\theta = \frac{1}{n} X^T y \quad (2.6)$$
Which gives us the solution of our least-squares regression problem with regularization:

$$\theta = \left(\frac{1}{n} X^T X + \lambda I\right)^{-1} \frac{1}{n} X^T y$$
Note: You can see that the $\frac{1}{n}$ term was not dropped in the regularized form of the problem; keeping it is equivalent to a rescaling of λ.
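A minimal sketch of this regularized (ridge) solution in NumPy, on synthetic data with an arbitrary λ (not part of the original notes); it also shows the rescaling of λ mentioned in the note above:

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 100, 5
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    lam = 0.1                                           # regularization parameter lambda

    # theta = ((1/n) X^T X + lambda I)^{-1} (1/n) X^T y
    theta_ridge = np.linalg.solve((X.T @ X) / n + lam * np.eye(d), (X.T @ y) / n)

    # Keeping or dropping the 1/n is just a rescaling of lambda: lambda' = n * lambda.
    theta_rescaled = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
    print(np.allclose(theta_ridge, theta_rescaled))     # True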
Aside: how do we know $\frac{1}{n} X^T X + \lambda I$ is invertible? Here is a rough proof outline (written for $X^T X + \lambda I$; the same argument applies with the $\frac{1}{n}$ factor):

• First note that $X^T X$ is positive semidefinite.
  Proof: $u^T X^T X u = \|Xu\|^2 \geq 0$, so all eigenvalues must be ≥ 0.
• The eigenvalues of $X^T X + \lambda I$ are $\mu_i + \lambda$, where $\mu_i$ are the eigenvalues of $X^T X$.
  Proof: $X^T X u = \mu u$ implies $(X^T X + \lambda I)u = (\mu + \lambda)u$.
• All eigenvalues of $X^T X + \lambda I$ are strictly positive, so it must be invertible.
  Proof: $\lambda > 0$, and so $\mu_i + \lambda > 0$ for every i.
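The same argument can be spot-checked numerically (again only a sketch with random data):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 5))
    lam = 0.1

    mu = np.linalg.eigvalsh(X.T @ X)                    # eigenvalues of X^T X
    print(np.all(mu >= -1e-10))                         # positive semidefinite (>= 0 up to round-off)

    shifted = np.linalg.eigvalsh(X.T @ X + lam * np.eye(5))
    print(np.allclose(shifted, mu + lam))               # each eigenvalue shifted by lambda
    print(np.all(shifted > 0))                          # strictly positive, hence invertible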
3 REGRESSION WITH REGULARIZATION AND OFFSET
The offset parameter can be thought of as adjusting to the magnitude of the data. It is a
single scalar, and we do not want to penalize its size during regularization, because a large
offset might allow many of the other parameters to be much smaller and still provide a good
fit for the data. The calculation for the least-squares solution with regularization and offset is
similar to the calculation without offset.
$$J(\theta) = \frac{1}{n}\sum_{t=1}^{n} \frac{1}{2}\left(y^{(t)} - \left(\theta \cdot x^{(t)} + \theta_0\right)\right)^2 + \frac{\lambda}{2}\|\theta\|^2 \quad (3.1)$$
Setting $\nabla_\theta J(\theta) = 0$ as before, we have:

$$\frac{1}{n}\sum_{t=1}^{n} x^{(t)} {x^{(t)}}^T\theta + \frac{1}{n}\theta_0\sum_{t=1}^{n} x^{(t)} + \lambda\theta = \frac{1}{n}\sum_{t=1}^{n} x^{(t)} y^{(t)} \quad (3.2)$$
And in matrix form:

$$\frac{1}{n} X^T X\theta + \frac{1}{n}\theta_0 X^T \mathbf{1} + \lambda\theta = \frac{1}{n} X^T y \quad (3.3)$$
$$\left(\frac{1}{n} X^T X + \lambda I\right)\theta + \left(\frac{1}{n} X^T \mathbf{1}\right)\theta_0 = \frac{1}{n} X^T y \quad (3.4)$$

Where $\mathbf{1}$ is the vector composed of all ones (here with dimension $n \times 1$), and so $X^T \mathbf{1} = \sum_{t=1}^{n} x^{(t)}$.
We separately minimize J(θ) with respect to $\theta_0$ (which does not appear in the regularizer):

$$\frac{\partial}{\partial\theta_0} J(\theta) = \frac{1}{n}\sum_{t=1}^{n}\left(y^{(t)} - {x^{(t)}}^T\theta - \theta_0\right)(-1) \quad (3.5)$$

Setting this partial to 0 we get:

$$\theta_0 = \frac{1}{n}\sum_{t=1}^{n}\left(y^{(t)} - {x^{(t)}}^T\theta\right) \quad (3.6)$$
$$= \frac{1}{n} y^T\mathbf{1} - \frac{1}{n}\left(X\theta\right)^T\mathbf{1} \quad (3.7)$$
Note that we now have 2 equations (3.4 and 3.7) in 2 unknowns, θ and $\theta_0$, that can be solved algebraically.
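One way to solve them jointly (a sketch on synthetic data, not spelled out in the original notes) is to stack eq. 3.4 and eq. 3.6 into a single (d+1)-dimensional linear system in θ and $\theta_0$:

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 100, 5
    X = rng.normal(size=(n, d))
    true_theta = rng.normal(size=d)
    y = 2.0 + X @ true_theta + 0.1 * rng.normal(size=n)   # data with a true offset of 2.0
    lam = 0.1
    ones = np.ones(n)

    # Block system from eq. 3.4 and eq. 3.6:
    #   ((1/n) X^T X + lam I) theta + ((1/n) X^T 1) theta0 = (1/n) X^T y
    #   ((1/n) 1^T X) theta         +             theta0   = (1/n) 1^T y
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = (X.T @ X) / n + lam * np.eye(d)
    A[:d, d] = (X.T @ ones) / n
    A[d, :d] = (ones @ X) / n
    A[d, d] = 1.0
    b = np.concatenate([(X.T @ y) / n, [(ones @ y) / n]])

    solution = np.linalg.solve(A, b)
    theta, theta0 = solution[:d], solution[d]
    print(theta0)                                          # close to the true offset of 2.0 for small lambda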
4 ADDITIONAL RESOURCES
• Matrix cookbook: <http://www.mit.edu/~wingated/stuff_i_use/matrix_cookbook.pdf>.
• Matrix approach to linear regression with a statistical perspective: <http://www.maths.qmul.ac.uk/~bb/SM_I_2013_LecturesWeek_6.pdf>.