MACHINE LEARNING COLLECTION
THEORY
Supervised Learning Problem
Linear Regression / Multiple Features
Marco Moldenhauer
May 31, 2017
1 General
Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives com-
puters the ability to learn without being explicitly programmed." This is an older, informal definition. Tom Mitchell
provides a more modern definition: "A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experi-
ence E."
In supervised learning, we are given a data set and already know what our correct output should look like, having
the idea that there is a relationship between the input and the output. Supervised learning problems are categorized
into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a con-
tinuous output, meaning that we are trying to map input variables to some continuous function. In a classification
problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input
variables into discrete categories.
2 Data Set
We define an input set Ω with one bias element x0 and multiple feature elements x1, x2, x3, ..., xj, ..., xn, and an output set Υ with one output element y1.

Ω = {x0; x1; x2; ···; xj; ···; xn} (2.1)

Υ = {y1} (2.2)

Arbitrary elements u ∈ Ω and v ∈ Υ are vectorized by m rows.
$$
x_0 = \begin{pmatrix} x_0^{(1)} \\ x_0^{(2)} \\ x_0^{(3)} \\ \vdots \\ x_0^{(i)} \\ \vdots \\ x_0^{(m)} \end{pmatrix}
    = \begin{pmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \\ \vdots \\ 1 \end{pmatrix} \quad (2.3)
$$
$$
x_1 = \begin{pmatrix} x_1^{(1)} \\ x_1^{(2)} \\ x_1^{(3)} \\ \vdots \\ x_1^{(i)} \\ \vdots \\ x_1^{(m)} \end{pmatrix}; \quad
x_2 = \begin{pmatrix} x_2^{(1)} \\ x_2^{(2)} \\ x_2^{(3)} \\ \vdots \\ x_2^{(i)} \\ \vdots \\ x_2^{(m)} \end{pmatrix}; \quad \cdots; \quad
x_j = \begin{pmatrix} x_j^{(1)} \\ x_j^{(2)} \\ x_j^{(3)} \\ \vdots \\ x_j^{(i)} \\ \vdots \\ x_j^{(m)} \end{pmatrix}; \quad \cdots; \quad
x_n = \begin{pmatrix} x_n^{(1)} \\ x_n^{(2)} \\ x_n^{(3)} \\ \vdots \\ x_n^{(i)} \\ \vdots \\ x_n^{(m)} \end{pmatrix} \quad (2.4)
$$
$$
y_1 = \begin{pmatrix} y_1^{(1)} \\ y_1^{(2)} \\ y_1^{(3)} \\ \vdots \\ y_1^{(i)} \\ \vdots \\ y_1^{(m)} \end{pmatrix} \quad (2.5)
$$
We define a set Π with m tuples, called the training set.
$$
\begin{aligned}
\Pi = \{\, & (1,\,x_1^{(1)},\,x_2^{(1)},\cdots,x_j^{(1)},\cdots,x_n^{(1)},\,y_1^{(1)}); \\
           & (1,\,x_1^{(2)},\,x_2^{(2)},\cdots,x_j^{(2)},\cdots,x_n^{(2)},\,y_1^{(2)}); \\
           & (1,\,x_1^{(3)},\,x_2^{(3)},\cdots,x_j^{(3)},\cdots,x_n^{(3)},\,y_1^{(3)}); \\
           & \;\vdots \\
           & (1,\,x_1^{(i)},\,x_2^{(i)},\cdots,x_j^{(i)},\cdots,x_n^{(i)},\,y_1^{(i)}); \\
           & \;\vdots \\
           & (1,\,x_1^{(m)},\,x_2^{(m)},\cdots,x_j^{(m)},\cdots,x_n^{(m)},\,y_1^{(m)}) \,\}
\end{aligned} \quad (2.6)
$$
For a better intuition of the training set we can write it in a table, called the training table. Each row is one training example. In total we have m training examples.
1   x_1^{(1)}   x_2^{(1)}   ...   x_j^{(1)}   ...   x_n^{(1)}   y_1^{(1)}
1   x_1^{(2)}   x_2^{(2)}   ...   x_j^{(2)}   ...   x_n^{(2)}   y_1^{(2)}
1   x_1^{(3)}   x_2^{(3)}   ...   x_j^{(3)}   ...   x_n^{(3)}   y_1^{(3)}
...
1   x_1^{(i)}   x_2^{(i)}   ...   x_j^{(i)}   ...   x_n^{(i)}   y_1^{(i)}
...
1   x_1^{(m)}   x_2^{(m)}   ...   x_j^{(m)}   ...   x_n^{(m)}   y_1^{(m)}

Table 2.1 Training table
We define the discrete training function t(x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, ···, x_j^{(i)}, ···, x_n^{(i)}):

$$
\begin{aligned}
t : \{\, & (x_1^{(1)};\,x_2^{(1)};\,x_3^{(1)};\,\ldots;\,x_j^{(1)};\,\ldots;\,x_n^{(1)}); \\
         & (x_1^{(2)};\,x_2^{(2)};\,x_3^{(2)};\,\ldots;\,x_j^{(2)};\,\ldots;\,x_n^{(2)}); \\
         & (x_1^{(3)};\,x_2^{(3)};\,x_3^{(3)};\,\ldots;\,x_j^{(3)};\,\ldots;\,x_n^{(3)}); \\
         & \;\vdots \\
         & (x_1^{(i)};\,x_2^{(i)};\,x_3^{(i)};\,\ldots;\,x_j^{(i)};\,\ldots;\,x_n^{(i)}); \\
         & \;\vdots \\
         & (x_1^{(m)};\,x_2^{(m)};\,x_3^{(m)};\,\ldots;\,x_j^{(m)};\,\ldots;\,x_n^{(m)}) \,\}
\;\to\; \{\, y_1^{(1)};\ y_1^{(2)};\ y_1^{(3)};\,\ldots;\ y_1^{(i)};\,\ldots;\ y_1^{(m)} \,\}
\end{aligned} \quad (2.7)
$$
$$
(x_1^{(i)};\,x_2^{(i)};\,x_3^{(i)};\,\ldots;\,x_j^{(i)};\,\ldots;\,x_n^{(i)}) \;\mapsto\; y_1^{(i)}
$$
We define a matrix T called training matrix. We also define the input matrix X and the output matrix Y.
$$
T = \begin{pmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_j^{(1)} & \cdots & x_n^{(1)} & y_1^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_j^{(2)} & \cdots & x_n^{(2)} & y_1^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_j^{(3)} & \cdots & x_n^{(3)} & y_1^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_j^{(i)} & \cdots & x_n^{(i)} & y_1^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_j^{(m)} & \cdots & x_n^{(m)} & y_1^{(m)}
\end{pmatrix}; \quad (2.8)
$$

$$
X = \begin{pmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_j^{(1)} & \cdots & x_n^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_j^{(2)} & \cdots & x_n^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_j^{(3)} & \cdots & x_n^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_j^{(i)} & \cdots & x_n^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_j^{(m)} & \cdots & x_n^{(m)}
\end{pmatrix}; \quad
Y = \begin{pmatrix} y_1^{(1)} \\ y_1^{(2)} \\ y_1^{(3)} \\ \vdots \\ y_1^{(i)} \\ \vdots \\ y_1^{(m)} \end{pmatrix} \quad (2.9)
$$
In MATLAB we load our data set and initialize the variables T, X and Y:
data = load('example.txt');         % each row: feature values followed by the output
m = size(data,1);                   % number of training examples
z = size(data,2);                   % number of columns (n features + 1 output)
T = [ones(m, 1), data(:,1:z)];      % training matrix: bias column, features, output
X = [ones(m, 1), data(:,1:(z-1))];  % input matrix: bias column and features
Y = data(:,z);                      % output matrix (column vector)
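For illustration only, a hypothetical example.txt with n = 2 features and m = 3 training examples could look like the following (the numbers are made up); each row is one training example, with the output in the last column:
2104 3 399900
1600 3 329900
2400 3 369000
With this file, X becomes a 3×3 matrix (bias column plus two feature columns), Y a 3×1 column vector, and T the 3×4 training matrix.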
2.1 Feature Scaling
Feature scaling is a method used to standardize the range of independent features of data set. Since the range of
values of the features varies widely, in some machine learning algorithms, objective functions will not work properly
without standardization. Therefore, the range of all features should be standardize so that each feature contributes
approximately proportionately to the final distance. Another reason why feature scaling is applied is that gradient
descent converges much faster with feature scaling than without it. If feature scaling is needed for all xj ∈ Ω  x0 in
the data set, we define a set ˘Ω = {x0; ˘x1; ˘x2; ···; ˘xj ; ···; ˘xn } called standardize input set.
$$
\breve{\Omega} = \left\{\, x_0;\ \breve{x}_j \;\middle|\; \breve{x}_j = \frac{x_j - \mu_j}{\sigma_j},\ \forall x_j \in \Omega \setminus \{x_0\},\ \forall \mu_j \in \mathbb{R}^m,\ \forall \sigma_j \in \mathbb{R} \,\right\}
= \{x_0;\ \breve{x}_1;\ \breve{x}_2;\ \cdots;\ \breve{x}_j;\ \cdots;\ \breve{x}_n\} \quad (2.10)
$$
with the arithmetic mean vector µj and the standard deviation σj .
$$
\mu_j = \begin{pmatrix}
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\vdots \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\vdots \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)}
\end{pmatrix} \quad (2.11)
$$
$$
\sigma_j = \sqrt{\frac{1}{m-1} \cdot \sum_{i=1}^{m} \left( x_j^{(i)} - \frac{1}{m} \cdot \sum_{k=1}^{m} x_j^{(k)} \right)^2 } \quad (2.12)
$$
Finally we re-define our training matrix T into a "standardized" training matrix and our input matrix X into a "standardized" input matrix.
$$
T = \begin{pmatrix}
1 & \breve{x}_1^{(1)} & \breve{x}_2^{(1)} & \cdots & \breve{x}_j^{(1)} & \cdots & \breve{x}_n^{(1)} & y_1^{(1)} \\
1 & \breve{x}_1^{(2)} & \breve{x}_2^{(2)} & \cdots & \breve{x}_j^{(2)} & \cdots & \breve{x}_n^{(2)} & y_1^{(2)} \\
1 & \breve{x}_1^{(3)} & \breve{x}_2^{(3)} & \cdots & \breve{x}_j^{(3)} & \cdots & \breve{x}_n^{(3)} & y_1^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & \breve{x}_1^{(i)} & \breve{x}_2^{(i)} & \cdots & \breve{x}_j^{(i)} & \cdots & \breve{x}_n^{(i)} & y_1^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & \breve{x}_1^{(m)} & \breve{x}_2^{(m)} & \cdots & \breve{x}_j^{(m)} & \cdots & \breve{x}_n^{(m)} & y_1^{(m)}
\end{pmatrix}; \quad (2.13)
$$

$$
X = \begin{pmatrix}
1 & \breve{x}_1^{(1)} & \breve{x}_2^{(1)} & \cdots & \breve{x}_j^{(1)} & \cdots & \breve{x}_n^{(1)} \\
1 & \breve{x}_1^{(2)} & \breve{x}_2^{(2)} & \cdots & \breve{x}_j^{(2)} & \cdots & \breve{x}_n^{(2)} \\
1 & \breve{x}_1^{(3)} & \breve{x}_2^{(3)} & \cdots & \breve{x}_j^{(3)} & \cdots & \breve{x}_n^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & \breve{x}_1^{(i)} & \breve{x}_2^{(i)} & \cdots & \breve{x}_j^{(i)} & \cdots & \breve{x}_n^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & \breve{x}_1^{(m)} & \breve{x}_2^{(m)} & \cdots & \breve{x}_j^{(m)} & \cdots & \breve{x}_n^{(m)}
\end{pmatrix} \quad (2.14)
$$
In MATLAB we initialize the "standardized" input matrix X:
for j=2:size(X,2)                                % skip the bias column
mu = mean(X(:,j)) * ones( size(X(:,j),1), 1);    % mean of feature j, repeated m times
sigma = std(X(:,j));                             % sample standard deviation of feature j
X(:,j) = ( X(:,j)-mu ) / sigma;                  % standardize feature j
end
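The loop above overwrites X in place and discards the scaling parameters. A variant worth noting (not part of the original code; mu_vec and sigma_vec are illustrative names) stores them so that a new input can later be scaled exactly like the training data before evaluating the hypothesis:
mu_vec    = zeros(1, size(X,2));     % per-column means (bias column stays 0)
sigma_vec = ones(1, size(X,2));      % per-column standard deviations (bias column stays 1)
for j = 2:size(X,2)
mu_vec(j)    = mean(X(:,j));
sigma_vec(j) = std(X(:,j));
X(:,j) = ( X(:,j) - mu_vec(j) ) / sigma_vec(j);
end
% a new input row x_new = [1, x1, ..., xn] is then scaled with the same parameters:
% x_new(2:end) = ( x_new(2:end) - mu_vec(2:end) ) ./ sigma_vec(2:end);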
3 Hypothesis Function
For the supervised learning problem of linear regression, our goal is, given a training set Π, to learn a linear function. We define the continuous hypothesis function h(x1, x2, x3, ···, xj, ···, xn):

$$
h : \mathbb{R}^n \to \mathbb{R} \quad (3.1)
$$
$$
(x_1, x_2, x_3, \ldots, x_j, \ldots, x_n) \mapsto \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \theta_3 \cdot x_3 + \cdots + \theta_j \cdot x_j + \cdots + \theta_n \cdot x_n
$$

so that h(x1, x2, x3, ···, xj, ···, xn) is a good predictor for n arbitrary input variables. For historical reasons the function is called the hypothesis. In other words, we are looking for the best values of θ0, θ1, θ2, θ3, ..., θj, ..., θn. We define the parameter vector θ
$$
\theta = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \\ \theta_j \\ \vdots \\ \theta_n \end{pmatrix} \quad (3.2)
$$
and the hypothesis-input vector x
$$
x = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_j \\ \vdots \\ x_n \end{pmatrix} \quad (3.3)
$$
So we can write our hypothesis function h(x1,x2,x3,··· ,xj ,··· ,xn) in vectorized form.
$$
h(x_1,x_2,x_3,\cdots,x_j,\cdots,x_n) = \theta^T \cdot x =
\begin{pmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \cdots & \theta_j & \cdots & \theta_n \end{pmatrix} \cdot
\begin{pmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_j \\ \vdots \\ x_n \end{pmatrix} \quad (3.4)
$$
$$
= \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \theta_3 \cdot x_3 + \cdots + \theta_j \cdot x_j + \cdots + \theta_n \cdot x_n
$$
In MATLAB we initialize the parameter vector θ and start with an arbitrary value for θ0, θ1, θ2, θ3, ..., θj, ..., θn (preferably equal to 0).
theta = zeros(size(X,2),1);
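To make equation (3.4) concrete, the hypothesis can be evaluated for a single hypothesis-input vector in MATLAB. The vector x_vec and its values are purely illustrative, and if feature scaling was applied the feature values would have to be scaled with the same µj and σj first:
x_vec = [1; 2104; 3];     % hypothesis-input vector for n = 2 features (made-up values)
h = theta' * x_vec;       % theta^T * x, i.e. equation (3.4)
With θ still at its initial value of zero, h is of course 0; the prediction only becomes useful after θ has been fitted by gradient descent or the normal equation.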
4 Cost Function
We can measure the accuracy of our hypothesis function h(x1,x2,x3,··· ,xj ,··· ,xn) by using a cost function J(θ0,θ1,θ2,θ3,···,θj ,···,θn). This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, ..., x_j^{(i)}, ..., x_n^{(i)} and the actual outputs y_1^{(i)}. This function is otherwise called the "squared error function" or "mean squared error". The mean is halved (1/2) as a convenience for the computation of gradient descent, as the derivative of the square term will cancel out the 1/2.
$$
J(\theta_0,\theta_1,\theta_2,\theta_3,\cdots,\theta_j,\cdots,\theta_n) = \frac{1}{2 \cdot m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, x_2^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right)^2 \quad (4.1)
$$
Unlike in the journal "Supervised Learning Problem / Linear Regression / One Feature", we refrain from expanding the sum in the cost function. For this reason we do not define any coefficients and we do not plot the cost function.
In algebra we call this function a quadratic polynomial in (n + 1) variables. Note that the cost function for linear regression is always a bowl-shaped function, which means that J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) has no local optima other than the single global optimum.
In MATLAB we initialize the cost function J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) with our given input matrix X, given output
matrix Y and an arbitrary parameter vector θ
J = 1/(2*m) * sum( (X*theta - Y).^2 );
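As a quick sanity check of equation (4.1), consider a made-up toy data set (not from the text) generated exactly by y = 1 + 2·x1: the cost is zero for the parameter vector that reproduces the outputs, and positive otherwise.
X_toy = [1 1; 1 2; 1 3];                               % m = 3 examples, bias column plus one feature
Y_toy = [3; 5; 7];                                     % outputs generated by y = 1 + 2*x1
m_toy = size(X_toy,1);
J0 = 1/(2*m_toy) * sum( (X_toy*[1; 2] - Y_toy).^2 );   % theta = [1; 2] fits exactly, J0 = 0
J1 = 1/(2*m_toy) * sum( (X_toy*[0; 0] - Y_toy).^2 );   % theta = [0; 0], J1 = 83/6, approx. 13.83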
5 Gradient Descent Algorithm
So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in. Imagine that we graph our hypothesis function based on its parameters; actually we are graphing the cost function as a function of the parameter estimates. We will know that we have succeeded when our cost function is at the very bottom of the graph. The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it gives us a direction to move towards. We make steps down the cost function in the direction of the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate. A smaller α results in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of the cost function ∂/∂θj J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn), depending on where the parameter vector θ starts on the graph.
In other words, we want an efficient algorithm to find the values of θ0, θ1, θ2, θ3, ..., θj, ..., θn. The gradient descent algorithm is:
Keep changing θ0, θ1, θ2, θ3, ..., θj, ..., θn until we end up at the global minimum. Remember: we start at θ0 = 0, θ1 = 0, θ2 = 0, θ3 = 0, ···, θj = 0, ···, θn = 0.
$$
\begin{aligned}
\theta_{temp\_0} &:= \theta_0 - \alpha \cdot \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \theta_0 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right) \\
\theta_{temp\_1} &:= \theta_1 - \alpha \cdot \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \theta_1 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right) \cdot x_1^{(i)} \\
\theta_{temp\_2} &:= \theta_2 - \alpha \cdot \frac{\partial}{\partial \theta_2} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \theta_2 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right) \cdot x_2^{(i)} \\
&\;\vdots \\
\theta_{temp\_j} &:= \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \theta_j - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right) \cdot x_j^{(i)} \\
&\;\vdots \\
\theta_{temp\_n} &:= \theta_n - \alpha \cdot \frac{\partial}{\partial \theta_n} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \theta_n - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right) \cdot x_n^{(i)}
\end{aligned} \quad (5.1)
$$
$$
\theta_0 := \theta_{temp\_0}; \quad \theta_1 := \theta_{temp\_1}; \quad \theta_2 := \theta_{temp\_2}; \quad \cdots; \quad \theta_j := \theta_{temp\_j}; \quad \cdots; \quad \theta_n := \theta_{temp\_n}
$$
• if α is too small, gradient descent can be slow
• if α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge
The gradient descent algorithm has converged successfully if, after enough iteration steps, the partial derivatives of the cost function are zero.
$$
\alpha \cdot \frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \alpha \cdot 0; \qquad
\alpha \cdot \frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \alpha \cdot 0; \qquad (5.2)
$$
$$
\alpha \cdot \frac{\partial}{\partial \theta_2} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \alpha \cdot 0; \quad \cdots; \quad
\alpha \cdot \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \alpha \cdot 0; \quad \cdots; \quad
\alpha \cdot \frac{\partial}{\partial \theta_n} J(\theta_0,\theta_1,\theta_2,\cdots,\theta_j,\cdots,\theta_n) = \alpha \cdot 0
$$
In MATLAB we initialize the gradient descent algorithm with our given input matrix X, given output matrix Y and an arbitrary parameter vector θ. At the beginning we need to initialize the number of iteration steps iter and the learning rate alpha. To check that our learning algorithm is working well we calculate the value of our cost function at every step. For this reason we initialize a cost-convergence-test function J_test. After the gradient descent algorithm is done we plot the cost-convergence-test function depending on the iteration steps iters.
iter = 1000;                        % number of gradient descent iterations
alpha = 0.01;                       % learning rate
J_test = zeros(iter,1);             % cost value at every iteration step
iters = (1:iter)';
theta_temp = zeros(size(theta));    % temporary column vector for the simultaneous update
for k = 1:iter
J_test(k) = 1/(2*m) * sum( (X*theta - Y).^2 );
for j = 1:size(X,2)
theta_temp(j) = theta(j) - alpha*(1/m) * sum( (X*theta - Y).*X(:,j) );
end
theta = theta_temp;                 % update all parameters simultaneously
end
plot(iters, J_test)
disp(theta)
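The inner loop over j can also be collapsed into a single vectorized update. The following sketch computes exactly the same simultaneous update as the code above (it is a reformulation, not part of the original text):
for k = 1:iter
J_test(k) = 1/(2*m) * sum( (X*theta - Y).^2 );
theta = theta - alpha*(1/m) * X' * (X*theta - Y);   % all partial derivatives stacked into one vector
end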
6 Normal Equation Algorithm (Alternative)
Gradient descent gives one way of minimizing J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn). Let’s discuss a second way of doing so,
this time performing the minimization explicitly and without resorting to an iterative algorithm. In the "Normal
Equation" method, we will minimize J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) by explicitly taking its derivatives with respect to
the θj ’s, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation
formula is given below:
$$
\theta = (X^T \cdot X)^{-1} \cdot X^T \cdot Y \quad (6.1)
$$

With the normal equation, computing the inversion has complexity O(n³). So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.
In MATLAB we write
theta = inv(X'*X)*X'*Y;
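As noted in the conclusions below, pinv is the more robust choice, since it still returns a value for θ when Xᵀ·X is not invertible (for example because of redundant features):
theta = pinv(X'*X)*X'*Y;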
7 Important Conclusions
• Gradient Descent Algorithm
– Need to choose α
– Needs many iterations, O(kn²)
– Works well when n is large
– if α is too small, gradient descent can be slow
– if α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge
• Normal Equation Algorithm
– No need to choose α
– No need to iterate
– O(n³), need to calculate the inverse of (Xᵀ · X)
– Slow if n is very large
– When implementing the normal equation in MATLAB we want to use the pinv function rather than inv.
The pinv function will give you a value of θ even if (Xᵀ · X) is not invertible.
– If (Xᵀ · X) is noninvertible, the common causes might be:
* Redundant features, where two features are very closely related (i.e. they are linearly dependent)
* Too many features (e.g. m ≤ n). In this case, delete some features or use regularization (see other journal).
• Feature Scaling
– Use it if the range of values of the features varies widely
– Gradient descent converges much faster with feature scaling than without it
Glossary
Ω Input set
˘Ω Standardized input set
Υ Output set
µj Arithmetic mean vector
σj Standard deviation
x0 Bias element
x1 Feature 1 element
x2 Feature 2 element
x3 Feature 3 element
xj Feature j element
xn Feature n element
˘x1 Standardized feature 1 element
˘x2 Standardized feature 2 element
˘xj Standardized feature j element
˘xn Standardized feature n element
y1 Output element
u Arbitrary element in the input set
v Arbitrary element in the output set
m Number of training examples
Π Training set
t(x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, ···, x_j^{(i)}, ···, x_n^{(i)}) Training function (discrete)
T Training matrix
X Input matrix
Y Output matrix
h(x1,x2,x3,··· ,xj ,··· ,xn) Hypothesis function (continuous)
θ0 1-st element of the parameter vector
θ1 2-nd element of the parameter vector
θ2 3-rd element of the parameter vector
θ3 4-th element of the parameter vector
θj j-th element of the parameter vector
θn (n+1)-th element of the parameter vector
θ Parameter vector
x Hypothesis-input vector
J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) Cost function
x_1^{(i)} i-th feature 1 value
x_2^{(i)} i-th feature 2 value
x_3^{(i)} i-th feature 3 value
x_j^{(i)} i-th feature j value
x_n^{(i)} i-th feature n value
y_1^{(i)} i-th output value
α Learning rate
∂/∂θj J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) Partial derivative of the cost function
X^T Transpose of the input matrix
n Number of feature elements