X02 Supervised learning problem linear regression multiple features

MACHINE LEARNING COLLECTION
THEORY
Supervised Learning Problem
Linear Regression / Multiple Features
Marco Moldenhauer
May 31, 2017

1 General
Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition. Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
In supervised learning, we are given a data set and already know what the correct output should look like, on the assumption that there is a relationship between the input and the output. Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

2 Data Set
We define an input set Ω with one bias element x0 and n feature elements x1, x2, x3, ..., xj, ..., xn, and an output set Υ with one output element y1.
$$\Omega = \{x_0; x_1; x_2; \cdots; x_j; \cdots; x_n\} \tag{2.1}$$
$$\Upsilon = \{y_1\} \tag{2.2}$$
Every element u ∈ Ω and v ∈ Υ is a column vector with m rows.
$$x_0 = \begin{bmatrix} x_0^{(1)} \\ x_0^{(2)} \\ x_0^{(3)} \\ \vdots \\ x_0^{(i)} \\ \vdots \\ x_0^{(m)} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \\ \vdots \\ 1 \end{bmatrix} \tag{2.3}$$
$$x_1 = \begin{bmatrix} x_1^{(1)} \\ x_1^{(2)} \\ x_1^{(3)} \\ \vdots \\ x_1^{(i)} \\ \vdots \\ x_1^{(m)} \end{bmatrix};\quad x_2 = \begin{bmatrix} x_2^{(1)} \\ x_2^{(2)} \\ x_2^{(3)} \\ \vdots \\ x_2^{(i)} \\ \vdots \\ x_2^{(m)} \end{bmatrix};\quad \cdots;\quad x_j = \begin{bmatrix} x_j^{(1)} \\ x_j^{(2)} \\ x_j^{(3)} \\ \vdots \\ x_j^{(i)} \\ \vdots \\ x_j^{(m)} \end{bmatrix};\quad \cdots;\quad x_n = \begin{bmatrix} x_n^{(1)} \\ x_n^{(2)} \\ x_n^{(3)} \\ \vdots \\ x_n^{(i)} \\ \vdots \\ x_n^{(m)} \end{bmatrix} \tag{2.4}$$
$$y_1 = \begin{bmatrix} y_1^{(1)} \\ y_1^{(2)} \\ y_1^{(3)} \\ \vdots \\ y_1^{(i)} \\ \vdots \\ y_1^{(m)} \end{bmatrix} \tag{2.5}$$
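We define a set Π of m tuples, called the training set.
$$\begin{aligned}
\Pi = \{\, & (1, x_1^{(1)}, x_2^{(1)}, \cdots, x_j^{(1)}, \cdots, x_n^{(1)}, y_1^{(1)}); \\
& (1, x_1^{(2)}, x_2^{(2)}, \cdots, x_j^{(2)}, \cdots, x_n^{(2)}, y_1^{(2)}); \\
& (1, x_1^{(3)}, x_2^{(3)}, \cdots, x_j^{(3)}, \cdots, x_n^{(3)}, y_1^{(3)}); \\
& \vdots \\
& (1, x_1^{(i)}, x_2^{(i)}, \cdots, x_j^{(i)}, \cdots, x_n^{(i)}, y_1^{(i)}); \\
& \vdots \\
& (1, x_1^{(m)}, x_2^{(m)}, \cdots, x_j^{(m)}, \cdots, x_n^{(m)}, y_1^{(m)}) \,\}
\end{aligned} \tag{2.6}$$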

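To build intuition for the training set, we can write it as a table, called the training table. Each row is one training example; in total we have m training examples.
$$\begin{array}{cccccccc}
x_0 & x_1 & x_2 & \cdots & x_j & \cdots & x_n & y_1 \\ \hline
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_j^{(1)} & \cdots & x_n^{(1)} & y_1^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_j^{(2)} & \cdots & x_n^{(2)} & y_1^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_j^{(3)} & \cdots & x_n^{(3)} & y_1^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_j^{(i)} & \cdots & x_n^{(i)} & y_1^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_j^{(m)} & \cdots & x_n^{(m)} & y_1^{(m)}
\end{array}$$
Table 2.1 Training table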
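We define the discrete training function $t(x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, \cdots, x_j^{(i)}, \cdots, x_n^{(i)})$, which maps the feature values of the i-th training example to its output value:
$$\begin{aligned}
t : \{\, & (x_1^{(1)}; x_2^{(1)}; x_3^{(1)}; \cdots; x_j^{(1)}; \cdots; x_n^{(1)}); \\
& (x_1^{(2)}; x_2^{(2)}; x_3^{(2)}; \cdots; x_j^{(2)}; \cdots; x_n^{(2)}); \\
& (x_1^{(3)}; x_2^{(3)}; x_3^{(3)}; \cdots; x_j^{(3)}; \cdots; x_n^{(3)}); \\
& \vdots \\
& (x_1^{(i)}; x_2^{(i)}; x_3^{(i)}; \cdots; x_j^{(i)}; \cdots; x_n^{(i)}); \\
& \vdots \\
& (x_1^{(m)}; x_2^{(m)}; x_3^{(m)}; \cdots; x_j^{(m)}; \cdots; x_n^{(m)}) \,\} \to \{ y_1^{(1)}; y_1^{(2)}; y_1^{(3)}; \cdots; y_1^{(i)}; \cdots; y_1^{(m)} \} \\
& (x_1^{(i)}; x_2^{(i)}; x_3^{(i)}; \cdots; x_j^{(i)}; \cdots; x_n^{(i)}) \mapsto y_1^{(i)}
\end{aligned} \tag{2.7}$$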
We define a matrix T called training matrix. We also define the input matrix X and the output matrix Y.
$$T = \begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_j^{(1)} & \cdots & x_n^{(1)} & y_1^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_j^{(2)} & \cdots & x_n^{(2)} & y_1^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_j^{(3)} & \cdots & x_n^{(3)} & y_1^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_j^{(i)} & \cdots & x_n^{(i)} & y_1^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_j^{(m)} & \cdots & x_n^{(m)} & y_1^{(m)}
\end{bmatrix} \tag{2.8}$$
$$X = \begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_j^{(1)} & \cdots & x_n^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_j^{(2)} & \cdots & x_n^{(2)} \\
1 & x_1^{(3)} & x_2^{(3)} & \cdots & x_j^{(3)} & \cdots & x_n^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & x_1^{(i)} & x_2^{(i)} & \cdots & x_j^{(i)} & \cdots & x_n^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_j^{(m)} & \cdots & x_n^{(m)}
\end{bmatrix};\quad
Y = \begin{bmatrix} y_1^{(1)} \\ y_1^{(2)} \\ y_1^{(3)} \\ \vdots \\ y_1^{(i)} \\ \vdots \\ y_1^{(m)} \end{bmatrix} \tag{2.9}$$

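In MATLAB we load our data set and initialize the variables T, X and Y.
data = load('example.txt');           % one training example per row, output in the last column
m = size(data,1);                     % number of training examples
z = size(data,2);                     % number of columns (n features + 1 output)
T = [ones(m,1), data(:,1:z)];         % training matrix: bias column, features, output
X = [ones(m,1), data(:,1:(z-1))];     % input matrix: bias column and features
Y = data(:,z);                        % output matrix (column vector)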
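To make the shapes of T, X and Y concrete, the same steps can be tried on a small in-memory matrix instead of a file. The sketch below is only an illustration: the four data rows are hypothetical values chosen so that y1 = 2·x1 + 3·x2, and they do not come from example.txt.

```matlab
% Hypothetical toy data: m = 4 training examples, n = 2 features.
% Columns: x1, x2, y1 (the values satisfy y1 = 2*x1 + 3*x2).
data = [1 2  8;
        2 1  7;
        3 4 18;
        4 3 17];

m = size(data,1);                   % number of training examples
z = size(data,2);                   % number of columns (features + output)
T = [ones(m,1), data(:,1:z)];       % m x (n+2): bias, x1, x2, y1
X = [ones(m,1), data(:,1:(z-1))];   % m x (n+1): bias, x1, x2
Y = data(:,z);                      % m x 1: y1

disp(T)
```

With n = 2 features, X has n + 1 = 3 columns because of the bias column, and T is simply X with Y appended on the right.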

2.1 Feature Scaling
Feature scaling is a method used to standardize the range of the independent features of a data set. Because the ranges of the feature values can vary widely, the objective functions of some machine learning algorithms will not work properly without standardization. Therefore, the range of all features should be standardized so that each feature contributes approximately proportionately to the final distance. Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it. If feature scaling is needed, it is applied to all xj ∈ Ω \ {x0} in the data set (the bias element x0 is left unchanged), and we define a set $\breve{\Omega} = \{x_0; \breve{x}_1; \breve{x}_2; \cdots; \breve{x}_j; \cdots; \breve{x}_n\}$ called the standardized input set.
$$\breve{\Omega} = \left\{ x_0;\ \breve{x}_j \;\middle|\; \breve{x}_j = \frac{x_j - \mu_j}{\sigma_j},\ \forall x_j \in \Omega \setminus \{x_0\},\ \forall \mu_j \in \mathbb{R}^m,\ \forall \sigma_j \in \mathbb{R} \right\} = \{x_0; \breve{x}_1; \breve{x}_2; \cdots; \breve{x}_j; \cdots; \breve{x}_n\} \tag{2.10}$$
with the arithmetic mean vector µj and the standard deviation σj.
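The arithmetic mean vector µj has m rows, each holding the same mean of feature j:
$$\mu_j = \begin{bmatrix}
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\vdots \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)} \\
\vdots \\
\frac{1}{m} \cdot \sum_{i=1}^{m} x_j^{(i)}
\end{bmatrix} \tag{2.11}$$
The standard deviation σj is a scalar, normalized by m − 1 (the default of the MATLAB function std):
$$\sigma_j = \sqrt{\frac{1}{m-1} \cdot \sum_{i=1}^{m} \left( x_j^{(i)} - \frac{1}{m} \cdot \sum_{k=1}^{m} x_j^{(k)} \right)^2} \tag{2.12}$$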
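Finally we redefine our training matrix T as a "standardized" training matrix and our input matrix X as a "standardized" input matrix.
$$T = \begin{bmatrix}
1 & \breve{x}_1^{(1)} & \breve{x}_2^{(1)} & \cdots & \breve{x}_j^{(1)} & \cdots & \breve{x}_n^{(1)} & y_1^{(1)} \\
1 & \breve{x}_1^{(2)} & \breve{x}_2^{(2)} & \cdots & \breve{x}_j^{(2)} & \cdots & \breve{x}_n^{(2)} & y_1^{(2)} \\
1 & \breve{x}_1^{(3)} & \breve{x}_2^{(3)} & \cdots & \breve{x}_j^{(3)} & \cdots & \breve{x}_n^{(3)} & y_1^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & \breve{x}_1^{(i)} & \breve{x}_2^{(i)} & \cdots & \breve{x}_j^{(i)} & \cdots & \breve{x}_n^{(i)} & y_1^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
1 & \breve{x}_1^{(m)} & \breve{x}_2^{(m)} & \cdots & \breve{x}_j^{(m)} & \cdots & \breve{x}_n^{(m)} & y_1^{(m)}
\end{bmatrix} \tag{2.13}$$
$$X = \begin{bmatrix}
1 & \breve{x}_1^{(1)} & \breve{x}_2^{(1)} & \cdots & \breve{x}_j^{(1)} & \cdots & \breve{x}_n^{(1)} \\
1 & \breve{x}_1^{(2)} & \breve{x}_2^{(2)} & \cdots & \breve{x}_j^{(2)} & \cdots & \breve{x}_n^{(2)} \\
1 & \breve{x}_1^{(3)} & \breve{x}_2^{(3)} & \cdots & \breve{x}_j^{(3)} & \cdots & \breve{x}_n^{(3)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & \breve{x}_1^{(i)} & \breve{x}_2^{(i)} & \cdots & \breve{x}_j^{(i)} & \cdots & \breve{x}_n^{(i)} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
1 & \breve{x}_1^{(m)} & \breve{x}_2^{(m)} & \cdots & \breve{x}_j^{(m)} & \cdots & \breve{x}_n^{(m)}
\end{bmatrix} \tag{2.14}$$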

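In MATLAB we compute the "standardized" input matrix X.
for j = 2:size(X,2)                   % skip column 1, the bias column x0
    mu = mean(X(:,j));                % arithmetic mean of this feature column
    sigma = std(X(:,j));              % standard deviation (normalized by m-1)
    X(:,j) = (X(:,j) - mu) / sigma;   % standardize the column
end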
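The loop above can also be written without a loop, and it is often useful to keep the means and standard deviations so that a new input can be scaled in the same way before it is fed to the hypothesis. The following is a minimal sketch under stated assumptions: the matrix values, the names Xs, x_new and x_new_s, and the idea of scaling a new example are illustrative additions, not part of the original text. bsxfun is used so that the code does not rely on implicit expansion.

```matlab
% Hypothetical toy input matrix: bias column plus n = 2 features.
X = [1 1 2;
     1 2 1;
     1 3 4;
     1 4 3];

mu    = mean(X(:,2:end));           % 1 x n row vector of feature means
sigma = std(X(:,2:end));            % 1 x n row vector of standard deviations (m-1 normalization)

% Standardize all feature columns at once; the bias column stays untouched.
Xs = X;
Xs(:,2:end) = bsxfun(@rdivide, bsxfun(@minus, X(:,2:end), mu), sigma);

% A new input is scaled with the SAME mu and sigma (hypothetical example).
x_new   = [2.5 2.5];                % new example (x1, x2)
x_new_s = [1, (x_new - mu) ./ sigma];

disp(Xs)
disp(x_new_s)
```

After scaling, every feature column of Xs has mean 0 and standard deviation 1, while the bias column still consists of ones.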

3 Hypothesis Function
In the supervised learning problem of linear regression, our goal is, given a training set Π, to learn a linear function. We define the continuous hypothesis function h(x1,x2,x3,··· ,xj ,··· ,xn):
$$\begin{aligned}
h : \mathbb{R}^n &\to \mathbb{R} \\
(x_1, x_2, x_3, \cdots, x_j, \cdots, x_n) &\mapsto \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \theta_3 \cdot x_3 + \cdots + \theta_j \cdot x_j + \cdots + \theta_n \cdot x_n
\end{aligned} \tag{3.1}$$
so that h(x1,x2,x3,··· ,xj ,··· ,xn) is a good predictor for n arbitrary input variables. For historical reasons the function is called the hypothesis. In other words, we are looking for the best values of θ0, θ1, θ2, θ3, ..., θj, ..., θn. We define the parameter vector θ
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \\ \theta_j \\ \vdots \\ \theta_n \end{bmatrix} \tag{3.2}$$
and the hypothesis-input vector x
$$x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_j \\ \vdots \\ x_n \end{bmatrix} \tag{3.3}$$
So we can write our hypothesis function h(x1,x2,x3,··· ,xj ,··· ,xn) in vectorized form.
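$$\begin{aligned}
h(x_1, x_2, x_3, \cdots, x_j, \cdots, x_n) &= \theta^T \cdot x = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \cdots & \theta_j & \cdots & \theta_n \end{bmatrix} \cdot \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_j \\ \vdots \\ x_n \end{bmatrix} \\
&= \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \theta_3 \cdot x_3 + \cdots + \theta_j \cdot x_j + \cdots + \theta_n \cdot x_n
\end{aligned} \tag{3.4}$$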
In MATLAB we initialize the parameter vector θ and start with arbitrary values for θ0, θ1, θ2, θ3, ..., θj, ..., θn (preferably all equal to 0).
theta = zeros(size(X,2),1);
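The vectorized form θᵀ·x evaluates the hypothesis for one example; stacking the examples as rows of X gives all m predictions in a single product X·θ. The sketch below illustrates both with hypothetical values (the matrix X, the parameter vector theta and the names h_single and h_all are assumptions made for the example).

```matlab
X = [1 1 2; 1 2 1; 1 3 4; 1 4 3];   % hypothetical input matrix with bias column
theta = [0; 2; 3];                   % hypothetical parameter vector (theta0, theta1, theta2)

% Single example: hypothesis-input vector x = (1, x1, x2)' and h = theta' * x.
x = X(3,:)';                         % third training example as a column vector
h_single = theta' * x;

% All m examples at once: row i of X*theta is h for training example i.
h_all = X * theta;

disp(h_single)
disp(h_all')
```

Because row i of X is the transposed hypothesis-input vector of example i, X*theta and the per-example product theta'*x agree entry by entry.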

4 Cost Function
We can measure the accuracy of our hypothesis function h(x1,x2,x3,··· ,xj ,··· ,xn) by using a cost function J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn). It takes an average (actually a fancier version of an average) of the squared differences between the results of the hypothesis for the inputs $x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, \cdots, x_j^{(i)}, \cdots, x_n^{(i)}$ and the actual outputs $y_1^{(i)}$. This function is otherwise called the "squared error function" or "mean squared error". The mean is halved (factor 1/2) as a convenience for the computation of gradient descent, as the derivative of the square term cancels the factor 1/2.
$$J(\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_j, \cdots, \theta_n) = \frac{1}{2 \cdot m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, x_2^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right)^2 \tag{4.1}$$
Unlike in the journal "Supervised Learning Problem / Linear Regression / One Feature", we refrain from expanding the sum in the cost function. For this reason we do not define any coefficients and we do not plot the cost function.
In algebra we call this function a quadratic polynomial in (n + 1) variables. Note that the cost function for linear regression is always a bowl-shaped (convex) function. This means that J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) does not have any local optima other than the one global optimum.

In MATLAB we evaluate the cost function J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) for our given input matrix X, given output matrix Y and an arbitrary parameter vector θ.
J = 1/(2*m) * sum( (X*theta - Y).^2 );
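As a quick sanity check of the cost expression, it can be evaluated for two parameter vectors on a small data set: the usual starting point θ = 0 and a θ that reproduces the outputs exactly. The data values and the helper name computeCost are hypothetical and only serve this illustration.

```matlab
X = [1 1 2; 1 2 1; 1 3 4; 1 4 3];   % hypothetical input matrix with bias column
Y = [8; 7; 18; 17];                  % hypothetical outputs, y1 = 2*x1 + 3*x2

% computeCost is a hypothetical helper name, not part of the original text.
computeCost = @(X, Y, theta) 1/(2*size(X,1)) * sum( (X*theta - Y).^2 );

J_zero = computeCost(X, Y, zeros(3,1));   % cost at the starting point theta = 0
J_fit  = computeCost(X, Y, [0; 2; 3]);    % cost at parameters that reproduce Y exactly

disp([J_zero, J_fit])
```

For θ = 0 the cost is (8² + 7² + 18² + 17²) / (2·4) = 726 / 8 = 90.75, and for the exact parameters it is 0, the bottom of the bowl.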

5 Gradient Descent Algorithm
So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That is where gradient descent comes in. Imagine that we graph the cost function over two of its parameters, θ0 and θ1; that is, we graph the cost function as a function of the parameter estimates. We will know that we have succeeded when our cost function is at the very bottom of the graph. The way we do this is by taking the derivative (the tangent line to a function) of our cost function. The slope of the tangent is the derivative at that point, and it gives us a direction to move towards. We make steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate. A smaller α results in a smaller step and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivatives of the cost function, $\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_j, \cdots, \theta_n)$, and therefore depends on where the parameter vector θ starts on the graph.
In other words, we want an efficient algorithm to find the values of θ0, θ1, θ2, θ3, ..., θj, ..., θn. The gradient descent algorithm is:
Keep changing θ0, θ1, θ2, θ3, ..., θj, ..., θn until we end up at the global minimum. Remember: we start at θ0 = 0, θ1 = 0, θ2 = 0, θ3 = 0, ···, θj = 0, ···, θn = 0.
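$$\begin{aligned}
\theta_{\mathrm{temp\_0}} &:= \theta_0 - \alpha \cdot \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \theta_0 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right) \\
\theta_{\mathrm{temp\_1}} &:= \theta_1 - \alpha \cdot \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \theta_1 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right) \cdot x_1^{(i)} \\
\theta_{\mathrm{temp\_2}} &:= \theta_2 - \alpha \cdot \frac{\partial}{\partial\theta_2} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \theta_2 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right) \cdot x_2^{(i)} \\
&\;\;\vdots \\
\theta_{\mathrm{temp\_j}} &:= \theta_j - \alpha \cdot \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \theta_j - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right) \cdot x_j^{(i)} \\
&\;\;\vdots \\
\theta_{\mathrm{temp\_n}} &:= \theta_n - \alpha \cdot \frac{\partial}{\partial\theta_n} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \theta_n - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} \left( h(x_1^{(i)}, \cdots, x_n^{(i)}) - y_1^{(i)} \right) \cdot x_n^{(i)}
\end{aligned} \tag{5.1}$$
All temporary values are computed first and only then assigned, so that every parameter is updated simultaneously:
$$\begin{aligned}
\theta_0 &:= \theta_{\mathrm{temp\_0}} \\
\theta_1 &:= \theta_{\mathrm{temp\_1}} \\
\theta_2 &:= \theta_{\mathrm{temp\_2}} \\
&\;\;\vdots \\
\theta_j &:= \theta_{\mathrm{temp\_j}} \\
&\;\;\vdots \\
\theta_n &:= \theta_{\mathrm{temp\_n}}
\end{aligned}$$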
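The sum in each update rule follows from differentiating the cost function (4.1) with the chain rule. A short worked derivation, using only the definitions (3.1) and (4.1) and the convention that the bias value is $x_0^{(i)} = 1$, could read:

```latex
\begin{aligned}
\frac{\partial}{\partial\theta_j} J(\theta_0,\cdots,\theta_n)
 &= \frac{\partial}{\partial\theta_j}\,\frac{1}{2m}\sum_{i=1}^{m}
    \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right)^2 \\
 &= \frac{1}{2m}\sum_{i=1}^{m} 2\left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right)
    \cdot \frac{\partial}{\partial\theta_j}\, h(x_1^{(i)},\cdots,x_n^{(i)}) \\
 &= \frac{1}{m}\sum_{i=1}^{m}
    \left( h(x_1^{(i)},\cdots,x_n^{(i)}) - y_1^{(i)} \right) \cdot x_j^{(i)},
 \qquad\text{since}\quad
 \frac{\partial}{\partial\theta_j}\left( \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \cdots + \theta_n x_n^{(i)} \right) = x_j^{(i)} .
\end{aligned}
```

The factor 2 from the square cancels the 1/2 in the cost function, and for j = 0 the factor $x_0^{(i)} = 1$ disappears, which is why the rule for θ0 has no trailing feature value.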
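• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
The gradient descent algorithm has succeeded if, after k iteration steps, all partial derivatives of the cost function are zero, so that the update terms vanish:
$$\begin{aligned}
& \alpha \cdot \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \alpha \cdot 0; \quad
\alpha \cdot \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \alpha \cdot 0; \\
& \alpha \cdot \frac{\partial}{\partial\theta_2} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \alpha \cdot 0; \quad \cdots; \quad
\alpha \cdot \frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \alpha \cdot 0; \quad \cdots; \\
& \alpha \cdot \frac{\partial}{\partial\theta_n} J(\theta_0, \theta_1, \theta_2, \cdots, \theta_j, \cdots, \theta_n) = \alpha \cdot 0
\end{aligned} \tag{5.2}$$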

In MATLAB we run the gradient descent algorithm with our given input matrix X, given output matrix Y and an arbitrary parameter vector θ. At the beginning we need to set the number of iteration steps iter and the learning rate alpha. To check whether our learning algorithm is working well, we calculate the value of the cost function at every step. For this reason we initialize a cost-convergence-test vector J_test. After the gradient descent algorithm is done, we plot J_test against the iteration steps iters.
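iter = 1000;                          % number of iteration steps
alpha = 0.01;                         % learning rate
J_test = zeros(iter,1);               % cost value at every iteration step
iters = (1:iter)';                    % iteration indices for the plot
for k = 1:iter
    J_test(k) = 1/(2*m) * sum( (X*theta - Y).^2 );
    theta_temp = theta;               % column vector, so theta keeps its shape
    for j = 1:size(X,2)
        theta_temp(j) = theta(j) - alpha*1/m * sum( (X*theta - Y).*X(:,j) );
    end
    theta = theta_temp;               % simultaneous update of all parameters
end
plot(iters, J_test)                   % cost over the iteration steps
disp(theta)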
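The inner loop over j can be replaced by one matrix product, because the j-th sum in the update rule is the j-th entry of Xᵀ·(X·θ − Y). The sketch below shows this vectorized variant on a small data set; the data, the learning rate 0.1 and the 5000 steps are assumptions chosen for this toy example and would need tuning for other data.

```matlab
X = [1 1 2; 1 2 1; 1 3 4; 1 4 3];   % hypothetical input matrix with bias column
Y = [8; 7; 18; 17];                  % hypothetical outputs, y1 = 2*x1 + 3*x2
m = size(X,1);

iter  = 5000;                        % assumed number of steps for this toy data
alpha = 0.1;                         % assumed learning rate for this toy data
theta = zeros(size(X,2),1);
J_test = zeros(iter,1);

for k = 1:iter
    err = X*theta - Y;                       % m x 1 vector of h(x^(i)) - y^(i)
    J_test(k) = 1/(2*m) * sum(err.^2);       % cost before the update
    theta = theta - (alpha/m) * (X' * err);  % all theta_j updated simultaneously
end

disp(theta')
```

Since the whole gradient is computed from the old θ before θ is overwritten, the simultaneous update is preserved without a temporary vector. On this data the result approaches θ = (0, 2, 3).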

6 Normal Equation Algorithm (Alternative)
Gradient descent gives one way of minimizing J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn). Let’s discuss a second way of doing so,
this time performing the minimization explicitly and without resorting to an iterative algorithm. In the "Normal
Equation" method, we will minimize J(θ0,θ1,θ2,θ3,··· ,θj ,··· ,θn) by explicitly taking its derivatives with respect to
the θj's, and setting them to zero. This allows us to find the optimum theta without iteration. The normal equation formula is given below:
$$\theta = (X^T \cdot X)^{-1} \cdot X^T \cdot Y \tag{6.1}$$
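For readers who want to see where the formula comes from, the following is a brief worked derivation. It rewrites the cost function (4.1) in matrix form, which is an equivalent notation introduced here for the derivation only, and assumes that $X^T \cdot X$ is invertible.

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m}\,(X\theta - Y)^T (X\theta - Y) \\
\nabla_\theta J(\theta) &= \frac{1}{m}\,X^T (X\theta - Y) = 0 \\
X^T X\,\theta &= X^T Y \\
\theta &= (X^T X)^{-1} X^T Y
\end{aligned}
```

The second line collects all partial derivatives with respect to the θj's into one vector; setting it to zero gives a linear system in θ.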
With the normal equation, computing the inverse has complexity $O(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.
In MATLAB we write
theta = inv(X'*X)*X'*Y;
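As a small check, the normal equation can be evaluated on data whose exact parameters are known. The sketch below uses hypothetical data with y1 = 2·x1 + 3·x2, so the expected result is θ = (0, 2, 3). It uses pinv in place of inv, and also shows the backslash operator, which solves the same least-squares problem without forming an inverse; the backslash variant is an addition for comparison, not part of the original text.

```matlab
X = [1 1 2; 1 2 1; 1 3 4; 1 4 3];   % hypothetical input matrix with bias column
Y = [8; 7; 18; 17];                  % hypothetical outputs, y1 = 2*x1 + 3*x2

theta_pinv  = pinv(X'*X) * X' * Y;   % normal equation with the pseudoinverse
theta_slash = X \ Y;                 % least-squares solve without forming an inverse

disp([theta_pinv, theta_slash])
```

Both columns of the displayed matrix should agree up to rounding, with no learning rate and no iteration involved.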

7 Important Conclusions
• Gradient Descent Algorithm
– Need to choose α
– Needs many iterations, $O(kn^2)$
– Works well when n is large
– If α is too small, gradient descent can be slow
– If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge
• Normal Equation Algorithm
– No need to choose α
– No need to iterate
– $O(n^3)$, need to calculate the inverse of $(X^T \cdot X)$
– Slow if n is very large
– When implementing the normal equation in MATLAB we want to use the pinv function rather than inv. The pinv function will give you a value of θ even if $(X^T \cdot X)$ is not invertible.
– If $(X^T \cdot X)$ is noninvertible, the common causes might be
* Redundant features, where two features are very closely related (i.e. they are linearly dependent)
* Too many features (e.g. m ≤ n). In this case, delete some features or use regularization (see other journal).
• Feature Scaling
– Needed if the range of values of the features varies widely
– Gradient descent converges much faster with feature scaling than without it
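The remark on redundant features can be illustrated with a small experiment. In the sketch below the third column of X is exactly twice the second, so $(X^T \cdot X)$ is singular; the data values are hypothetical and the variable names are chosen for this example only.

```matlab
x1 = [1; 2; 3; 4];
X  = [ones(4,1), x1, 2*x1];          % third column is redundant (2 * x1)
Y  = 3 * x1;                         % hypothetical outputs

A = X' * X;
disp(rank(A))                        % smaller than size(A,1), so A is singular

theta = pinv(A) * X' * Y;            % pinv still returns a parameter vector
disp(max(abs(X*theta - Y)))          % predictions still reproduce Y (up to rounding)
```

The returned θ is only one of infinitely many parameter vectors that fit this data equally well, which is why removing the redundant feature is the cleaner fix.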

Glossary
$\Omega$ Input set
$\breve{\Omega}$ Standardized input set
$\Upsilon$ Output set
$\mu_j$ Arithmetic mean vector
$\sigma_j$ Standard deviation
$x_0$ Bias element
$x_1$ Feature 1 element
$x_j$ Feature j element
$x_n$ Feature n element
$\breve{x}_1$ Standardized feature 1 element
$\breve{x}_2$ Standardized feature 2 element
$\breve{x}_j$ Standardized feature j element
$\breve{x}_n$ Standardized feature n element
$y_1$ Output element
$u$ Arbitrary element in the input set
$v$ Arbitrary element in the output set
$m$ Number of training examples
$\Pi$ Training set
$t(x_1^{(i)}, x_2^{(i)}, x_3^{(i)}, \cdots, x_j^{(i)}, \cdots, x_n^{(i)})$ Training function (discrete)
$T$ Training matrix
$X$ Input matrix
$Y$ Output matrix
$h(x_1, x_2, x_3, \cdots, x_j, \cdots, x_n)$ Hypothesis function (continuous)
$\theta_0$ 1st element of the parameter vector
$\theta_1$ 2nd element of the parameter vector
$\theta_2$ 3rd element of the parameter vector
$\theta_3$ 4th element of the parameter vector
$\theta_j$ (j+1)-th element of the parameter vector
$\theta_n$ (n+1)-th element of the parameter vector
$\theta$ Parameter vector
$x$ Hypothesis-input vector
$J(\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_j, \cdots, \theta_n)$ Cost function
$x_1^{(i)}$ i-th feature 1 value
$x_j^{(i)}$ i-th feature j value
$x_n^{(i)}$ i-th feature n value
$y_1^{(i)}$ i-th output value
$\alpha$ Learning rate
$\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1, \theta_2, \theta_3, \cdots, \theta_j, \cdots, \theta_n)$ Partial derivative of the cost function
$X^T$ Transpose of the input matrix
$n$ Number of feature elements