LINEAR MODEL
2016/06/08
Yagi Takayuki
REFERENCE
Pattern Recognition and Machine Learning (PRML)
Chapters 3 and 4
TABLE OF CONTENTS
1. linear regression
1.1. what is regression
1.2. linear regression
1.3. ridge regression
1.4. lasso regression
1.5. generalization
1.6. maximum likelihood estimation
1.7. MAP estimation
2. linear classification
2.1. multi-class classification
2.2. disadvantages of the least-squares method
WHAT IS REGRESSION
suppose we have some data
we want to know the line that fits the data best
WHAT IS REGRESSION
we want to know such a line in any situation
-> regression analysis
y = 0.6147 + 1.0562 x_1 is the best fit
NOTATION
x, w, t : scalars
x, w, t (bold) : vectors
X, W, T : matrices
LINEAR MODEL
the simplest model is
y(x, w) = w_0 + w_1 x_1 + ⋯ + w_D x_D
w_i : weight parameters
x_i : variables
D : the number of variables
FEATURE
linear with respect to w
linear with respect to x
model is too simple (poor expressive power)
EXTEND THE MODEL
add a linear combination of non-linear functions
y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j ϕ_j(x)

M − 1 : the number of basis functions
ϕ_j(x) : called the basis function
w_0 : called the bias parameter
LINEAR MODEL
if we add a dummy basis function ϕ_0(x) = 1,

y(x, w) = \sum_{j=0}^{M-1} w_j ϕ_j(x) = w^T ϕ(x)

w = (w_0, …, w_{M-1})^T
ϕ(x) = (ϕ_0(x), …, ϕ_{M-1}(x))^T
BASIS FUNCTION
there are various choices for the basis function
polynomial basis
gaussian basis
logistic sigmoid basis
POLYNOMIAL BASIS
ϕ_j(x) = x^j
GAUSSIAN BASIS
ϕ_j(x) = \exp\left( -\frac{(x - μ_j)^2}{2 s^2} \right)
LOGISTIC SIGMOID BASIS
ϕ_j(x) = σ\left( \frac{x - μ_j}{s} \right)

σ(a) = \frac{1}{1 + \exp(-a)}
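As a concrete illustration, here is a minimal numpy sketch of the three basis functions above; the function names and the parameter values (mu, s) are my own choices for this example, not taken from the slides.

import numpy as np

def phi_poly(x, j):
    # polynomial basis: phi_j(x) = x^j
    return x ** j

def phi_gauss(x, mu, s):
    # Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

def phi_sigmoid(x, mu, s):
    # logistic sigmoid basis: phi_j(x) = sigma((x - mu_j) / s), sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

x = np.linspace(-1.0, 1.0, 5)
print(phi_poly(x, 2))
print(phi_gauss(x, mu=0.0, s=0.5))
print(phi_sigmoid(x, mu=0.0, s=0.5))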
LINEAR MODEL
y(x, w) = w^T ϕ(x)
FEATURE
linear with respect to w
non-linear with respect to x
we can choose whichever basis we like
LINEAR REGRESSION
we want to find the best w

y(x, w) = w^T ϕ(x)

y = 0.6147 + 1.0562 x_1 is the best fit
HOW TO DO REGRESSION
by reducing the error
THE MAIN IDEA OF REGRESSION
Minimization of the error function

E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( w^T ϕ(x^{(i)}) - t^{(i)} \right)^2

N : the number of data points
x^{(i)} : the i-th data point
t^{(i)} : the target value of the i-th data point

※ \min_w E(w) is called the least-squares method
※ E(w) is called the sum-of-squares error
LINEAR REGRESSION
we want to minimize the error function
\min_w E(w)

E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( w^T ϕ(x^{(i)}) - t^{(i)} \right)^2 = \frac{1}{2} \| Φw - t \|^2

Φ = (ϕ(x^{(1)}), ϕ(x^{(2)}), …, ϕ(x^{(N)}))^T
t = (t^{(1)}, t^{(2)}, …, t^{(N)})^T
LINEAR REGRESSION
E(w) = \frac{1}{2} \| Φw - t \|^2

take the partial derivative with respect to w:
\frac{\partial E(w)}{\partial w} = Φ^T (Φw - t)

set \frac{\partial E(w)}{\partial w} = 0:
Φ^T Φ w = Φ^T t
∴ w = (Φ^T Φ)^{-1} Φ^T t
implementation
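The slide only says "implementation"; as one possible reading, here is a minimal numpy sketch of the closed-form solution w = (Φ^T Φ)^{-1} Φ^T t derived above. The polynomial design matrix and the toy data are invented for this example.

import numpy as np

# toy data: t = 0.6 + 1.0 * x + noise (invented for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = 0.6 + 1.0 * x + 0.05 * rng.standard_normal(x.shape)

# polynomial design matrix with phi_j(x) = x^j, j = 0, ..., M-1
M = 2
Phi = np.vander(x, N=M, increasing=True)

# solve Phi^T Phi w = Phi^T t rather than forming the inverse explicitly
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w)  # approximately (0.6, 1.0)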
RIDGE REGRESSION
\min_w E(w)

E(w) = \frac{1}{2} \| Φw - t \|^2 + \frac{λ}{2} \| w \|^2

※ \| w \|^2 is called the L2 regularization term
RIDGE REGRESSION
E(w) = \frac{1}{2} \| Φw - t \|^2 + \frac{λ}{2} \| w \|^2

\frac{\partial E(w)}{\partial w} = Φ^T (Φw - t) + λw = 0
(Φ^T Φ + λI) w = Φ^T t
∴ w = (Φ^T Φ + λI)^{-1} Φ^T t
implementation
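Again the slide only says "implementation"; a minimal numpy sketch of the ridge solution w = (Φ^T Φ + λI)^{-1} Φ^T t, with illustrative choices of M and λ and the same kind of toy data as before.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = 0.6 + 1.0 * x + 0.05 * rng.standard_normal(x.shape)

M, lam = 6, 1e-3                            # number of basis functions and lambda (illustrative values)
Phi = np.vander(x, N=M, increasing=True)    # polynomial design matrix, phi_j(x) = x^j

# ridge solution: w = (Phi^T Phi + lambda I)^{-1} Phi^T t
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
print(w_ridge)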
LASSO REGRESSION
E(w) = \frac{1}{2} \| Φw - t \|^2 + \frac{λ}{2} \sum_{j=1}^{M-1} |w_j|

※ \sum_j |w_j| is called the L1 regularization term
LASSO REGRESSION
cannot be solved analytically (the L1 term is non-differentiable)
solved by coordinate descent
performs variable selection (drives some of the parameters to 0)
implementation
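The slide gives no code, so here is a minimal coordinate-descent sketch for the lasso objective exactly as written above, E(w) = 1/2 ||Φw − t||^2 + (λ/2) Σ_j |w_j|. The soft-thresholding update is the standard one for this objective; the toy data and λ are invented, and for simplicity every coefficient is penalized (the design matrix here has no bias column). In practice a library routine such as scikit-learn's Lasso (which scales the objective slightly differently) would usually be used instead.

import numpy as np

def soft_threshold(z, gamma):
    # S(z, gamma) = sign(z) * max(|z| - gamma, 0)
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(Phi, t, lam, n_iter=200):
    # minimize 1/2 ||Phi w - t||^2 + (lam / 2) * sum_j |w_j| by cyclic coordinate descent
    M = Phi.shape[1]
    w = np.zeros(M)
    col_sq = (Phi ** 2).sum(axis=0)              # ||Phi[:, j]||^2 for each column
    for _ in range(n_iter):
        for j in range(M):
            r = t - Phi @ w + Phi[:, j] * w[j]   # residual with coordinate j removed
            w[j] = soft_threshold(Phi[:, j] @ r, lam / 2.0) / col_sq[j]
    return w

# toy data (invented): only two of the five features actually matter
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 5))
t = 2.0 * Phi[:, 0] - 1.0 * Phi[:, 2] + 0.1 * rng.standard_normal(50)
print(lasso_coordinate_descent(Phi, t, lam=1.0))   # irrelevant coefficients are driven toward 0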
GENERALIZATION
a generalization of ridge and lasso

E(w) = \frac{1}{2} \| Φw - t \|^2 + \frac{λ}{2} \sum_{j=1}^{M-1} |w_j|^q
RE-EXPRESSION
\min_w \left( \frac{1}{2} \| Φw - t \|^2 + \frac{λ}{2} \sum_{j=1}^{M} |w_j|^q \right)

is equal to

\min_w \frac{1}{2} \| Φw - t \|^2   s.t.   \sum_{j=1}^{M} |w_j|^q ≤ η

※ η is calculated from the Lagrange multiplier method
IMAGE
left : ridge regression
right : lasso regression
PROOF
\min_w \frac{1}{2} \sum_{i=1}^{N} \left( w^T ϕ(x^{(i)}) - t^{(i)} \right)^2   s.t.   \sum_{j=1}^{M} |w_j|^q ≤ η

by using the Lagrange multiplier method:
L(w, λ) = \frac{1}{2} \sum_{i=1}^{N} \left( w^T ϕ(x^{(i)}) - t^{(i)} \right)^2 + \frac{λ}{2} \left( \sum_{j=1}^{M} |w_j|^q - η \right)

by using the KKT conditions:
\frac{\partial L(w, λ)}{\partial λ} = \sum_{j=1}^{M} |w_j|^q - η = 0
∴ \sum_{j=1}^{M} |w_j^*|^q = η
MAXIMIZE LIKELIHOOD AND SUM-OF-SQUARES ERROR
we assume t is represented by the sum of y(x, w) and Gaussian noise ε
t = y(x, w) + ε

MAXIMIZE LIKELIHOOD AND SUM-OF-SQUARES ERROR
p(t | x, w, β) = N(t | y(x, w), β^{-1})

N(x | μ, σ^2) = \frac{1}{(2π σ^2)^{1/2}} \exp\left( -\frac{1}{2σ^2} (x - μ)^2 \right)
LIKELIHOOD
p(t | x, w, β) = \prod_{n=1}^{N} N(t_n | w^T ϕ(x_n), β^{-1})

※ t = (t_1, t_2, …, t_N)^T
MAXIMIZE LIKELIHOOD
\ln p(t | x, w, β) = \sum_{n=1}^{N} \ln N(t_n | w^T ϕ(x_n), β^{-1})

\ln p(t | x, w, β) = -\frac{β}{2} \sum_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 + \frac{N}{2} \ln β - \frac{N}{2} \ln(2π)

therefore, maximizing the likelihood is equal to minimizing the sum-of-squares error
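To spell out the "therefore": the terms \frac{N}{2} \ln β and -\frac{N}{2} \ln(2π) do not depend on w, so

\arg\max_w \ln p(t | x, w, β) = \arg\max_w \left( -\frac{β}{2} \sum_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 \right) = \arg\min_w \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T ϕ(x_n))^2

because β > 0.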
FORMULA TRANSFORMATION
p(t | X, w, β) = \prod_{n=1}^{N} N(t_n | w^T ϕ(x_n), β^{-1})

p(t | X, w, β) = \prod_{n=1}^{N} \left( \frac{β}{2π} \right)^{1/2} \exp\left( -\frac{β}{2} (w^T ϕ(x_n) - t_n)^2 \right)

\ln p(t | X, w, β) = -\frac{β}{2} \sum_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 + \frac{N}{2} \ln β - \frac{N}{2} \ln(2π)
MAP ESTIMATION AND L2 REGULARIZATION
we add a prior distribution.
by using Bayes' theorem:
p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)

PRIOR DISTRIBUTION
we use this prior distribution because the calculation is easy:

p(w | α) = N(w | 0, α^{-1} I) = \left( \frac{α}{2π} \right)^{(M+1)/2} \exp\left( -\frac{α}{2} w^T w \right)
MAP ESTIMATION
\max_w p(w | x, t, α, β)

is equal to

\min_w \left( \frac{1}{2} \| Φw - t \|^2 + \frac{λ}{2} \| w \|^2 \right)

p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)
p(t | x, w, β) = \prod_{n=1}^{N} N(t_n | w^T ϕ(x_n), β^{-1})
p(w | α) = N(w | 0, α^{-1} I) = \left( \frac{α}{2π} \right)^{(M+1)/2} \exp\left( -\frac{α}{2} w^T w \right)

so, MAP estimation is equal to ridge regression (with λ = α/β)
FORMULA TRANSFORMATION
p(t | x, w, β) p(w | α) = \left[ \prod_{n=1}^{N} \left( \frac{β}{2π} \right)^{1/2} \exp\left( -\frac{β}{2} (w^T ϕ(x_n) - t_n)^2 \right) \right] \left( \frac{α}{2π} \right)^{(M+1)/2} \exp\left( -\frac{α}{2} w^T w \right)

\ln \left[ p(t | x, w, β) p(w | α) \right] = -\frac{β}{2} \sum_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 + \frac{N}{2} \ln β - \frac{N}{2} \ln(2π) + \frac{M+1}{2} \ln α - \frac{M+1}{2} \ln(2π) - \frac{α}{2} w^T w
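To spell out the correspondence: since p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α), maximizing the posterior is the same as minimizing the negative of this log-expression. Dropping the terms that do not depend on w and dividing by β > 0 gives

-\frac{1}{β} \ln \left[ p(t | x, w, β) p(w | α) \right] + \text{const} = \frac{1}{2} \sum_{n=1}^{N} (t_n - w^T ϕ(x_n))^2 + \frac{α}{2β} w^T w

which is the ridge objective with λ = α/β.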
SUMMARY OF LINEAR REGRESSION
I introduced some linear regression models
I showed that maximizing the likelihood is equal to minimizing the sum-of-squares error
I showed that MAP estimation is equal to ridge regression
LINEAR CLASSIFICATION
we consider K-class (K > 2) classification
so, we prepare K linear models

y_k(x) = w_k^T ϕ(x)
y(x) = \tilde{W}^T ϕ(x)
\tilde{W} = (w_1, w_2, …, w_K)
y(x) = (y_1(x), y_2(x), …, y_K(x))^T
1-OF-K CODING
we prepare a vector p

p = (c_1, c_2, …, c_K)^T
c_i = 1 (if x is class i)
c_i = 0 (if x is not class i)

ex) p = (0, 0, …, 0, 1, 0, …, 0)^T
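A minimal numpy sketch of 1-of-K coding; the integer labels and the value of K here are invented for illustration.

import numpy as np

def one_of_k(labels, K):
    # encode integer class labels 0 .. K-1 as 1-of-K target vectors
    return np.eye(K)[labels]

labels = np.array([0, 2, 1, 2])      # toy labels for a 3-class problem
P = one_of_k(labels, K=3)
print(P)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]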
LEAST-SQUARES METHOD

E(\tilde{W}) = \frac{1}{2} \sum_{i=1}^{N} \| y(x^{(i)}) - p^{(i)} \|^2
            = \frac{1}{2} \sum_{i=1}^{N} \| \tilde{W}^T ϕ(x^{(i)}) - p^{(i)} \|^2
            = \frac{1}{2} \sum_{i=1}^{N} ( \tilde{W}^T ϕ(x^{(i)}) - p^{(i)} )^T ( \tilde{W}^T ϕ(x^{(i)}) - p^{(i)} )
            = \frac{1}{2} \sum_{i=1}^{N} \left( ϕ(x^{(i)})^T \tilde{W} \tilde{W}^T ϕ(x^{(i)}) - 2 ϕ(x^{(i)})^T \tilde{W} p^{(i)} + \| p^{(i)} \|^2 \right)

\frac{\partial E(\tilde{W})}{\partial \tilde{W}} = \sum_{i=1}^{N} \left( ϕ(x^{(i)}) ϕ(x^{(i)})^T \tilde{W} - ϕ(x^{(i)}) (p^{(i)})^T \right)
LEAST-SQUARES METHOD
\frac{\partial E(\tilde{W})}{\partial \tilde{W}} = X^T X \tilde{W} - X^T P

setting \frac{\partial E(\tilde{W})}{\partial \tilde{W}} = 0:
X^T X \tilde{W} = X^T P
∴ \tilde{W} = (X^T X)^{-1} X^T P

X = (ϕ(x^{(1)}), ϕ(x^{(2)}), …, ϕ(x^{(N)}))^T
P = (p^{(1)}, p^{(2)}, …, p^{(N)})^T
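A minimal numpy sketch of the closed-form solution \tilde{W} = (X^T X)^{-1} X^T P above. The toy two-dimensional data, the added constant feature ϕ_0(x) = 1, and the use of argmax over y(x) to pick a class are my own illustrative choices.

import numpy as np

# toy 3-class data in 2D (invented for illustration)
rng = np.random.default_rng(0)
K, N = 3, 150
labels = rng.integers(0, K, size=N)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X_raw = centers[labels] + 0.5 * rng.standard_normal((N, 2))

# design matrix with a constant feature phi_0(x) = 1, and 1-of-K targets
X = np.hstack([np.ones((N, 1)), X_raw])     # shape (N, M) with M = 3
P = np.eye(K)[labels]                       # shape (N, K)

# least-squares solution: W = (X^T X)^{-1} X^T P
W = np.linalg.solve(X.T @ X, X.T @ P)       # shape (M, K)

# classify by the largest of the K linear outputs y_k(x) = w_k^T phi(x)
pred = np.argmax(X @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))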
WEAK TO OUTLIERS
red : least-squares method
green : logistic regression
ASSUMING A NORMAL DISTRIBUTION
left : least-squares method
right : logistic regression
DISADVANTAGES OF THE LEAST-SQUARES METHOD
it does not handle the labels as probabilities
it is weak to outliers
it assumes a normal distribution
if the data does not follow a normal distribution, the result is bad
we should not use the least-squares method for classification problems
SUMMARY
I introduced linear models (regression and classification)
they are the basis for some other machine learning models
PRML is difficult for me, but I want to continue reading
thank you
