GRADIENT METHOD
2016/05/11
YAGI TAKAYUKI
TABLE OF CONTENTS
1. where it is used
1.1 what is regression
1.2 how to perform regression
2. introduction to some gradient methods
3. comparison of methods (benchmark)
REGRESSION ANALYSIS
keep fitting the model to the data little by little
HOW TO PERFORM REGRESSION
reduce the error
THE MAIN IDEA OF REGRESSION
Minimization of the error function
this is called the least squares method
E = (1/2) Σ_{i=1}^{n} ( y_i − f(x_i) )²
SOLVE f′(x) = 0
it is difficult to solve this equation directly
instead, we use an optimization technique
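for concreteness, a minimal sketch of computing E for a linear model f(x) = a·x + b, assuming NumPy and a small made-up data set (the data and the model are illustrative, not from the slides):

```python
import numpy as np

def squared_error(a, b, x, y):
    """E = 1/2 * sum_i (y_i - f(x_i))^2 for the linear model f(x) = a*x + b."""
    residuals = y - (a * x + b)
    return 0.5 * np.sum(residuals ** 2)

# toy data (hypothetical example values)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])
print(squared_error(1.0, 0.0, x, y))
```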
WHERE IS IT USED?
optimization techniques are used to solve optimization problems
machine learning is a kind of optimization problem
KINDS OF OPTIMIZATION ALGORITHMS
using gradient
steepest descent method
Newton's method
conjugate gradient method
not using gradient
genetic algorithm (GA)
simulated annealing (SA)
tabu search (TS)
I will introduce optimization algorithms that use a gradient
THE ALGORITHMS I WILL INTRODUCE
steepest descent method
momentum method
Nesterov's accelerated gradient method
Newton-Raphson method
conjugate gradient method
quasi-Newton method
AdaGrad
RMSprop
AdaDelta
Adam
TODAY'S GOAL
briefly introduce each algorithm
perform a benchmark test
CONFIGURATION
we consider the minimization of f(x): we want to find the x that gives the minimum of f(x)
x is an n-dimensional vector
∇f is the gradient of f
H is the Hessian matrix
STEEPEST DESCENT METHOD
a representative example of an optimization algorithm that uses the gradient
STEEPEST DESCENT METHOD
1. initialize x
2. update x_{k+1} ← x_k − α ∇f(x_k)
3. go back to step 2 until |x_{k+1} − x_k| < ε
4. return x
※ ∇f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, …, ∂f(x)/∂x_n )
from here on, I will show only the update expression for each method
STEEPEST DESCENT METHOD
x_{k+1} ← x_k − α ∇f(x_k)
α is the step size
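a minimal sketch of this update rule, assuming NumPy and an illustrative quadratic objective f(x) = x1² + 2·x2² (the objective, α, and ε are assumptions, not from the slides):

```python
import numpy as np

def steepest_descent(grad_f, x0, alpha=0.1, eps=1e-6, max_iter=10000):
    """x_{k+1} <- x_k - alpha * grad_f(x_k), stopped when the step is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x)
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

# example: f(x) = x1^2 + 2*x2^2, so grad f = (2*x1, 4*x2)
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
print(steepest_descent(grad_f, [3.0, -2.0]))   # should approach (0, 0)
```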
FEATURE
implementation is easy
tends to get stuck in a local optimum
-> explore from many initial values
-> add randomness (Stochastic Gradient Descent)
we need to calculate ∇f
MOMENTUM METHOD
1. v_{k+1} ← β v_k − α ∇f(x_k)
2. x_{k+1} ← x_k + v_{k+1}
it reuses the previous gradients through v
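a sketch of the momentum update under the same assumed quadratic objective (β = 0.9 and α = 0.1 are illustrative values):

```python
import numpy as np

def momentum(grad_f, x0, alpha=0.1, beta=0.9, n_iter=200):
    """v_{k+1} <- beta*v_k - alpha*grad_f(x_k);  x_{k+1} <- x_k + v_{k+1}."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = beta * v - alpha * grad_f(x)
        x = x + v
    return x

grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # f = x1^2 + 2*x2^2
print(momentum(grad_f, [3.0, -2.0]))   # moves toward (0, 0)
```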
NESTEROV'S ACCELERATED GRADIENT METHOD
1. v_{k+1} ← β v_k − α ∇f(x_k + β v_k)
2. x_{k+1} ← x_k + v_{k+1}
similar to the momentum method, but the gradient is evaluated at the look-ahead point x_k + β v_k
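a sketch of the Nesterov update; the only change from the momentum sketch is where the gradient is evaluated (parameters are again illustrative):

```python
import numpy as np

def nesterov(grad_f, x0, alpha=0.1, beta=0.9, n_iter=200):
    """v_{k+1} <- beta*v_k - alpha*grad_f(x_k + beta*v_k);  x_{k+1} <- x_k + v_{k+1}."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = beta * v - alpha * grad_f(x + beta * v)   # gradient at the look-ahead point
        x = x + v
    return x

grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # f = x1^2 + 2*x2^2
print(nesterov(grad_f, [3.0, -2.0]))   # moves toward (0, 0)
```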
FEATURE
uses the previous gradients
if the gradient direction stays the same -> bigger steps
if the gradient direction changes -> smaller steps
NEWTON METHOD
x_{k+1} ← x_k − f′(x_k) / f″(x_k)
move to the extremum of the second-order (quadratic) approximation of f
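a sketch of the one-dimensional Newton update, using an assumed example function f(x) = x⁴ − 3x² + 2 (the function and iteration count are illustrative):

```python
def newton_1d(df, d2f, x0, n_iter=20):
    """x_{k+1} <- x_k - f'(x_k) / f''(x_k): jump to the extremum of the quadratic approximation."""
    x = x0
    for _ in range(n_iter):
        x = x - df(x) / d2f(x)
    return x

# example: f(x) = x^4 - 3x^2 + 2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6
print(newton_1d(lambda x: 4 * x**3 - 6 * x, lambda x: 12 * x**2 - 6, x0=2.0))
```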
NEWTON-RAPHSON METHOD
x_{k+1} ← x_k − H⁻¹ ∇f(x_k)
extends Newton's method to many variables
HESSIAN MATRIX
H is the matrix of second derivatives: H_{ij} = ∂²f(x) / ∂x_i ∂x_j
FEATURE
quadratic convergence (execution time is fast)
requires the inverse of the Hessian matrix (computational cost is high)
preparing the Hessian matrix is difficult
computing the inverse matrix costs O(n³)
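a sketch of the Newton-Raphson update, assuming NumPy; it solves H·d = ∇f instead of forming H⁻¹ explicitly, though the cost is still O(n³) (the example objective is an assumption):

```python
import numpy as np

def newton_raphson(grad_f, hess_f, x0, n_iter=20):
    """x_{k+1} <- x_k - H^{-1} grad_f(x_k); solve H d = grad instead of inverting H."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.solve(hess_f(x), grad_f(x))
        x = x - d
    return x

# example: f(x) = x1^2 + 2*x2^2
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_raphson(grad_f, hess_f, [3.0, -2.0]))   # exact after one step for a quadratic
```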
CONJUGATE GRADIENT METHOD
1. β_k ← − (m_{k−1}, H_k ∇f(x_k)) / (m_{k−1}, H_k m_{k−1})
2. m_k ← ∇f(x_k) + β_k m_{k−1}
3. x_{k+1} ← x_k + α m_k
(a, b) is the inner product of a and b
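a sketch of the classic (linear) conjugate gradient method for a quadratic objective f(x) = ½·xᵀHx − bᵀx, assuming NumPy; with an exact line search for α, the slide's β reduces to the simpler residual ratio used below (a standard equivalence, stated here as a sketch, not the slide's exact form):

```python
import numpy as np

def conjugate_gradient(H, b, x0, n_iter=None, tol=1e-10):
    """Minimize f(x) = 0.5*x^T H x - b^T x for symmetric positive definite H.
    Classic linear CG: converges in at most n steps."""
    x = np.asarray(x0, dtype=float)
    r = b - H @ x          # r = -grad f(x)
    m = r.copy()           # search direction
    n_iter = n_iter or len(b)
    for _ in range(n_iter):
        Hm = H @ m
        alpha = (r @ r) / (m @ Hm)      # exact line search for a quadratic
        x = x + alpha * m
        r_new = r - alpha * Hm
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        m = r_new + beta * m
        r = r_new
    return x

H = np.array([[2.0, 0.0], [0.0, 4.0]])   # Hessian of f = x1^2 + 2*x2^2
b = np.zeros(2)
print(conjugate_gradient(H, b, [3.0, -2.0]))   # -> (0, 0)
```

for general (non-quadratic) f, variants such as Fletcher-Reeves choose β from gradients only, which is why the Hessian itself is not needed in practice.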
FEATURE
there is no need to calculate Hessian matrix
execution time is fast
QUASI NEWTON METHOD
H is approximated (e.g. by the BFGS update) instead of being computed exactly
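rather than hand-coding a quasi-Newton update, a sketch using SciPy's built-in BFGS solver (SciPy and the example objective are assumptions; the slides do not name a library):

```python
import numpy as np
from scipy.optimize import minimize

# example objective: f(x) = x1^2 + 2*x2^2 and its gradient
f = lambda x: x[0] ** 2 + 2.0 * x[1] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])

# BFGS builds an approximation of the (inverse) Hessian from successive gradients
result = minimize(f, x0=[3.0, -2.0], jac=grad_f, method="BFGS")
print(result.x)   # close to (0, 0)
```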
ADAGRAD
1. r_{k+1} ← r_k + ∇f(x_k)²
2. x_{k+1} ← x_k − α / (√r_{k+1} + ε) · ∇f(x_k)
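a sketch of AdaGrad under the same assumed quadratic objective (α, ε, and the iteration count are illustrative):

```python
import numpy as np

def adagrad(grad_f, x0, alpha=0.5, eps=1e-8, n_iter=500):
    """r_{k+1} <- r_k + grad^2 (element-wise);  x_{k+1} <- x_k - alpha/(sqrt(r)+eps) * grad."""
    x = np.asarray(x0, dtype=float)
    r = np.zeros_like(x)
    for _ in range(n_iter):
        g = grad_f(x)
        r = r + g ** 2
        x = x - alpha / (np.sqrt(r) + eps) * g
    return x

grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # f = x1^2 + 2*x2^2
print(adagrad(grad_f, [3.0, -2.0]))   # moves toward (0, 0)
```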
RMSPROP
1. r_{k+1} ← γ r_k + (1 − γ) ∇f(x_k)²
2. x_{k+1} ← x_k − α / (√r_{k+1} + ε) · ∇f(x_k)
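a sketch of RMSProp; compared with AdaGrad, only the accumulation of r changes (γ = 0.9 and the other values are illustrative):

```python
import numpy as np

def rmsprop(grad_f, x0, alpha=0.1, gamma=0.9, eps=1e-8, n_iter=500):
    """r_{k+1} <- gamma*r_k + (1-gamma)*grad^2;  x_{k+1} <- x_k - alpha/(sqrt(r)+eps) * grad."""
    x = np.asarray(x0, dtype=float)
    r = np.zeros_like(x)
    for _ in range(n_iter):
        g = grad_f(x)
        r = gamma * r + (1.0 - gamma) * g ** 2
        x = x - alpha / (np.sqrt(r) + eps) * g
    return x

grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # f = x1^2 + 2*x2^2
print(rmsprop(grad_f, [3.0, -2.0]))   # moves toward (0, 0)
```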
ADADELTA
1. r_{k+1} ← γ r_k + (1 − γ) ∇f(x_k)²
2. v_{k+1} ← ( √(s_k + ε) / √(r_{k+1} + ε) ) · ∇f(x_k)
3. s_{k+1} ← γ s_k + (1 − γ) v_{k+1}²
4. x_{k+1} ← x_k − v_{k+1}
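a sketch of AdaDelta; note that no step size α appears (γ, ε, and the iteration count are illustrative):

```python
import numpy as np

def adadelta(grad_f, x0, gamma=0.95, eps=1e-6, n_iter=2000):
    """Scale the step by the ratio of the running RMS of past updates to the running RMS of gradients."""
    x = np.asarray(x0, dtype=float)
    r = np.zeros_like(x)   # running average of squared gradients
    s = np.zeros_like(x)   # running average of squared updates
    for _ in range(n_iter):
        g = grad_f(x)
        r = gamma * r + (1.0 - gamma) * g ** 2
        v = np.sqrt(s + eps) / np.sqrt(r + eps) * g
        s = gamma * s + (1.0 - gamma) * v ** 2
        x = x - v
    return x

grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # f = x1^2 + 2*x2^2
print(adadelta(grad_f, [3.0, -2.0]))   # moves toward (0, 0)
```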
ADAM
1. v_{k+1} ← β v_k + (1 − β) ∇f(x_k)
2. r_{k+1} ← γ r_k + (1 − γ) ∇f(x_k)²
3. x_{k+1} ← x_k − α · ( v_{k+1} / (1 − β^t) ) / ( √( r_{k+1} / (1 − γ^t) ) + ε )
(t is the iteration number)
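a sketch of Adam with bias correction; β = 0.9, γ = 0.999, and α = 0.1 are illustrative values (in the original paper these coefficients are called β₁ and β₂):

```python
import numpy as np

def adam(grad_f, x0, alpha=0.1, beta=0.9, gamma=0.999, eps=1e-8, n_iter=500):
    """Adam: momentum-style average v and RMSProp-style average r, both bias-corrected."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    r = np.zeros_like(x)
    for t in range(1, n_iter + 1):
        g = grad_f(x)
        v = beta * v + (1.0 - beta) * g
        r = gamma * r + (1.0 - gamma) * g ** 2
        v_hat = v / (1.0 - beta ** t)        # bias correction
        r_hat = r / (1.0 - gamma ** t)
        x = x - alpha * v_hat / (np.sqrt(r_hat) + eps)
    return x

grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])  # f = x1^2 + 2*x2^2
print(adam(grad_f, [3.0, -2.0]))   # moves toward (0, 0)
```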
BENCHMARK TEST 1 (MNIST)
compared method
steepest descent method
momentum method
Nesterov's accelerated gradient method
AdaGrad
RMSProp
AdaDelta
Adam
※ methods that use the Hessian matrix cannot be used here (the number of parameters is too large)
we can download the MNIST data here
NEURAL NETWORK
I used a neural network model in this benchmark test
THE STRUCTURE OF NN
input layer: 784 neurons (28*28 pixels)
hidden layer: 100 neurons
output layer: 10 neurons (10 classes)
activation function of the hidden layer: sigmoid function
activation function of the output layer: softmax function
optimization method: each of the compared methods
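one possible way to reproduce this network, assuming Keras purely as an illustration (the slides do not name a library, and the loss function below is also an assumption):

```python
# a minimal sketch of the benchmark network; Keras and the loss are assumptions
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(100, activation="sigmoid", input_dim=784),  # hidden layer: 100 neurons
    Dense(10, activation="softmax"),                  # output layer: 10 classes
])
model.compile(optimizer="adam",                       # swap in each method being compared
              loss="categorical_crossentropy",        # assumed; the slides do not specify it
              metrics=["accuracy"])
```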
PARAMETER
I set the parameters of each method by referring to this website
I selected the default parameters of each method
RESULT1
AdaDelta, Adam, and RMSProp showed good results
note that I did not do any parameter tuning
in many cases, Adam seems to give the best result
IS ADAM ALWAYS THE BEST APPROACH?
the optimal method must be selected depending on the problem
methods that use the Hessian matrix are very fast
it is better to choose one of them, if you can
BENCHMARK TEST 2 (REGRESSION)
compared method
steepest descent method
Adam
quasi-Newton method
THE STRUCTURE OF NN
input layer: 1 neuron
hidden layer: 3 neurons
output layer: 1 neuron
activation function of the hidden layer: sigmoid function
activation function of the output layer: identity function
optimization method: each of the compared methods
RESULT2-1
RESULT2-2
RESULT2-3
RESULT2-4
RESULT2
the quasi-Newton method showed the best result
it is better to choose it, if you can
CONCLUSION
we should choose the method depending on the problem
thank you