2. TABLE OF CONTENTS
1. where it is used
1.1 what is regression
1.2 how to do regression
2. introduction of some gradient methods
3. comparison of methods (benchmark)
16. THE MAIN IDEA OF REGRESSION
minimization of the error function
this is called the least squares method
E = (1/2) Σ_{i=1}^{n} (y_i − f(x_i))^2
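The least squares error above can be sketched in a few lines of Python; the linear model f(x) = a·x + b and the data points are my own toy choices for illustration:

```python
import numpy as np

# Least squares error E = 1/2 * sum_i (y_i - f(x_i))^2
# for a hypothetical linear model f(x) = a*x + b.
def least_squares_error(x, y, a, b):
    residuals = y - (a * x + b)
    return 0.5 * np.sum(residuals ** 2)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])
print(least_squares_error(x, y, 2.0, 1.0))  # perfect fit -> 0.0
```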
17. SOLVE f'(x) = 0
it is difficult to solve this equation directly
instead, we use an optimization technique
18. WHERE IT IS USED ?
optimization techniques are used in optimization problems
machine learning is a kind of optimization problem
19. KINDS OF OPTIMIZATION ALGORITHMS
using gradient
steepest descent method
Newton's method
conjugate gradient method
not using gradient
genetic algorithm (GA)
simulated annealing (SA)
tabu search (TS)
I will introduce optimization algorithms that use a gradient
20. THE ALGORITHMS I WILL INTRODUCE
steepest descent method
momentum method
Nesterov's accelerated gradient method
Newton-Raphson method
conjugate gradient method
quasi-Newton method
AdaGrad
RMSprop
AdaDelta
Adam
22. CONFIGURATION
we consider minimization of f(x): we want to know the x which minimizes f(x)
x is an n-dimensional vector
∇f is the gradient of f
H is the Hessian matrix
24. STEEPEST DESCENT METHOD
1. initialize x
2. update x_{k+1} ← x_k − α∇f(x_k)
3. go back to step 2 until |x_{k+1} − x_k| < ε
4. return x
※ ∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, …, ∂f(x)/∂x_n)
hereafter, I will show only the update expression of each method
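The steepest descent steps above can be sketched as follows; the quadratic objective and the values of α and ε are my own toy choices:

```python
import numpy as np

# Steepest descent: x_{k+1} = x_k - alpha * grad_f(x_k),
# stopping when |x_{k+1} - x_k| < eps.
# Toy objective: f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimized at (1, -2).
def grad_f(x):
    return np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])

def steepest_descent(x0, alpha=0.1, eps=1e-8, max_iter=10000):
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x)
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

print(steepest_descent([0.0, 0.0]))  # ≈ [1, -2]
```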
27. FEATURES
implementation is easy
it easily gets stuck in a local optimum
→ explore from many initial values
→ add randomness (stochastic gradient descent)
we need to calculate ∇f
34. FEATURES
quadratic convergence (execution is fast)
requires the inverse of the Hessian matrix (computational cost is high)
preparing the Hessian matrix is difficult
computing the inverse matrix costs O(n^3)
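A minimal sketch of a Newton step x ← x − H⁻¹∇f(x), assuming a toy quadratic objective (the matrix A and vector b are illustrative choices, not from the slides); on a quadratic, one step lands on the minimizer, which illustrates the fast convergence noted above:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x, so grad f = Ax - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad_f(x):
    return A @ x - b

def hessian(x):
    return A  # constant for a quadratic

x = np.zeros(2)
# One Newton step; linalg.solve is the O(n^3) linear solve mentioned above.
x = x - np.linalg.solve(hessian(x), grad_f(x))
print(np.allclose(A @ x, b))  # True: the gradient is (near) zero
```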
35. CONJUGATE GRADIENT METHOD
1. β_k ← − (m_{k−1}, H_k ∇f(x_k)) / (m_{k−1}, H_k m_{k−1})
2. m_k ← ∇f(x_k) + β_k m_{k−1}
3. x_{k+1} ← x_k + α m_k
(a, b) is the inner product of a and b
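For a quadratic objective, the conjugate gradient idea can be sketched with the standard linear CG iteration (this textbook variant builds the search direction from the residual −∇f, a sign convention that differs slightly from the slide; the example matrix is my own choice):

```python
import numpy as np

# Linear conjugate gradient for f(x) = 0.5 x^T A x - b^T x,
# i.e. solving Ax = b for symmetric positive definite A.
def conjugate_gradient(A, b, x0, eps=1e-10, max_iter=100):
    x = np.array(x0, dtype=float)
    r = b - A @ x          # residual = -grad f(x)
    p = r.copy()           # initial search direction
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)   # exact line search for a quadratic
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < eps:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p         # new direction, conjugate to the old
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, [0.0, 0.0])
print(np.allclose(A @ x, b))  # True
```

In exact arithmetic CG reaches the minimizer of an n-dimensional quadratic in at most n iterations, without ever forming H⁻¹.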
39. RMSPROP
1. r_{k+1} ← γ r_k + (1 − γ)(∇f(x_k))^2
2. x_{k+1} ← x_k − α/(√r_{k+1} + ε) ∇f(x_k)
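The two RMSprop steps above can be sketched on a toy function (f(x) = x², with typical values of γ, α, and ε that are my own choices, not from the slides):

```python
import numpy as np

# RMSprop on the toy objective f(x) = x^2, so grad f = 2x.
def grad_f(x):
    return 2 * x

x, r = 5.0, 0.0
gamma, alpha, eps = 0.9, 0.1, 1e-8
for _ in range(500):
    g = grad_f(x)
    r = gamma * r + (1 - gamma) * g ** 2      # step 1: gradient-square EMA
    x = x - alpha / (np.sqrt(r) + eps) * g    # step 2: scaled update
print(x)
```

Note that with a constant α the iterate settles in a small neighborhood of the optimum rather than converging exactly, since the effective step approaches α·sign(∇f).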
40. ADADELTA
1. r_{k+1} ← γ r_k + (1 − γ)(∇f(x_k))^2
2. v_{k+1} ← (√(s_k + ε) / √(r_{k+1} + ε)) ∇f(x_k)
3. s_{k+1} ← γ s_k + (1 − γ) v_{k+1}^2
4. x_{k+1} ← x_k − v_{k+1}
41. ADAM
1. v_{k+1} ← β v_k + (1 − β)∇f(x_k)
2. r_{k+1} ← γ r_k + (1 − γ)(∇f(x_k))^2
3. x_{k+1} ← x_k − α (v_{k+1}/(1 − β^t)) / (√(r_{k+1}/(1 − γ^t)) + ε)
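The three Adam steps above can be sketched on the same toy function (f(x) = x²; the hyperparameters are the commonly used defaults, not values from the slides):

```python
import numpy as np

# Adam on the toy objective f(x) = x^2, so grad f = 2x.
def grad_f(x):
    return 2 * x

x, v, r = 5.0, 0.0, 0.0
alpha, beta, gamma, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad_f(x)
    v = beta * v + (1 - beta) * g              # step 1: first-moment EMA
    r = gamma * r + (1 - gamma) * g ** 2       # step 2: second-moment EMA
    v_hat = v / (1 - beta ** t)                # bias correction (1 - beta^t)
    r_hat = r / (1 - gamma ** t)               # bias correction (1 - gamma^t)
    x = x - alpha * v_hat / (np.sqrt(r_hat) + eps)  # step 3: update
print(x)
```

The bias-correction terms matter early on: since v and r start at zero, dividing by (1 − β^t) and (1 − γ^t) compensates for their initial underestimation.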
42. BENCHMARK TEST1 BY MNIST
compared method
steepest descent method
momentum method
Nesterov's accelerated gradient method
AdaGrad
RMSProp
AdaDelta
Adam
※ we cannot use the methods that require the Hessian matrix
45. THE STRUCTURE OF NN
input layer : 784 neurons (28×28 pixels)
hidden layer : 100 neurons
output layer : 10 neurons (10 classes)
activation function of hidden layer : sigmoid function
activation function of output layer : softmax function
optimization method : each compared method
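The network structure above can be sketched as a forward pass (the random weights and input are placeholders; a real run would use trained weights and MNIST data):

```python
import numpy as np

# Forward pass for a 784 -> 100 -> 10 network:
# sigmoid hidden layer, softmax output layer.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (100, 784)), np.zeros(100)
W2, b2 = rng.normal(0, 0.01, (10, 100)), np.zeros(10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

x = rng.random(784)            # a dummy 28x28 image, flattened
h = sigmoid(W1 @ x + b1)       # hidden layer: 100 neurons
y = softmax(W2 @ h + b2)       # output: 10 class probabilities
print(y.shape, round(float(y.sum()), 6))  # (10,) 1.0
```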
46. PARAMETER
I set the parameters of each method by reference to this website
I selected the default parameters of each method
48. RESULT1
AdaDelta, Adam, and RMSProp showed good results
note that I did not do any parameter tuning
in many cases, Adam seems to give the best result
49. IS ADAM ALWAYS BEST APPROACH??
the optimal method must be selected depending on the problem
methods using the Hessian matrix are very fast
it is better to choose one of them if you can
50. BENCHMARK TEST2 BY REGRESSION
compared method
steepest descent method
Adam
quasi-Newton method
51. THE STRUCTURE OF NN
input layer : 1 neuron
hidden layer : 3 neurons
output layer : 1 neuron
activation function of hidden layer : sigmoid function
activation function of output layer : identity function
optimization method : each compared method