2017-07, Research Seminar at Keio University, Metric Perspective of Stochastic Optimizers

1. Metric Perspective of Stochastic Optimizers
Asahi Ushio (ushio@ykw.elec.keio.ac.jp)
Dept. of Electronics and Electrical Engineering, Keio University, Japan
Internal Research Seminar, July 19, 2017
Note: the author moved to Cogent Labs as a researcher in April 2018.

2. Overview
1 Preliminaries
2 Hessian Approximation
   - Finite Difference Method
   - Root Mean Square
   - Extended Gauss-Newton: Preliminaries
3 Natural Gradient
4 Experiment
5 Conclusion
6 References

3. Preliminaries: Outline
This talk explains stochastic optimizers in machine learning from the perspective of the metric each one uses. Most stochastic optimizers fall into one of three metric types:
1 Quasi-Newton type
   - Finite Difference Method (FDM): SGD-QN [1], AdaDelta [2], VSGD [3, 4]
   - Extended Gauss-Newton: KSD [5], SMD [6], HF [7]
   - LBFGS: stochastic LBFGS [8, 9], RES [10]
2 Natural Gradient type: Natural Gradient [11], TONGA [12, 13]
3 Root Mean Square (RMS) type: AdaGrad [14], RMSprop [15], Adam [16]

4. Preliminaries: Overview of Stochastic Algorithms
Figure: taxonomy of stochastic optimizers by metric.
- Quasi-Newton (Hessian) approximation
   - FDM-based: SGD-QN, vSGD, AdaDelta
   - BFGS-based: Online LBFGS, RES, Stochastic LBFGS
   - Extended Gauss-Newton: Hessian-Free, KSD (Krylov Subspace Descent), SMD (Stochastic Meta Descent)
- Natural Gradient (Fisher Information Matrix) approximation: TONGA, Fast Natural Gradient
- RMS: AdaGrad, RMSProp, Adam
- Proportionate metric: PNLMS, IPNLMS

5. Preliminaries: Problem Setting
Model: for an input $x_t \in \mathbb{R}^n$, the output $\hat{y}_t \in \mathbb{R}$ is derived by
\[ \text{Activation}: \hat{y}_t = M(z_t) \quad (1) \]
\[ \text{Output}: z_t = N_w(x_t) \in \mathbb{R}. \quad (2) \]
Loss: with instantaneous loss function $l_t(w)$ of parameter $w$,
\[ L(w) = E[l_t(w)]_t. \quad (3) \]
Problem settings:
- Regression: $M(z) = z$, $N_w(x) = w^T x$, $l_t(\hat{y}_t) = \|\hat{y}_t - y_t\|_2^2$, $y_t \in \mathbb{R}$
- Classification: $M(z) = \frac{1}{1+e^{-z}}$, $N_w(x) = w^T x$, $l_t(\hat{y}_t) = -[y_t \log \hat{y}_t + (1-y_t)\log(1-\hat{y}_t)]$, $y_t \in \{0,1\}$
- Multi-class classification: $M(z)_i = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}$, $N_{w_i}(x) = w_i^T x$, $l_t(\hat{y}_t) = -\sum_{i=1}^k y_{t,i}\log \hat{y}_i$, $y_{t,i} \in \{0,1\}$

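As a concrete reference point, a minimal Python sketch of the three problem settings above for a linear model $z = w^T x$; the function names are mine and not part of the slides.

```python
import numpy as np

# Sketch of the three problem settings on slide 5 for a linear model z = w^T x.
def predict_regression(w, x):
    return w @ x                                   # M(z) = z

def loss_regression(y_hat, y):
    return (y_hat - y) ** 2                        # squared error

def predict_classification(w, x):
    return 1.0 / (1.0 + np.exp(-(w @ x)))          # M(z) = sigmoid(z)

def loss_classification(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def predict_multiclass(W, x):                      # W holds one row w_i per class
    z = W @ x
    e = np.exp(z - z.max())                        # numerically stabilized softmax
    return e / e.sum()

def loss_multiclass(y_hat, y_onehot):
    return -np.sum(y_onehot * np.log(y_hat))
```
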
6. Preliminaries: Brief Illustration of Model
Figure: neural network (input, hidden layer, activation, output).
Figure: linear model (input, activation, output).

7. Preliminaries: Stochastic Optimization
Purpose: find $\hat{w} = \arg\min_w L(w)$ by stochastic approximation.
SGD (Stochastic Gradient Descent) and variants:
\[ w_t = w_{t-1} - \eta_t g_t, \quad \eta_t \in \mathbb{R} \ \text{s.t.}\ \lim_{t\to\infty}\eta_t = 0 \ \text{and}\ \lim_{t\to\infty}\sum_{i=1}^{t}\eta_i = \infty \quad (4) \]
- Vanilla SGD: $g_t = \frac{d}{dw} l_t(w_{t-1})$
- Momentum: $g_t = \gamma g_{t-1} + (1-\gamma)\frac{d}{dw} l_t(w_{t-1})$, $\gamma \in \mathbb{R}$
- NAG: $g_t = \gamma g_{t-1} + (1-\gamma)\frac{d}{dw} l_t(w_{t-1} - \gamma g_{t-1})$
- Minibatch: $g_t = \frac{1}{I}\sum_{i=0}^{I-1}\frac{d}{dw} l_{t-i}(w_{t-1})$, $I \in \mathbb{N}$.
Suppose $l_t(w)$ is differentiable. A sketch of these update rules follows.

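A minimal sketch of these variants, assuming a user-supplied stochastic-gradient callback `grad(w, t)`; the function name, defaults, and variant switch are my own additions, and a constant learning rate is used for brevity even though (4) requires a decaying one.

```python
import numpy as np

def sgd_variants(grad, w0, eta=0.01, gamma=0.9, batch=10, n_steps=1000,
                 variant="vanilla"):
    """Sketch of the SGD update rules on slide 7.

    grad(w, t) is assumed to return the stochastic gradient d/dw l_t(w).
    """
    w = np.array(w0, dtype=float)
    g = np.zeros_like(w)                    # running direction g_t
    for t in range(1, n_steps + 1):
        if variant == "vanilla":
            g = grad(w, t)
        elif variant == "momentum":
            g = gamma * g + (1 - gamma) * grad(w, t)
        elif variant == "nag":              # gradient at the look-ahead point
            g = gamma * g + (1 - gamma) * grad(w - gamma * g, t)
        elif variant == "minibatch":        # average of I stochastic gradients
            g = np.mean([grad(w, t - i) for i in range(batch)], axis=0)
        w -= eta * g                        # w_t = w_{t-1} - eta_t * g_t
    return w
```
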
8. Hessian Approximation: Quasi-Newton Method
The Newton method employs the Hessian matrix:
\[ w_t = w_{t-1} - B_t g_t, \qquad B_t = H_t^{-1}, \qquad H_t = \frac{d^2 L(w_{t-1})}{dw^2}. \quad (5) \]
Quasi-Newton methods employ a Hessian approximation $\hat{H}_t$ instead of $H_t$:
1 FDM (Finite Difference Method): SGD-QN [1], AdaDelta [2], VSGD [3, 4]
2 Extended Gauss-Newton approximation
3 LBFGS
A diagonal approximation is often used in stochastic optimization:
\[ \hat{H}_t = \mathrm{diag}(h_{1,t}, \ldots, h_{n,t}) \quad (6) \]

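With the diagonal approximation (6), the quasi-Newton step (5) reduces to element-wise scaling of the gradient; a one-function sketch (the eps guard and function name are my additions):

```python
import numpy as np

def diag_newton_step(w, g, h_diag, eps=1e-8):
    """One step of eq. (5) with B_t = diag(1/h_1, ..., 1/h_n) as in eq. (6).
    eps guards against division by near-zero curvature estimates."""
    return w - g / (np.abs(h_diag) + eps)
```
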
9. Hessian Approximation, Finite Difference Method: SGD-QN
SGD-QN [1] employs an instantaneous estimator of the Hessian:
\[ \frac{1}{h_{i,t}} := \eta_t\, E\!\left[\frac{1}{h^{FDM}_{i,\tau}}\right]_{\tau=1}^{t} \quad (7) \]
\[ E\!\left[\frac{1}{h^{FDM}_{i,\tau}}\right]_{\tau=1}^{t} \approx \frac{1}{\bar{h}_{i,t}} := \alpha\frac{1}{h^{FDM}_{i,t}} + (1-\alpha)\frac{1}{\bar{h}_{i,t-1}} \quad (8) \]
\[ h^{FDM}_{i,t} := \frac{g_{i,t} - g_{i,t-1}}{w_{i,t-1} - w_{i,t-2}} \quad (9) \]
The FDM approximation (9) is called the secant condition in the quasi-Newton context.

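A sketch of the running FDM curvature estimate in (8)-(9), assuming the gradient and parameter trajectories are available as arrays; the eps guard and the absolute value used to keep the estimate positive are my additions.

```python
import numpy as np

def fdm_diag_hessian(grads, weights, alpha=0.1, eps=1e-8):
    """Running diagonal curvature estimate from the secant condition (9),
    smoothed as in (8). grads[t] and weights[t] are vectors at step t."""
    inv_h_bar = None
    for t in range(2, len(grads)):
        dg = grads[t] - grads[t - 1]                        # g_t - g_{t-1}
        dw = weights[t - 1] - weights[t - 2]                # w_{t-1} - w_{t-2}
        h_fdm = dg / np.where(np.abs(dw) < eps, eps, dw)    # secant estimate (9)
        inv_h = 1.0 / np.maximum(np.abs(h_fdm), eps)        # keep curvature positive
        inv_h_bar = inv_h if inv_h_bar is None else alpha * inv_h + (1 - alpha) * inv_h_bar
    return 1.0 / inv_h_bar
```
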
10. Hessian Approximation, Finite Difference Method: AdaDelta
The SGD update (5) can be reformulated as
\[ B_t = -(w_t - w_{t-1})\, g_t^{-T}. \quad (10) \]
AdaDelta [2] approximates the Hessian by (10):
\[ h_{i,t} := E\!\left[-\frac{g_{i,\tau}}{w_{i,\tau} - w_{i,\tau-1}}\right]_{\tau=1}^{t} \approx \frac{\mathrm{RMS}[g_{i,t}]}{\mathrm{RMS}[w_{i,t-1} - w_{i,t-2}]}. \quad (11) \]
Here $w_{i,t} - w_{i,t-1}$ is not known at time $t$, so it is approximated by $w_{i,t-1} - w_{i,t-2}$. With numerical stability parameter $\epsilon$ (a sensitive parameter in practice):
\[ \mathrm{RMS}[g_t] := \sqrt{E[g_\tau^2]_{\tau=1}^{t} + \epsilon} \quad (12) \]
\[ E[g_\tau^2]_{\tau=1}^{t} \approx \bar{g}_t := \alpha g_t^2 + (1-\alpha)\bar{g}_{t-1} \quad (13) \]

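A sketch of the resulting AdaDelta update, combining (11)-(13); `grad(w, t)` is an assumed callback and the default alpha and eps are illustrative only.

```python
import numpy as np

def adadelta(grad, w0, alpha=0.05, eps=1e-6, n_steps=1000):
    """Sketch of the AdaDelta update built from eqs. (11)-(13)."""
    w = np.array(w0, dtype=float)
    eg2 = np.zeros_like(w)    # running E[g^2]
    ed2 = np.zeros_like(w)    # running E[(delta w)^2]
    for t in range(1, n_steps + 1):
        g = grad(w, t)
        eg2 = alpha * g**2 + (1 - alpha) * eg2
        # step = RMS[previous updates] / RMS[g] * g, i.e. g divided by h_{i,t} in (11)
        step = np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps) * g
        ed2 = alpha * step**2 + (1 - alpha) * ed2
        w -= step
    return w
```
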
11. Hessian Approximation, Finite Difference Method: VSGD, Quadratic Loss
Quadratic approximation of the loss. A Taylor expansion gives
\[ l_t(w) \approx l_t(a) + \frac{dl_t(a)}{dw}^{T}(w-a) + \frac{1}{2}(w-a)^{T}\frac{d^2 l_t(a)}{dw^2}(w-a), \quad \forall a \in \mathbb{R}^n. \]
For $\hat{w}_t = \arg\min_{w \in \mathbb{R}^n} l_t(w)$,
\[ l_t(\hat{w}_t) = 0, \qquad \frac{dl_t(\hat{w}_t)}{dw} = 0. \quad (14) \]
Then $l_t(w)$ can be locally approximated by the quadratic function
\[ l_t(w) \approx \frac{1}{2}(w - \hat{w}_t)^{T}\frac{d^2 l_t(\hat{w}_t)}{dw^2}(w - \hat{w}_t). \quad (15) \]

12. Hessian Approximation, Finite Difference Method: VSGD, Noisy Quadratic Loss
Noisy quadratic approximation of the loss with a diagonal Hessian approximation. Suppose $\hat{w}_t \sim N(\hat{w}, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2))$:
\[ l_t^q(w) := \frac{1}{2}(w - \hat{w}_t)^{T}\hat{H}_t(w - \hat{w}_t) = \frac{1}{2}\sum_{i=1}^{n} h_{i,t}(w_i - \hat{w}_{i,t})^2 \quad (16) \]
SGD for $l_t^q$ with element-wise learning rate $\eta_{i,t}$:
\[ w_{i,t} = w_{i,t-1} - \eta_{i,t}\frac{dl_t^q(w_{t-1})}{dw_i} = w_{i,t-1} - \eta_{i,t} h_{i,t}(w_{i,t-1} - \hat{w}_{i,t}) = w_{i,t-1} - \eta_{i,t} h_{i,t}(w_{i,t-1} - \hat{w}_i + u_i), \quad u_{i,t} \sim N(0, \sigma_i^2). \quad (17) \]

13. Hessian Approximation, Finite Difference Method: VSGD, Adaptive Learning Rate
Greedy optimal learning rate. VSGD (variance SGD) [3, 4] chooses the learning rate that minimizes the conditional expectation of the loss:
\[ \eta_{i,t} := \arg\min_{\eta}\, E\!\left[l_t^q(w_{i,t}) \mid w_{i,t-1}\right] = \arg\min_{\eta}\, E\!\left[\left(w_{i,t-1} - \eta\, h_{i,t}(w_{i,t-1} - \hat{w}_i + u_i) - \hat{w}_{i,t}\right)^2\right] = \frac{1}{h_{i,t}}\,\frac{(w_{i,t-1} - \hat{w}_i)^2}{(w_{i,t-1} - \hat{w}_i)^2 + \sigma_i^2} \quad (18) \]

14. Hessian Approximation, Finite Difference Method: VSGD, Variance Approximation
In practice $\hat{w}$ is unknown, so the quantities are approximated by
\[ (w_{i,t-1} - \hat{w}_i)^2 = \left(E[w_{i,\tau} - \hat{w}_{i,\tau}]_{\tau=1}^{t-1}\right)^2 = \left(E[g_{i,\tau}]_{\tau=1}^{t-1}\right)^2 \quad (19) \]
\[ (w_{i,t-1} - \hat{w}_i)^2 + \sigma_i^2 = E\!\left[(w_{i,\tau} - \hat{w}_{i,\tau})^2\right]_{\tau=1}^{t-1} = E\!\left[g_{i,\tau}^2\right]_{\tau=1}^{t-1} \quad (20) \]
where
\[ E[g_{i,\tau}]_{\tau=1}^{t} \approx \bar{g}_{i,t} := \alpha_{i,t}\, g_{i,t} + (1-\alpha_{i,t})\bar{g}_{i,t-1} \quad (21) \]
\[ E\!\left[g_{i,\tau}^2\right]_{\tau=1}^{t} \approx \bar{v}_{i,t} := \alpha_{i,t}\, g_{i,t}^2 + (1-\alpha_{i,t})\bar{v}_{i,t-1}. \quad (22) \]

15. Hessian Approximation, Finite Difference Method: VSGD, Hessian Approximation
The Hessian is approximated by $h_{i,t} := E[h_{i,\tau}]_{\tau=1}^{t}$. Two approximations of $E[h_{i,\tau}]_{\tau=1}^{t}$ based on FDM:
\[ \text{(Scheme 1)}\quad \bar{h}_{i,t} := \alpha_{i,t}\hat{h}_{i,t} + (1-\alpha_{i,t})\bar{h}_{i,t-1} \quad (23) \]
\[ \text{(Scheme 2)}\quad \bar{h}_{i,t} := \frac{E\!\left[\hat{h}_{i,t}^2\right]}{E\!\left[\hat{h}_{i,t}\right]} \quad (24) \]
\[ E\!\left[\hat{h}_{i,t}^2\right] \approx v_{i,t} := \alpha_{i,t}\hat{h}_{i,t}^2 + (1-\alpha_{i,t})v_{i,t-1}, \qquad E\!\left[\hat{h}_{i,t}\right] \approx m_{i,t} := \alpha_{i,t}\hat{h}_{i,t} + (1-\alpha_{i,t})m_{i,t-1} \]
where
\[ \hat{h}_{i,t} := \frac{g_{i,t} - dl_t(w_{i,t} + \bar{g}_{i,t})/dw}{\bar{g}_{i,t}}. \quad (25) \]

16. Hessian Approximation, Finite Difference Method: VSGD, Weight Decay
Weight sequence: the averaging weight is updated by the following heuristic rule:
\[ \alpha_{i,t} := \left(1 - \frac{\bar{g}_{i,t}^2}{\bar{v}_{i,t}}\right)\alpha_{i,t-1} + 1. \quad (26) \]
A sketch assembling the full VSGD update follows.

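A sketch pulling the VSGD pieces (18), (21)-(23), and (26) together; `grad(w, t)` is an assumed callback, the curvature probe is an absolute-valued finite difference along the smoothed gradient in the spirit of (25), and I read the sequence in (26) as a memory size tau with alpha = 1/tau, which is one way to interpret the heuristic.

```python
import numpy as np

def vsgd(grad, w0, n_steps=1000, eps=1e-12):
    """Sketch of a vSGD-style update assembled from eqs. (18)-(26)."""
    w = np.array(w0, dtype=float)
    g_bar = grad(w, 0)                      # running E[g], eq. (21)
    v_bar = g_bar**2 + eps                  # running E[g^2], eq. (22)
    h_bar = np.ones_like(w)                 # running curvature estimate
    tau = np.full_like(w, 2.0)              # memory size, alpha_{i,t} = 1/tau_{i,t}
    for t in range(1, n_steps + 1):
        g = grad(w, t)
        a = 1.0 / tau
        g_bar = a * g + (1 - a) * g_bar                         # (21)
        v_bar = a * g**2 + (1 - a) * v_bar                      # (22)
        g_shift = grad(w + g_bar, t)                            # probe point, cf. (25)
        h_hat = np.abs(g_shift - g) / (np.abs(g_bar) + eps)     # finite-difference curvature
        h_bar = a * h_hat + (1 - a) * h_bar                     # (23), scheme 1
        eta = (1.0 / (h_bar + eps)) * g_bar**2 / (v_bar + eps)  # optimal rate (18)
        w -= eta * g
        tau = (1 - g_bar**2 / (v_bar + eps)) * tau + 1.0        # memory update, cf. (26)
    return w
```
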
17. Hessian Approximation, Root Mean Square: AdaGrad, RMSprop, Adam
AdaGrad [14], RMSProp [15], and Adam [16] can be summarized as
\[ B_t = \eta_t R_t \quad (27), \qquad R_t := \mathrm{diag}(1/r_{1,t}, \ldots, 1/r_{n,t}) \quad (28), \qquad r_{i,t} := \mathrm{RMS}[g_{i,t}] = \sqrt{E\!\left[g_{i,t}^2\right]} \quad (29) \]
They differ in the approximation of the expectation and in the learning rate:
\[ \text{(AdaGrad)}\quad E\!\left[g_{i,t}^2\right] \approx \frac{1}{t}\sum_{\tau=1}^{t} g_{i,\tau}^2, \qquad \eta_t = \eta/\sqrt{t} \quad (30) \]
\[ \text{(RMSProp)}\quad E\!\left[g_{i,t}^2\right] \approx \bar{v}_{i,t}, \qquad \eta_t = \eta \quad (31) \]
\[ \text{(Adam)}\quad E\!\left[g_{i,t}^2\right] \approx \hat{v}_{i,t} := \frac{\bar{v}_{i,t}}{1-\alpha^t}, \qquad \eta_t = \eta\,\frac{1}{1-\gamma^t} \quad (32) \]
where $\bar{v}_{i,t} := \alpha g_{i,t}^2 + (1-\alpha)\bar{v}_{i,t-1}$.

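A sketch of the three updates; to keep the bias terms unambiguous I use the Adam paper's beta notation (beta1 for the first moment and beta2 for the second, i.e. gamma corresponds to beta1 and 1 - alpha to beta2), and `grad(w, t)` is an assumed callback.

```python
import numpy as np

def rms_family(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
               n_steps=1000, method="adam"):
    """Sketch of the RMS-type updates in eqs. (27)-(32)."""
    w = np.array(w0, dtype=float)
    sum_g2 = np.zeros_like(w)      # AdaGrad accumulator, t * E[g^2] in (30)
    v = np.zeros_like(w)           # running E[g^2] for RMSProp / Adam
    m = np.zeros_like(w)           # momentum for Adam
    for t in range(1, n_steps + 1):
        g = grad(w, t)
        if method == "adagrad":
            sum_g2 += g**2
            w -= eta * g / (np.sqrt(sum_g2) + eps)    # eta/sqrt(t) cancels the 1/t in (30)
        elif method == "rmsprop":
            v = beta2 * v + (1 - beta2) * g**2
            w -= eta * g / (np.sqrt(v) + eps)
        elif method == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)                # removes the (1 - gamma^t) moment bias
            v_hat = v / (1 - beta2**t)
            w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w
```
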
18. Hessian Approximation, Root Mean Square: Adam Moment Bias
For the momentum sequence
\[ g_t := \gamma g_{t-1} + (1-\gamma)\frac{d}{dw}l_t(w_{t-1}) = (1-\gamma)\sum_{i=1}^{t}\gamma^{t-i}\frac{d}{dw}l_i(w_{i-1}), \]
the expectation of $g_t$ includes a moment bias $(1-\gamma^t)$ (assuming each stochastic gradient has the same expectation):
\[ E[g_t]_t = E\!\left[(1-\gamma)\sum_{i=1}^{t}\gamma^{t-i}\frac{d}{dw}l_i(w_{i-1})\right]_t = E\!\left[\frac{d}{dw}l_t(w_{t-1})\right]_t (1-\gamma)\sum_{i=1}^{t}\gamma^{t-i} = E\!\left[\frac{d}{dw}l_t(w_{t-1})\right]_t (1-\gamma^t). \]
Adam's learning rate is designed to remove this moment bias.

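A quick numeric check of this bias, under the simplifying assumption of a constant gradient:

```python
# With a constant gradient g, the momentum accumulator equals g * (1 - gamma^t),
# so dividing by (1 - gamma^t) recovers g exactly.
gamma, g_const, T = 0.9, 2.0, 10
m = 0.0
for t in range(1, T + 1):
    m = gamma * m + (1 - gamma) * g_const
print(m)                        # 2.0 * (1 - 0.9**10) ~ 1.303
print(m / (1 - gamma**T))       # ~ 2.0, bias removed
```
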
19. Hessian Approximation, Extended Gauss-Newton: Preliminaries
Relation to Gauss-Newton: Gauss-Newton is an approximation of the Hessian that is limited to the squared loss function. The extended Gauss-Newton approximation [6] extends it so that it is applicable to any loss function.
Multilayer perceptron model: the loss function (3) can be seen as
\[ L(\hat{y}) = E[l_t(\hat{y}_t)]_t \quad (33) \]
where, as in (1) and (2), $\hat{y}_t = M(z_t)$ (activation) and $z_t = N_w(x_t)$ (output).

20. Hessian Approximation, Extended Gauss-Newton: Hessian Derivation
The Hessian of (33) is
\[ H = \frac{d^2 L(\hat{y})}{dw^2} = \frac{d}{dw}\left\{\frac{dz}{dw}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right)\right\} = \frac{dz}{dw}\frac{d}{dw}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right)^{T} + \frac{d^2 z}{dw^2}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) = \frac{dz}{dw}\frac{dz}{dw}^{T}\frac{d}{dz}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) + \frac{d^2 z}{dw^2}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) = J_N J_N^{T}\frac{d}{dz}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) + \frac{d^2 z}{dw^2}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) \quad (34) \]
where $J_N = \frac{dz}{dw}$ is the Jacobian.

21. Hessian Approximation, Extended Gauss-Newton: General Case and Regression
Extended Gauss-Newton: ignore the second-order derivative term in (34),
\[ H_{GN} = J_N J_N^{T}\frac{d}{dz}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right). \quad (35) \]
Regression: since $\frac{d\hat{y}}{dz} = 1$,
\[ \frac{d}{dz}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) = E_t\!\left[\frac{d}{dz}\frac{d}{d\hat{y}}l_t(\hat{y})\right] = E_t\!\left[\frac{d}{dz}(\hat{y} - y_t)\right] = 1. \quad (36) \]
The extended Gauss-Newton approximation is
\[ H_{GN} = J_N J_N^{T}. \quad (37) \]

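A small numerical check of (37) for the regression setting of slide 5 with a linear model: because z is linear in w, the second-order term dropped in (35) vanishes, so J_N J_N^T equals the exact Hessian. The 0.5 factor in the loss and the helper names are my conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x, y, w = rng.normal(size=n), rng.normal(), rng.normal(size=n)

J_N = x                                   # dz/dw for the linear model z = w^T x
H_gn = np.outer(J_N, J_N)                 # eq. (37)

def numerical_hessian(f, w, h=1e-5):
    """Central-difference Hessian of a scalar function f at w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h * h)
    return H

loss = lambda w: 0.5 * (w @ x - y) ** 2
print(np.allclose(H_gn, numerical_hessian(loss, w), atol=1e-4))   # True
```
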
22. Hessian Approximation, Extended Gauss-Newton: Classification
\[ \frac{d}{dz}\left(\frac{d\hat{y}}{dz}\frac{dL(\hat{y})}{d\hat{y}}\right) = E_t\!\left[\frac{d}{dz}\left(\frac{d\hat{y}}{dz}\frac{d}{d\hat{y}}l_t(\hat{y})\right)\right] = -E_t\!\left[\frac{d}{dz}\left((\hat{y} - \hat{y}^2)\left(\frac{y_t}{\hat{y}} - \frac{1-y_t}{1-\hat{y}}\right)\right)\right] = E_t\!\left[\frac{d}{dz}(\hat{y} - y_t)\right] = \hat{y} - \hat{y}^2 \quad (38) \]
where
\[ \frac{d\hat{y}}{dz} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} - \frac{1}{(1+e^{-z})^2} = \hat{y} - \hat{y}^2, \qquad \frac{d}{d\hat{y}}l_t(\hat{y}) = -\left(\frac{y_t}{\hat{y}} - \frac{1-y_t}{1-\hat{y}}\right). \]
The extended Gauss-Newton approximation is
\[ H_{GN} = (\hat{y} - \hat{y}^2)\, J_N J_N^{T}. \quad (39) \]

23. Hessian Approximation, Extended Gauss-Newton: Multi-class Classification
\[ \frac{d}{dz_i}\left(\frac{d\hat{y}_i}{dz_i}\frac{d}{d\hat{y}_i}l_t(\hat{y})\right) = \frac{d}{dz_i}\left[\left(\frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} - \left(\frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}\right)^{2}\right)\left(-\frac{\sum_{j=1}^{k} e^{z_j}}{e^{z_i}}\right)\right] = \frac{d}{dz_i}\left[\frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} - 1\right] = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} - \left(\frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}\right)^{2} = \hat{y}_i - \hat{y}_i^2 \quad (40) \]
The extended Gauss-Newton approximation is
\[ H_{GN,i} = (\hat{y}_i - \hat{y}_i^2)\, J_{N,i} J_{N,i}^{T}. \quad (41) \]

24. Natural Gradient
Fisher information matrix: suppose the observation $y$ is sampled via
\[ y \sim p(\hat{y}). \quad (42) \]
Then the Fisher information matrix becomes
\[ F = \frac{d\log p(y|\hat{y})}{dw}\frac{d\log p(y|\hat{y})}{dw}^{T}. \quad (43) \]
Natural gradient:
\[ w_t = w_{t-1} - B_t g_t, \qquad B_t = F^{-1}. \quad (44) \]

25. Natural Gradient: Regression
Regression: suppose a Gaussian distribution,
\[ p(y|\hat{y}, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\hat{y}-y)^2}{2\sigma^2}}. \quad (45) \]
Fisher information matrix:
\[ F = \frac{(\hat{y}-y)^2}{\sigma^4}\, J_N J_N^{T} \quad (46) \]
where
\[ \frac{d\log p(y|\hat{y}, \sigma)}{dw} = \frac{d}{dw}\left\{-\frac{(\hat{y}-y)^2}{2\sigma^2}\right\} = -\frac{\hat{y}-y}{\sigma^2}\frac{d\hat{y}}{dw} = -\frac{\hat{y}-y}{\sigma^2}\frac{dz}{dw}\frac{d\hat{y}}{dz} = -\frac{\hat{y}-y}{\sigma^2}\, J_N. \quad (47) \]

26. Natural Gradient: Classification
Classification: suppose a binomial distribution,
\[ p(y|\hat{y}) = \hat{y}^{y}(1-\hat{y})^{1-y}. \quad (48) \]
Fisher information matrix:
\[ F = (y - \hat{y})^2\, J_N J_N^{T} \quad (49) \]
where
\[ \frac{d\log p(y|\hat{y})}{dw} = \frac{dz}{dw}\frac{d\hat{y}}{dz}\frac{d}{d\hat{y}}\left\{y\log\hat{y} + (1-y)\log(1-\hat{y})\right\} = J_N\frac{d\hat{y}}{dz}\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right) = J_N(1-\hat{y})\hat{y}\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right) = J_N(y - \hat{y}). \quad (50) \]

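A sketch of the natural-gradient update (44) for this classification setting, using the per-sample matrix (y - ŷ)² J_N J_N^T from (49) averaged over a batch; the damping term added before the solve is my addition (the slides do not discuss regularizing F).

```python
import numpy as np

def natural_gradient_logreg(X, y, eta=0.5, damping=1e-3, n_epochs=50):
    """Natural-gradient sketch for the logistic model on slide 5 with J_N = x."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        z = X @ w
        y_hat = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (y_hat - y) / n_samples                      # gradient of the log loss
        F = (X * ((y - y_hat) ** 2)[:, None]).T @ X / n_samples   # averaged Fisher, eq. (49)
        w -= eta * np.linalg.solve(F + damping * np.eye(n_features), grad)  # eq. (44)
    return w
```
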
27. Natural Gradient: Multi-class Classification
Multi-class classification: suppose a multinomial distribution,
\[ p(y_1, \ldots, y_k \mid \hat{y}_1, \ldots, \hat{y}_k) = \prod_{i=1}^{k}\hat{y}_i^{\,y_i}. \quad (51) \]
Fisher information matrix:
\[ F_i = \left\{(1-\hat{y}_i)\, y_i\right\}^2 J_{N,i} J_{N,i}^{T} \quad (52) \]
where
\[ \frac{d\log p(y_1, \ldots, y_k \mid \hat{y}_1, \ldots, \hat{y}_k)}{dw_i} = \frac{dz_i}{dw_i}\frac{d\hat{y}_i}{dz_i}\frac{d\sum_{i=1}^{k} y_i\log\hat{y}_i}{d\hat{y}_i} \quad (53) \]
\[ = J_{N,i}\left\{(\hat{y}_i - \hat{y}_i^2)\frac{y_i}{\hat{y}_i}\right\} = J_{N,i}(1-\hat{y}_i)\, y_i. \quad (54) \]

28. Experiment: Settings
Data:
- Regression: synthetic data, 1000 features.
- Classification: MNIST (handwritten digits), 784 features, labels 1-9.
Hyperparameters: grid search; the best-performing hyperparameters are employed.
1 Learning rate.
2 Weight of RMS.
Evaluation:
- Regression: mean squared error.
- Classification: misclassification rate.

29. Experiment: Regression
Figure: mean squared error vs. iteration (0-25000) for SGD, AdaGrad, RMSprop, Adam, AdaDelta, and VSGD.
VSGD performs best even though it is tuning-free.

30. Experiment: Classification
Figure: misclassification rate vs. iteration (0-60000) for SGD, VSGD, AdaGrad, RMSprop, Adam, and AdaDelta.
AdaDelta performs best even though it is tuning-free.

31. Conclusion & Future Work
Conclusion:
- Summarized stochastic optimizers from the viewpoint of their metrics: quasi-Newton type, RMS type, and natural gradient type.
- Conducted brief experiments (classification and regression).
- Observed the efficacy of tuning-free algorithms.

32. References
[1] Antoine Bordes, Léon Bottou, and Patrick Gallinari, "SGD-QN: Careful quasi-Newton stochastic gradient descent," Journal of Machine Learning Research, vol. 10, pp. 1737-1754, 2009.
[2] Matthew D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[3] Tom Schaul, Sixin Zhang, and Yann LeCun, "No more pesky learning rates," in International Conference on Machine Learning, 2013, pp. 343-351.
[4] Tom Schaul and Yann LeCun, "Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients," in International Conference on Learning Representations, Scottsdale, AZ, 2013.
[5] Oriol Vinyals and Daniel Povey, "Krylov subspace descent for deep learning," in Artificial Intelligence and Statistics, 2012, pp. 1261-1268.

33. References (continued)
[6] Nicol N. Schraudolph, "Fast curvature matrix-vector products for second-order gradient descent," Neural Computation, vol. 14, no. 7, pp. 1723-1738, 2002.
[7] James Martens, "Deep learning via Hessian-free optimization," in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 735-742.
[8] Richard H. Byrd, Samantha L. Hansen, Jorge Nocedal, and Yoram Singer, "A stochastic quasi-Newton method for large-scale optimization," SIAM Journal on Optimization, vol. 26, no. 2, pp. 1008-1031, 2016.
[9] Nicol N. Schraudolph, Jin Yu, and Simon Günter, "A stochastic quasi-Newton method for online convex optimization," in Artificial Intelligence and Statistics, 2007, pp. 436-443.
[10] Aryan Mokhtari and Alejandro Ribeiro, "RES: Regularized stochastic BFGS algorithm," IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6089-6104, 2014.

34. References (continued)
[11] Shun-Ichi Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251-276, 1998.
[12] Nicolas L. Roux, Pierre-Antoine Manzagol, and Yoshua Bengio, "Topmoumoute online natural gradient algorithm," in Advances in Neural Information Processing Systems, 2008, pp. 849-856.
[13] Nicolas L. Roux and Andrew W. Fitzgibbon, "A fast natural Newton method," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 623-630.
[14] John Duchi, Elad Hazan, and Yoram Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.

35. References (continued)
[15] Tijmen Tieleman and Geoffrey Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.
[16] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015, pp. 1-13.
