
Hyperparameter optimization with approximate gradient


ICML talk on the paper "Hyperparameter optimization with approximate gradient"

Published in: Science


  1. HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
     Fabian Pedregosa, Chaire Havas-Dauphine, Paris-Dauphine / École Normale Supérieure
  2. HYPERPARAMETERS
     Most machine learning models depend on at least one hyperparameter to control model complexity. Examples include: amount of regularization, kernel parameters, architecture of a neural network.
     Model parameters: estimated using some (regularized) goodness of fit on the data.
     Hyperparameters: cannot be estimated using the same criteria as model parameters (overfitting).
  3. HYPERPARAMETER SELECTION
     Criteria for hyperparameter selection: optimize the loss on unseen data (cross-validation), or minimize a risk estimator (SURE, AIC/BIC, etc.).
     Example: least squares with $\ell_2$ regularization, where loss $= \sum_{i=1}^n (b_i - a_i^\top X(\lambda))^2$.
     The evaluation function is costly and non-convex. Common methods: grid search, random search, SMBO.
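To make the cost of the evaluation function concrete, here is a minimal sketch of the least-squares example, assuming synthetic data, a closed-form ridge fit, and a plain grid search. The names `ridge_solution` and `validation_loss` are illustrative and do not come from the talk. Every evaluation of the loss at a new λ requires refitting X(λ).

```python
import numpy as np

def ridge_solution(A, b, lam):
    """Minimizer X(lam) of ||A x - b||^2 + lam * ||x||^2 (closed form)."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)

def validation_loss(lam, A_train, b_train, A_val, b_val):
    """Held-out loss sum_i (b_i - a_i^T X(lam))^2; each call refits the model."""
    x = ridge_solution(A_train, b_train, lam)
    return float(np.sum((b_val - A_val @ x) ** 2))

# Synthetic data (assumption: Gaussian design, Gaussian noise).
rng = np.random.default_rng(0)
x_true = rng.normal(size=20)
A_train = rng.normal(size=(30, 20))
b_train = A_train @ x_true + rng.normal(size=30)
A_val = rng.normal(size=(30, 20))
b_val = A_val @ x_true + rng.normal(size=30)

# Grid search: one full model fit per candidate value of lam.
grid = np.logspace(-3, 3, 13)
losses = [validation_loss(lam, A_train, b_train, A_val, b_val) for lam in grid]
best_lam = float(grid[int(np.argmin(losses))])
```

With more than one hyperparameter the grid grows exponentially, which is the motivation for the gradient-based approach in the following slides.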
  4. GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
     Compute gradients with respect to hyperparameters [Larsen 1996, 1998, Bengio 2000].
     Hyperparameter optimization as a nested or bi-level optimization problem:
     $\arg\min_{\lambda \in \mathcal{D}} f(\lambda) \triangleq g(X(\lambda), \lambda)$ (loss on test set)
     s.t. $X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} h(x, \lambda)$ ($X(\lambda)$: model parameters; $h$: loss on train set)
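  5. GOAL: COMPUTE ∇f(λ)
     By the chain rule, $\nabla f = \frac{\partial g}{\partial \lambda} + \frac{\partial g}{\partial X} \cdot \frac{\partial X}{\partial \lambda}$, where $\partial g / \partial \lambda$ and $\partial g / \partial X$ are known and $\partial X / \partial \lambda$ is unknown.
     Two main approaches: implicit differentiation and iterative differentiation [Domke et al. 2012, Maclaurin 2015].
     Implicit differentiation [Larsen 1996, Bengio 2000]: formulate the inner optimization as an implicit equation for $X$:
     $X(\lambda) \in \arg\min_x h(x, \lambda) \iff \nabla_1 h(X(\lambda), \lambda) = 0$.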
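The step from the implicit equation to a gradient formula can be written out as follows. This is a sketch that assumes $h$ is twice differentiable and that $\nabla_1^2 h$ is invertible at the solution; the symbol $DX(\lambda)$ for the Jacobian of $X$ is introduced here for illustration.

```latex
% Differentiate the implicit equation \nabla_1 h(X(\lambda), \lambda) = 0 with respect to \lambda:
\nabla_1^2 h \; DX(\lambda) + \nabla_{1,2}^2 h = 0
\quad\Longrightarrow\quad
DX(\lambda) = -\left(\nabla_1^2 h\right)^{-1} \nabla_{1,2}^2 h .

% Substitute into the chain rule, using the symmetry of the Hessian \nabla_1^2 h:
\nabla f
  = \nabla_2 g + DX(\lambda)^{\top} \nabla_1 g
  = \nabla_2 g - \left(\nabla_{1,2}^2 h\right)^{\top} \left(\nabla_1^2 h\right)^{-1} \nabla_1 g .
```

All derivatives are evaluated at $(X(\lambda), \lambda)$.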
  6. GRADIENT-BASED HYPERPARAMETER OPTIMIZATION
     $\nabla f = \nabla_2 g - (\nabla^2_{1,2} h)^\top (\nabla^2_1 h)^{-1} \nabla_1 g$
     It is possible to compute the gradient w.r.t. hyperparameters, given the solution $X(\lambda)$ to the inner optimization and the solution $(\nabla^2_1 h)^{-1} \nabla_1 g$ to a linear system ⟹ computationally expensive.
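As a hedged illustration of the implicit-differentiation gradient, the sketch below evaluates it for ridge regression and compares it with a central finite difference. The choice of model, the 0.5 scaling of both losses, the synthetic data and all function names are assumptions made for this example.

```python
import numpy as np

def ridge_hypergradient(A_tr, b_tr, A_val, b_val, lam):
    """Exact hypergradient for
    h(x, lam) = 0.5*||A_tr x - b_tr||^2 + 0.5*lam*||x||^2,
    g(x, lam) = 0.5*||A_val x - b_val||^2.
    """
    p = A_tr.shape[1]
    H = A_tr.T @ A_tr + lam * np.eye(p)       # nabla_1^2 h
    x = np.linalg.solve(H, A_tr.T @ b_tr)     # inner solution X(lam)
    grad1_g = A_val.T @ (A_val @ x - b_val)   # nabla_1 g
    q = np.linalg.solve(H, grad1_g)           # solution of the linear system
    # Here nabla_2 g = 0 and nabla_{1,2}^2 h = x, so the gradient is -x^T q.
    return float(-x @ q)

def outer_loss(A_tr, b_tr, A_val, b_val, lam):
    """f(lam) = g(X(lam), lam)."""
    p = A_tr.shape[1]
    x = np.linalg.solve(A_tr.T @ A_tr + lam * np.eye(p), A_tr.T @ b_tr)
    return 0.5 * float(np.sum((A_val @ x - b_val) ** 2))

# Synthetic data (assumption).
rng = np.random.default_rng(0)
x_true = rng.normal(size=10)
A_tr = rng.normal(size=(40, 10))
b_tr = A_tr @ x_true + rng.normal(size=40)
A_val = rng.normal(size=(40, 10))
b_val = A_val @ x_true + rng.normal(size=40)

lam, delta = 0.5, 1e-5
grad = ridge_hypergradient(A_tr, b_tr, A_val, b_val, lam)
finite_diff = (outer_loss(A_tr, b_tr, A_val, b_val, lam + delta)
               - outer_loss(A_tr, b_tr, A_val, b_val, lam - delta)) / (2 * delta)
```

Both `np.linalg.solve` calls are exact solves. They are the expensive part that an approximate-gradient method replaces with inexact solutions.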
  7. HOAG: HYPERPARAMETER OPTIMIZATION WITH APPROXIMATE GRADIENT
     Replace $X(\lambda)$ by an approximate solution of the inner optimization. Approximately solve the linear system. Update $\lambda$ using $p_k \approx \nabla f$.
     Tradeoff. Loose approximation: cheap iterations, might diverge. Precise approximation: costly iterations, convergence to a stationary point.
  8. HOAG
     At iteration $k = 1, 2, \dots$ perform the following:
     i) Solve the inner optimization problem up to tolerance $\varepsilon_k$, i.e. find $x_k \in \mathbb{R}^p$ such that $\|X(\lambda_k) - x_k\| \le \varepsilon_k$.
     ii) Solve the linear system up to tolerance $\varepsilon_k$. That is, find $q_k$ such that $\|\nabla^2_1 h(x_k, \lambda_k) q_k - \nabla_1 g(x_k, \lambda_k)\| \le \varepsilon_k$.
     iii) Compute the approximate gradient $p_k$ as $p_k = \nabla_2 g(x_k, \lambda_k) - \nabla^2_{1,2} h(x_k, \lambda_k)^\top q_k$.
     iv) Update hyperparameters: $\lambda_{k+1} = P_{\mathcal{D}}(\lambda_k - \frac{1}{L} p_k)$.
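The four steps of the iteration can be sketched for a single ridge hyperparameter as below. This is an illustration under stated assumptions, not the author's implementation: the inner solver (gradient descent), the linear solver (a hand-written conjugate gradient), the step constant `L = 1.0`, the domain `[1e-3, 1e3]`, the exponential tolerance schedule and the synthetic data are all choices made for this example.

```python
import numpy as np

def solve_inner(A, b, lam, x0, eps, max_iter=100_000):
    """Step i: gradient descent on h(x, lam) = 0.5*||Ax-b||^2 + 0.5*lam*||x||^2.
    h is lam-strongly convex, so ||grad h(x)|| <= lam*eps implies ||X(lam) - x|| <= eps.
    """
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam)
    x = x0.copy()
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b) + lam * x
        if np.linalg.norm(grad) <= lam * eps:
            break
        x = x - step * grad
    return x

def solve_linear(H, rhs, q0, eps, max_iter=1000):
    """Step ii: conjugate gradient until the residual ||H q - rhs|| <= eps."""
    q = q0.copy()
    r = rhs - H @ q
    d = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eps:
            break
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)
        q = q + alpha * d
        r_new = r - alpha * Hd
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return q

def hoag_ridge(A_tr, b_tr, A_val, b_val, lam0, n_iter=60, L=1.0, bounds=(1e-3, 1e3)):
    p_dim = A_tr.shape[1]
    lam, x, q = lam0, np.zeros(p_dim), np.zeros(p_dim)
    history = []
    for k in range(1, n_iter + 1):
        eps = 0.1 * 0.9 ** k                        # exponential tolerance schedule
        x = solve_inner(A_tr, b_tr, lam, x, eps)    # i) warm-started inner solve
        H = A_tr.T @ A_tr + lam * np.eye(p_dim)     # nabla_1^2 h(x_k, lam_k)
        grad1_g = A_val.T @ (A_val @ x - b_val)     # nabla_1 g(x_k, lam_k)
        q = solve_linear(H, grad1_g, q, eps)        # ii) inexact linear solve
        p = float(-x @ q)                           # iii) nabla_2 g = 0, nabla_{1,2}^2 h = x_k
        history.append((lam, p, eps))
        lam = float(np.clip(lam - p / L, *bounds))  # iv) projected gradient step
    return lam, history

# Synthetic data (assumption).
rng = np.random.default_rng(0)
x_true = rng.normal(size=10)
A_tr = rng.normal(size=(40, 10))
b_tr = A_tr @ x_true + 0.5 * rng.normal(size=40)
A_val = rng.normal(size=(40, 10))
b_val = A_val @ x_true + 0.5 * rng.normal(size=40)

lam_final, history = hoag_ridge(A_tr, b_tr, A_val, b_val, lam0=20.0)
```

The inner iterate `x` and the linear-system iterate `q` are warm-started from the previous outer iteration, so early iterations with loose tolerances stay cheap and later ones refine an already good starting point.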
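  9. ANALYSIS - GLOBAL CONVERGENCE
     Assumptions:
     (A1). $\nabla g$ and $\nabla^2 h$ are Lipschitz.
     (A2). $\nabla^2_1 h(X(\lambda), \lambda)$ is non-singular.
     (A3). The domain $\mathcal{D}$ is bounded.
     Corollary: If $\sum_{i=1}^{\infty} \varepsilon_i < \infty$, then $\lambda_k$ converges to a stationary point $\lambda^*$: $\langle \nabla f(\lambda^*), \alpha - \lambda^* \rangle \ge 0$ for all $\alpha \in \mathcal{D}$ ⟹ if $\lambda^*$ is in the interior of $\mathcal{D}$ then $\nabla f(\lambda^*) = 0$.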
  10. EXPERIMENTS
      How to choose the tolerance $\varepsilon_k$? Different strategies for the tolerance decrease. Quadratic: $\varepsilon_k = 0.1 / k^2$. Cubic: $\varepsilon_k = 0.1 / k^3$. Exponential: $\varepsilon_k = 0.1 \times 0.9^k$.
      Approximate-gradient strategies achieve much faster decrease in early iterations.
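The three tolerance schedules can be written down directly. The sketch below also computes partial sums, since the convergence corollary asks for a summable tolerance sequence. The function names and the cutoff of 10,000 terms are illustrative choices.

```python
import numpy as np

def quadratic(k):
    return 0.1 / k ** 2

def cubic(k):
    return 0.1 / k ** 3

def exponential(k):
    return 0.1 * 0.9 ** k

k = np.arange(1, 10_001, dtype=float)
partial_sums = {
    "quadratic": float(np.sum(quadratic(k))),
    "cubic": float(np.sum(cubic(k))),
    "exponential": float(np.sum(exponential(k))),
}
```

All three series converge (to 0.1·π²/6, 0.1·ζ(3) and 0.9 respectively), so each schedule satisfies the summability condition. They differ in how quickly the per-iteration cost of the inner solves grows.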
  11. EXPERIMENTS I
      Model: $\ell_2$-regularized logistic regression. 1 hyperparameter.
      Datasets: 20news (18k × 130k), real-sim (73k × 20k).
  12. EXPERIMENTS II
      Kernel ridge regression. 2 hyperparameters. Parkinson dataset: 654 × 17.
      Multinomial logistic regression with one hyperparameter per feature [Maclaurin et al. 2015]. 784 × 10 hyperparameters. MNIST dataset: 60k × 784.
  13. CONCLUSION
      Hyperparameter optimization with inexact gradient: can update hyperparameters before model parameters have fully converged; is independent of the inner optimization algorithm; has convergence guarantees under smoothness assumptions.
      Open questions: non-smooth inner optimization (e.g. sparse models)? Stochastic / online approximation?
  14. REFERENCES
      [Y. Bengio, 2000] Bengio, Yoshua. "Gradient-based optimization of hyperparameters." Neural Computation 12.8 (2000): 1889-1900.
      [J. Bergstra, Y. Bengio 2012] Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
      [J. Snoek et al., 2015] Snoek, J. et al. Scalable Bayesian Optimization Using Deep Neural Networks. (2015). http://arxiv.org/abs/1502.05700
      [K. Swersky et al., 2014] Swersky, K., Snoek, J. & Adams, R. Freeze-Thaw Bayesian Optimization. arXiv preprint arXiv:1406.3896, 1–12 (2014). http://arxiv.org/abs/1406.3896
      [F. Hutter et al., 2013] Hutter, F., Hoos, H. & Leyton-Brown, K. An evaluation of sequential model-based optimization for expensive blackbox functions.
  15. REFERENCES 2
      [M. Schmidt et al., 2013] Schmidt, M., Le Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 1–45 (2013). http://arxiv.org/abs/1309.2388
      [J. Domke et al., 2012] Domke, J. Generic Methods for Optimization-Based Modeling. Proc. Fifteenth Int. Conf. Artif. Intell. Stat. XX, 318–326 (2012).
      [M. P. Friedlander et al., 2012] Friedlander, M. P. & Schmidt, M. Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM J. Sci. Comput. 34, A1380–A1405 (2012).
  16. EXPERIMENTS - COST FUNCTION
  17. EXPERIMENTS
      Comparison with other hyperparameter optimization methods. Random = random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
  18. EXPERIMENTS
      Comparison in terms of validation loss. Random = random search, SMBO = Sequential Model-Based Optimization (Gaussian process), Iterdiff = reverse-mode differentiation.
