
On the Convergence of Single-Call Stochastic Extra-Gradient Methods

Slides for the paper "On the Convergence of Single-Call Stochastic Extra-Gradient Methods" appearing in NeurIPS 2019
https://arxiv.org/abs/1908.08465


  1. On the Convergence of Single-Call Stochastic Extra-Gradient Methods
     Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
     NeurIPS, December 2019
  2. Outline
     1. Variational Inequality
     2. Extra-Gradient
     3. Single-Call Extra-Gradient [Main Focus]
     4. Conclusion
  3. Variational Inequality
  4. Introduction: Variational Inequalities in Machine Learning
     Generative adversarial networks (GANs):
     $$\min_\theta \max_\varphi \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D_\varphi(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D_\varphi(G_\theta(z)))].$$
     More min-max (saddle point) problems: distributionally robust learning, primal-dual formulations in optimization, ...
     Search for equilibria: games, multi-agent reinforcement learning, ...
  5. Definition
     Stampacchia variational inequality: find $x^\star \in \mathcal{X}$ such that
     $$\langle V(x^\star), x - x^\star \rangle \ge 0 \quad \text{for all } x \in \mathcal{X}. \tag{SVI}$$
     Minty variational inequality: find $x^\star \in \mathcal{X}$ such that
     $$\langle V(x), x - x^\star \rangle \ge 0 \quad \text{for all } x \in \mathcal{X}. \tag{MVI}$$
     Here $\mathcal{X} \subseteq \mathbb{R}^d$ is a closed convex set and $V \colon \mathbb{R}^d \to \mathbb{R}^d$ is a vector field.
  6. Illustration
     (SVI): $V(x^\star)$ belongs to the dual cone $\mathrm{DC}(x^\star)$ of $\mathcal{X}$ at $x^\star$ [local condition].
     (MVI): $V(x)$ forms an acute angle with the tangent vector $x - x^\star \in \mathrm{TC}(x^\star)$ [global condition].
  7. Example: Function Minimization
     $$\min_x f(x) \quad \text{subject to } x \in \mathcal{X}$$
     with $f \colon \mathcal{X} \to \mathbb{R}$ a differentiable function to minimize. Let $V = \nabla f$.
     (SVI): $\forall x \in \mathcal{X},\ \langle \nabla f(x^\star), x - x^\star \rangle \ge 0$ [first-order optimality]
     (MVI): $\forall x \in \mathcal{X},\ \langle \nabla f(x), x - x^\star \rangle \ge 0$ [$x^\star$ is a minimizer of $f$]
     If $f$ is convex, (SVI) and (MVI) are equivalent; see the one-line argument below.
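A one-line reasoning step, not spelled out on the slide, for why convexity makes (SVI) imply (MVI): first-order optimality makes $x^\star$ a global minimizer, and the convexity inequality $f(x^\star) \ge f(x) + \langle \nabla f(x), x^\star - x \rangle$ gives

```latex
\langle \nabla f(x), x - x^\star \rangle
  \;\ge\; f(x) - f(x^\star)    % by convexity of f
  \;\ge\; 0                    % x^* is a global minimizer of f on X
\quad \text{for all } x \in \mathcal{X}.
```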
  8. Example: Saddle Point Problem
     Find $x^\star = (\theta^\star, \varphi^\star)$ such that $L(\theta^\star, \varphi) \le L(\theta^\star, \varphi^\star) \le L(\theta, \varphi^\star)$ for all $\theta \in \Theta$ and all $\varphi \in \Phi$.
     Here $\mathcal{X} \equiv \Theta \times \Phi$ and $L \colon \mathcal{X} \to \mathbb{R}$ is differentiable. Let $V = (\nabla_\theta L, -\nabla_\varphi L)$.
     (SVI): $\forall (\theta, \varphi) \in \mathcal{X},\ \langle \nabla_\theta L(x^\star), \theta - \theta^\star \rangle - \langle \nabla_\varphi L(x^\star), \varphi - \varphi^\star \rangle \ge 0$ [stationarity]
     (MVI): $\forall (\theta, \varphi) \in \mathcal{X},\ \langle \nabla_\theta L(x), \theta - \theta^\star \rangle - \langle \nabla_\varphi L(x), \varphi - \varphi^\star \rangle \ge 0$ [saddle point]
     If $L$ is convex-concave, (SVI) and (MVI) are equivalent.
  9. Monotonicity
     The solutions of (SVI) and (MVI) coincide when $V$ is continuous and monotone, i.e.,
     $$\langle V(x') - V(x), x' - x \rangle \ge 0 \quad \text{for all } x, x' \in \mathbb{R}^d.$$
     In the two examples above, this corresponds to $f$ being convex or $L$ being convex-concave.
     The operator analogue of strong convexity is strong monotonicity:
     $$\langle V(x') - V(x), x' - x \rangle \ge \alpha \|x' - x\|^2 \quad \text{for some } \alpha > 0 \text{ and all } x, x' \in \mathbb{R}^d.$$
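A quick numerical illustration (ours, not from the slides): the field $V(\theta, \varphi) = (\varphi, -\theta)$ of the bilinear saddle problem $\min_\theta \max_\varphi \theta\varphi$ is monotone. In fact the inner product $\langle V(x') - V(x), x' - x \rangle$ is identically zero here, so $V$ is monotone but not strongly monotone.

```python
# Monotonicity check for V(theta, phi) = (phi, -theta): the inner product
# <V(x') - V(x), x' - x> is identically 0, so V is monotone (with alpha = 0).
import numpy as np

V = lambda x: np.array([x[1], -x[0]])
rng = np.random.default_rng(0)
for _ in range(1000):
    x, xp = rng.standard_normal(2), rng.standard_normal(2)
    assert np.dot(V(xp) - V(x), xp - x) >= -1e-12   # monotone (here exactly 0)
```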
  10. Extra-Gradient
  11. From Forward-Backward to Extra-Gradient
     Forward-backward:
     $$X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t V(X_t)) \tag{FB}$$
     Extra-Gradient [Korpelevich 1976]:
     $$X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t V(X_t)), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t V(X_{t+1/2})) \tag{EG}$$
     The Extra-Gradient method anticipates the landscape of $V$ by taking an extrapolation step to reach the leading state $X_{t+1/2}$.
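To make the updates concrete, here is a minimal Python sketch of (EG). This is not the authors' code: `V`, `proj`, and the default step size are illustrative, with `proj` standing for the Euclidean projection $\Pi_{\mathcal{X}}$ (the identity map when $\mathcal{X} = \mathbb{R}^d$). Note the two oracle calls per iteration; removing one of them is the subject of the main section below.

```python
# A minimal sketch of (EG); V is the oracle, proj the Euclidean projection onto X.
import numpy as np

def extra_gradient(V, proj, x0, gamma=0.1, steps=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x_half = proj(x - gamma * V(x))     # extrapolation to the leading state X_{t+1/2}
        x = proj(x - gamma * V(x_half))     # update using the gradient at X_{t+1/2}
    return x                                # two oracle calls (V(x), V(x_half)) per iteration
```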
  12. From Forward-Backward to Extra-Gradient
     Forward-backward does not converge in bilinear games, while Extra-Gradient does:
     $$\min_{\theta \in \mathbb{R}} \max_{\varphi \in \mathbb{R}} \; \theta\varphi$$
     [Figure: iterate trajectories. Left: Forward-backward. Right: Extra-Gradient.]
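This contrast is easy to check numerically. The following self-contained snippet (an illustration, not the paper's experiment) runs both methods on $\min_\theta \max_\varphi \theta\varphi$, whose field is $V(\theta, \varphi) = (\varphi, -\theta)$ with solution $x^\star = 0$.

```python
# Self-contained illustration: on the bilinear game min_theta max_phi theta*phi
# the field is V(x) = (x2, -x1) and the solution is x* = 0.
# FB drifts away from x*, while EG converges to it.
import numpy as np

V = lambda x: np.array([x[1], -x[0]])
gamma = 0.1
x_fb = np.array([1.0, 1.0])     # forward-backward iterate
x_eg = np.array([1.0, 1.0])     # extra-gradient iterate
for _ in range(200):
    x_fb = x_fb - gamma * V(x_fb)       # (FB): one plain gradient step
    x_half = x_eg - gamma * V(x_eg)     # (EG): extrapolation step
    x_eg = x_eg - gamma * V(x_half)     # (EG): update at the leading state
print(np.linalg.norm(x_fb))   # grows: factor sqrt(1 + gamma^2) per step
print(np.linalg.norm(x_eg))   # shrinks: factor sqrt(1 - gamma^2 + gamma^4) per step
```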
  13. Stochastic Oracle
     If a stochastic oracle is involved:
     $$X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_t), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_{t+1/2})$$
     with $\hat{V}_t = V(X_t) + Z_t$ satisfying (and likewise for $\hat{V}_{t+1/2}$):
     (a) Zero mean: $\mathbb{E}[Z_t \mid \mathcal{F}_t] = 0$.
     (b) Bounded variance: $\mathbb{E}[\|Z_t\|^2 \mid \mathcal{F}_t] \le \sigma^2$.
     Here $(\mathcal{F}_t)_{t \in \mathbb{N}/2}$ is the natural filtration associated with the stochastic process $(X_t)_{t \in \mathbb{N}/2}$.
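One common instantiation of this oracle model, as a hedged sketch: the assumptions only require zero mean and bounded variance, and Gaussian noise with the function name `noisy_oracle` are our choices, not the paper's.

```python
# hat(V)_t = V(X_t) + Z_t with Gaussian Z_t: E[Z_t | F_t] = 0 and
# E[||Z_t||^2 | F_t] = d * sigma^2, a bounded variance as required by (a)-(b).
import numpy as np

rng = np.random.default_rng(0)

def noisy_oracle(V, x, sigma=0.1):
    return V(x) + sigma * rng.standard_normal(np.shape(x))
```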
  14. Convergence Metrics
     Ergodic convergence: restricted error function
     $$\mathrm{Err}_R(\hat{x}) = \max_{x \in \mathcal{X}_R} \langle V(x), \hat{x} - x \rangle, \quad \text{where } \mathcal{X}_R \equiv \mathcal{X} \cap \mathbb{B}_R(0) = \{x \in \mathcal{X} : \|x\| \le R\}.$$
     Last-iterate convergence: squared distance $\mathrm{dist}(\hat{x}, \mathcal{X}^\star)^2$.
     Lemma [Nesterov 2007]: Assume $V$ is monotone. If $x^\star$ is a solution of (SVI), then $\mathrm{Err}_R(x^\star) = 0$ for all sufficiently large $R$. Conversely, if $\mathrm{Err}_R(\hat{x}) = 0$ for large enough $R > 0$ and some $\hat{x} \in \mathcal{X}_R$, then $\hat{x}$ is a solution of (SVI).
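A worked example, not on the slide, assuming the bilinear field $V(x) = (x_2, -x_1)$ on $\mathcal{X} = \mathbb{R}^2$ (so $\mathcal{X}_R$ is the ball of radius $R$ and $x^\star = 0$):

```latex
% Since <V(x), x> = 0, the maximand is linear in x:
\langle V(x), \hat{x} - x \rangle
  = \hat{x}_1 x_2 - \hat{x}_2 x_1
  = \langle (-\hat{x}_2, \hat{x}_1), x \rangle,
\qquad
\mathrm{Err}_R(\hat{x})
  = \max_{\|x\| \le R} \langle (-\hat{x}_2, \hat{x}_1), x \rangle
  = R \, \|\hat{x}\|,
% which vanishes precisely at the unique solution x^* = 0, as the lemma predicts.
```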
  15. Literature Review
     We further suppose that $V$ is $\beta$-Lipschitz.
     | Reference            | Convergence type          | Hypothesis                                               |
     | Korpelevich 1976     | Last iterate, asymptotic  | Pseudo-monotone                                          |
     | Tseng 1995           | Last iterate, geometric   | Monotone + error bound (e.g., strongly monotone, affine) |
     | Nemirovski 2004      | Ergodic, O(1/t)           | Monotone                                                 |
     | Juditsky et al. 2011 | Ergodic, O(1/√t)          | Stochastic, monotone                                     |
  16. In Deep Learning
     Extra-Gradient (EG) needs two oracle calls per iteration, and gradient computations can be very costly for deep models.
     What if we drop one oracle call per iteration?
  17. Single-Call Extra-Gradient [Main Focus]
  18. Algorithms
     1. Past Extra-Gradient [Popov 1980]:
        $$X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_{t-1/2}), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_{t+1/2}) \tag{PEG}$$
     2. Reflected Gradient [Malitsky 2015]:
        $$X_{t+1/2} = X_t - (X_{t-1} - X_t), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_{t+1/2}) \tag{RG}$$
     3. Optimistic Gradient [Daskalakis et al. 2018]:
        $$X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_{t-1/2}), \qquad X_{t+1} = X_{t+1/2} + \gamma_t \hat{V}_{t-1/2} - \gamma_t \hat{V}_{t+1/2} \tag{OG}$$
     A runnable sketch of (PEG) follows the list.
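As referenced above, here is a minimal Python sketch of (PEG). It is not the authors' implementation: `V_hat` is an illustrative name for the (possibly stochastic) oracle and `proj` for $\Pi_{\mathcal{X}}$. The single oracle call per iteration is visible: the extrapolation step reuses $\hat{V}_{t-1/2}$ from the previous iteration.

```python
# A minimal sketch of (PEG): one oracle call per iteration, since the
# extrapolation step reuses hat(V)_{t-1/2} from the previous iteration.
import numpy as np

def past_extra_gradient(V_hat, proj, x1, gamma=0.1, steps=1000):
    x = np.asarray(x1, dtype=float)
    v_prev = np.zeros_like(x)               # convention hat(V)_{1/2} = 0
    for _ in range(steps):
        x_half = proj(x - gamma * v_prev)   # extrapolate with the past oracle value
        v_prev = V_hat(x_half)              # the single oracle call of this iteration
        x = proj(x - gamma * v_prev)        # update with hat(V)_{t+1/2}
    return x
```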
  19. A First Result
     Each method can be written as the two-step template
     $$X_{t+1/2} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_t), \qquad X_{t+1} = \Pi_{\mathcal{X}}(X_t - \gamma_t \hat{V}_{t+1/2})$$
     with the following proxies:
     PEG: [Step 1] $\hat{V}_t \leftarrow \hat{V}_{t-1/2}$
     RG: [Step 1] $\hat{V}_t \leftarrow (X_{t-1} - X_t)/\gamma_t$; no projection
     OG: [Step 1] $\hat{V}_t \leftarrow \hat{V}_{t-1/2}$; [Step 2] $X_t \leftarrow X_{t+1/2} + \gamma_t \hat{V}_{t-1/2}$; no projection
     Proposition: Suppose the single-call Extra-Gradient (1-EG) methods above share the same initialization $X_0 = X_1 \in \mathcal{X}$, $\hat{V}_{1/2} = 0$, and the same constant step size $(\gamma_t)_{t \in \mathbb{N}} \equiv \gamma$. If $\mathcal{X} = \mathbb{R}^d$, the generated iterates $X_t$ coincide for all $t \ge 1$.
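The Proposition can be checked numerically. Below is a self-contained script (ours, with illustrative names) that runs PEG, RG, and OG on the unconstrained bilinear field $V(\theta, \varphi) = (\varphi, -\theta)$ with the stated initialization $X_0 = X_1$, $\hat{V}_{1/2} = 0$, and a constant step size, and confirms that the three iterate sequences coincide.

```python
# Check that PEG, RG, and OG generate identical iterates when X = R^d.
import numpy as np

def V(x):
    theta, phi = x
    return np.array([phi, -theta])

def run(method, x1, gamma=0.1, steps=50):
    x = x1.astype(float).copy()          # X_1
    x_prev = x.copy()                    # X_0 = X_1
    v_prev = np.zeros_like(x)            # hat(V)_{1/2} = 0
    traj = []
    for _ in range(steps):
        if method == "RG":
            x_half = 2 * x - x_prev      # reflection step, no oracle value reused
        else:                            # PEG and OG extrapolate with the past gradient
            x_half = x - gamma * v_prev
        v_half = V(x_half)               # the single oracle call of this iteration
        if method == "OG":
            x_next = x_half + gamma * v_prev - gamma * v_half
        else:                            # X = R^d, so the projection is the identity
            x_next = x - gamma * v_half
        x_prev, x, v_prev = x, x_next, v_half
        traj.append(x.copy())
    return np.array(traj)

x1 = np.array([1.0, 1.0])
peg, rg, og = (run(m, x1) for m in ("PEG", "RG", "OG"))
assert np.allclose(peg, rg) and np.allclose(peg, og)
print("PEG, RG, OG iterates coincide on R^d, as the Proposition states.")
```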
  20. Global Convergence Rates
     Always under Lipschitz continuity. Stochastic strongly monotone: step size in $O(1/t)$. New results!
     |               | Monotone: Ergodic | Monotone: Last iterate | Strongly monotone: Ergodic | Strongly monotone: Last iterate |
     | Deterministic | 1/t               | Unknown                | 1/t                        | $e^{-\rho t}$                   |
     | Stochastic    | 1/√t              | Unknown                | 1/t                        | 1/t                             |
  21. Proof Ingredients
     Descent lemma [deterministic + monotone]: there exists $(\mu_t)_{t \in \mathbb{N}} \in \mathbb{R}_+^{\mathbb{N}}$ such that for all $p \in \mathcal{X}$,
     $$\|X_{t+1} - p\|^2 + \mu_{t+1} \le \|X_t - p\|^2 - 2\gamma \langle V(X_{t+1/2}), X_{t+1/2} - p \rangle + \mu_t.$$
     Descent lemma [stochastic + strongly monotone]: let $x^\star$ be the unique solution of (SVI). There exist $(\mu_t)_{t \in \mathbb{N}} \in \mathbb{R}_+^{\mathbb{N}}$ and $M \in \mathbb{R}_+$ such that
     $$\mathbb{E}[\|X_{t+1} - x^\star\|^2] + \mu_{t+1} \le (1 - \alpha\gamma_t)(\mathbb{E}[\|X_t - x^\star\|^2] + \mu_t) + M\gamma_t^2 \sigma^2.$$
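A standard final step, not spelled out on the slide: setting $a_t = \mathbb{E}[\|X_t - x^\star\|^2] + \mu_t$ and $\gamma_t = \gamma/(t+b)$ with $\alpha\gamma > 1$, the second lemma becomes a recursion that telescopes to the $O(1/t)$ rate in the table above:

```latex
a_{t+1} \le \Bigl(1 - \tfrac{\alpha\gamma}{t+b}\Bigr) a_t + \frac{M \gamma^2 \sigma^2}{(t+b)^2}
\;\Longrightarrow\;
a_t \le \frac{C}{t+b},
\qquad
C = \max\Bigl\{ (1+b)\, a_1, \; \frac{M \gamma^2 \sigma^2}{\alpha\gamma - 1} \Bigr\},
% proved by induction on t; the induction step uses alpha * gamma > 1.
```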
  22. Regular Solution
     Definition [regular solution]: $x^\star$ is a regular solution of (SVI) if $V$ is $C^1$-smooth in a neighborhood of $x^\star$ and the Jacobian $\mathrm{Jac}_V(x^\star)$ is positive-definite along rays emanating from $x^\star$, i.e.,
     $$z^\top \mathrm{Jac}_V(x^\star)\, z \equiv \sum_{i,j=1}^{d} z_i \frac{\partial V_i}{\partial x_j}(x^\star)\, z_j > 0$$
     for all $z \in \mathbb{R}^d \setminus \{0\}$ tangent to $\mathcal{X}$ at $x^\star$.
     Compare with positive definiteness of the Hessian along qualified constraints in minimization, and with differential equilibria in games. This is a localization of strong monotonicity.
  23. Local Convergence
     Theorem [local convergence for stochastic non-monotone operators]: let $x^\star$ be a regular solution of (SVI) and fix a tolerance level $\delta > 0$. Suppose (PEG) is run with step sizes of the form $\gamma_t = \gamma/(t+b)$ for large enough $\gamma$ and $b$. Then:
     (a) There are neighborhoods $U$ and $U_1$ of $x^\star$ in $\mathcal{X}$ such that, if $X_{1/2} \in U$ and $X_1 \in U_1$, the event $E_\infty = \{X_{t+1/2} \in U \text{ for all } t = 1, 2, \dots\}$ occurs with probability at least $1 - \delta$.
     (b) Conditioned on this event, $\mathbb{E}[\|X_t - x^\star\|^2 \mid E_\infty] = O(1/t)$.
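A sketch (ours) of the theorem's step-size schedule plugged into the (PEG) loop from earlier; the values $\gamma = 1$ and $b = 15$ are illustrative, since the theorem only asks that they be large enough ($b = 15$ matches the experiments on the next slide).

```python
# (PEG) with the vanishing step sizes gamma_t = gamma / (t + b) of the theorem.
# gamma = 1.0 and b = 15 are illustrative choices, not prescribed by the paper.
import numpy as np

def peg_decreasing(V_hat, proj, x1, gamma=1.0, b=15, steps=10_000):
    x = np.asarray(x1, dtype=float)
    v_prev = np.zeros_like(x)                # hat(V)_{1/2} = 0
    for t in range(1, steps + 1):
        g = gamma / (t + b)                  # gamma_t = gamma / (t + b)
        x_half = proj(x - g * v_prev)        # extrapolation with the past oracle value
        v_prev = V_hat(x_half)               # single oracle call of iteration t
        x = proj(x - g * v_prev)             # update
    return x
```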
  24. Experiments
     $$L(\theta, \varphi) = \frac{\epsilon_1}{2}\,\theta^\top A_1 \theta + \frac{\epsilon_2}{2}\,(\theta^\top A_2 \theta)^2 - \frac{\epsilon_1}{2}\,\varphi^\top B_1 \varphi - \frac{\epsilon_2}{2}\,(\varphi^\top B_2 \varphi)^2 + 4\,\theta^\top C \varphi$$
     [Figure: $\|x - x^\star\|^2$ vs. number of oracle calls, for EG and 1-EG at several step sizes $\gamma$.]
     (a) Strongly monotone ($\epsilon_1 = 1$, $\epsilon_2 = 0$): deterministic, last iterate.
     (b) Monotone ($\epsilon_1 = 0$, $\epsilon_2 = 1$): deterministic, ergodic.
     (c) Non-monotone ($\epsilon_1 = 1$, $\epsilon_2 = -1$): stochastic, $Z_t \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2 = 0.01)$, last iterate ($b = 15$).
  25. Conclusion and Perspectives
  26. Conclusion
     Single-call rates match two-call rates.
     Localization of the stochastic guarantee.
     Last-iterate convergence: a first step into the non-monotone world.
     Some research directions: Bregman, universal, ...
  27. Bibliography
     Daskalakis, Constantinos, et al. (2018). "Training GANs with optimism". In: ICLR '18: Proceedings of the 2018 International Conference on Learning Representations.
     Juditsky, Anatoli, Arkadi Semen Nemirovski, and Claire Tauvel (2011). "Solving variational inequalities with stochastic mirror-prox algorithm". In: Stochastic Systems 1.1, pp. 17–58.
     Korpelevich, G. M. (1976). "The extragradient method for finding saddle points and other problems". In: Èkonom. i Mat. Metody 12, pp. 747–756.
     Malitsky, Yura (2015). "Projected reflected gradient methods for monotone variational inequalities". In: SIAM Journal on Optimization 25.1, pp. 502–520.
     Nemirovski, Arkadi Semen (2004). "Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems". In: SIAM Journal on Optimization 15.1, pp. 229–251.
     Nesterov, Yurii (2007). "Dual extrapolation and its applications to solving variational inequalities and related problems". In: Mathematical Programming 109.2, pp. 319–344.
     Popov, Leonid Denisovich (1980). "A modification of the Arrow–Hurwicz method for search of saddle points". In: Mathematical Notes of the Academy of Sciences of the USSR 28.5, pp. 845–848.
     Tseng, Paul (1995). "On linear convergence of iterative methods for the variational inequality problem". In: Journal of Computational and Applied Mathematics 60.1-2, pp. 237–252.
