On the Convergence of Single-Call Stochastic Extra-Gradient Methods
Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
NeurIPS, December 2019
Outline:
1 Variational Inequality
2 Extra-Gradient
3 Single-call Extra-Gradient [Main Focus]
4 Conclusion
Variational Inequality
Introduction: Variational Inequalities in Machine Learning

Generative adversarial network (GAN):
    min_θ max_φ E_{x∼p_data}[log D_φ(x)] + E_{z∼p_Z}[log(1 − D_φ(G_θ(z)))].

More min-max (saddle point) problems: distributionally robust learning, primal-dual formulation in optimization, . . .
Search for equilibria: games, multi-agent reinforcement learning, . . .
Definition

Stampacchia variational inequality:
    Find x⋆ ∈ X such that ⟨V(x⋆), x − x⋆⟩ ≥ 0 for all x ∈ X.    (SVI)

Minty variational inequality:
    Find x⋆ ∈ X such that ⟨V(x), x − x⋆⟩ ≥ 0 for all x ∈ X.    (MVI)

Here X ⊆ R^d is a closed convex set and V : R^d → R^d is a vector field.
Illustration

SVI: V(x⋆) belongs to the dual cone DC(x⋆) of X at x⋆.    [ local ]
MVI: V(x) forms an acute angle with the tangent vector x − x⋆ ∈ TC(x⋆).    [ global ]
Example: Function Minimization

    min_x f(x)    subject to x ∈ X

with f : X → R a differentiable function to minimize. Let V = ∇f.

(SVI)  ∀x ∈ X, ⟨∇f(x⋆), x − x⋆⟩ ≥ 0    [ first-order optimality ]
(MVI)  ∀x ∈ X, ⟨∇f(x), x − x⋆⟩ ≥ 0    [ x⋆ is a minimizer of f ]

If f is convex, (SVI) and (MVI) are equivalent.
Example: Saddle Point Problem

Find x⋆ = (θ⋆, φ⋆) such that
    L(θ⋆, φ) ≤ L(θ⋆, φ⋆) ≤ L(θ, φ⋆) for all θ ∈ Θ and all φ ∈ Φ,

with X ≡ Θ × Φ and L : X → R a differentiable function. Let V = (∇_θL, −∇_φL).

(SVI)  ∀(θ, φ) ∈ X, ⟨∇_θL(x⋆), θ − θ⋆⟩ − ⟨∇_φL(x⋆), φ − φ⋆⟩ ≥ 0    [ stationary ]
(MVI)  ∀(θ, φ) ∈ X, ⟨∇_θL(x), θ − θ⋆⟩ − ⟨∇_φL(x), φ − φ⋆⟩ ≥ 0    [ saddle point ]

If L is convex-concave, (SVI) and (MVI) are equivalent.
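To make the operator view concrete, here is a minimal sketch (not from the talk) that builds V = (∇_θL, −∇_φL) by central finite differences for an arbitrary smooth L; the toy convex-concave loss at the bottom is an illustrative choice.

```python
import numpy as np

def saddle_operator(L, theta, phi, eps=1e-6):
    """Finite-difference approximation of V = (grad_theta L, -grad_phi L) at (theta, phi)."""
    g_theta = np.zeros_like(theta)
    g_phi = np.zeros_like(phi)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g_theta[i] = (L(theta + e, phi) - L(theta - e, phi)) / (2 * eps)
    for j in range(phi.size):
        e = np.zeros_like(phi); e[j] = eps
        g_phi[j] = (L(theta, phi + e) - L(theta, phi - e)) / (2 * eps)
    return np.concatenate([g_theta, -g_phi])   # descend in theta, ascend in phi

# Toy convex-concave loss (illustrative): L(theta, phi) = 0.5*||theta||^2 - 0.5*||phi||^2 + <theta, phi>
L = lambda th, ph: 0.5 * th @ th - 0.5 * ph @ ph + th @ ph
print(saddle_operator(L, np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # approx. [1, 1, -1, 1]
```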
Monotonicity

The solutions of (SVI) and (MVI) coincide when V is continuous and monotone, i.e.,
    ⟨V(x′) − V(x), x′ − x⟩ ≥ 0 for all x, x′ ∈ R^d.

In the two examples above, this corresponds to f being convex or L being convex-concave.

The operator analogue of strong convexity is strong monotonicity:
    ⟨V(x′) − V(x), x′ − x⟩ ≥ α ∥x′ − x∥² for some α > 0 and all x, x′ ∈ R^d.
Extra-Gradient
From Forward-backward to Extra-Gradient

Forward-backward:
    X_{t+1} = Π_X(X_t − γ_t V(X_t))    (FB)

Extra-Gradient [Korpelevich 1976]:
    X_{t+1/2} = Π_X(X_t − γ_t V(X_t))
    X_{t+1} = Π_X(X_t − γ_t V(X_{t+1/2}))    (EG)

The Extra-Gradient method anticipates the landscape of V by taking an extrapolation step to reach the leading state X_{t+1/2} (a minimal sketch of both updates follows below).
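As a concrete reading of the two update rules, here is a minimal NumPy sketch of one (FB) step and one (EG) step in the unconstrained case (so Π_X is the identity); the operator V and step size gamma are placeholders supplied by the caller.

```python
import numpy as np

def fb_step(x, V, gamma):
    """One forward-backward step: X_{t+1} = X_t - gamma * V(X_t) (unconstrained, Pi_X = identity)."""
    return x - gamma * V(x)

def eg_step(x, V, gamma):
    """One extra-gradient step: extrapolate to X_{t+1/2}, then update X_t with the oracle at X_{t+1/2}."""
    x_half = x - gamma * V(x)       # extrapolation: the leading state X_{t+1/2}
    return x - gamma * V(x_half)    # update from X_t using V(X_{t+1/2})
```

With a constraint set X, both right-hand sides would additionally be passed through the projection Π_X, as on the slide.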
From Forward-backward to Extra-Gradient

Forward-backward does not converge in bilinear games, while Extra-Gradient does:
    min_{θ∈R} max_{φ∈R} θφ.

[Figure: trajectories on this bilinear game. Left: Forward-backward (non-convergent). Right: Extra-Gradient (converging to the solution).]

A simulation sketch follows below.
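To make the comparison reproducible, here is a minimal sketch (not the authors' code) that runs both methods on min_θ max_φ θφ, for which V(θ, φ) = (φ, −θ); the initial point, step size, and horizon are arbitrary illustrative choices.

```python
import numpy as np

def V(x):
    """Operator of min_theta max_phi theta*phi: V = (dL/dtheta, -dL/dphi) = (phi, -theta)."""
    theta, phi = x
    return np.array([phi, -theta])

def run(method, x0=(1.0, 1.0), gamma=0.2, steps=100):
    x = np.array(x0)
    for _ in range(steps):
        if method == "fb":                # forward-backward: plain step
            x = x - gamma * V(x)
        else:                             # extra-gradient: extrapolate, then step
            x_half = x - gamma * V(x)
            x = x - gamma * V(x_half)
    return np.linalg.norm(x)              # distance to the solution (0, 0)

print("FB:", run("fb"))   # grows with the number of steps: FB drifts away from the origin
print("EG:", run("eg"))   # shrinks: EG converges to the origin
```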
Stochastic Oracle

If a stochastic oracle is involved:
    X_{t+1/2} = Π_X(X_t − γ_t V̂_t)
    X_{t+1} = Π_X(X_t − γ_t V̂_{t+1/2})

with V̂_t = V(X_t) + Z_t satisfying (and likewise for V̂_{t+1/2}):
a) Zero mean: E[Z_t | F_t] = 0.
b) Bounded variance: E[∥Z_t∥² | F_t] ≤ σ².

Here (F_t)_{t∈N/2} is the natural filtration associated with the stochastic process (X_t)_{t∈N/2}.
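A minimal sketch of how such a noisy oracle can be built from a deterministic operator; the additive Gaussian noise model and the parameter sigma are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def make_stochastic_oracle(V, sigma, seed=0):
    """Wrap V into V_hat(x) = V(x) + Z with E[Z] = 0 and E[||Z||^2] = d * sigma^2 (bounded variance)."""
    rng = np.random.default_rng(seed)
    def V_hat(x):
        return V(x) + sigma * rng.standard_normal(np.shape(x))
    return V_hat

# Example: a noisy version of the identity operator on R^2.
V_hat = make_stochastic_oracle(lambda x: x, sigma=0.1)
print(V_hat(np.array([1.0, -1.0])))
```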
Convergence Metrics

Ergodic convergence: restricted error function
    Err_R(x̂) = max_{x∈X_R} ⟨V(x), x̂ − x⟩,
where X_R ≡ X ∩ B_R(0) = {x ∈ X : ∥x∥ ≤ R}.

Last iterate convergence: squared distance dist(x̂, X⋆)².

Lemma [Nesterov 2007]
Assume V is monotone. If x⋆ is a solution of (SVI), we have Err_R(x⋆) = 0 for all sufficiently large R. Conversely, if Err_R(x̂) = 0 for large enough R > 0 and some x̂ ∈ X_R, then x̂ is a solution of (SVI).
Literature Review

We further suppose that V is β-Lipschitz.

                       | Convergence type          | Hypothesis
Korpelevich 1976       | Last iterate, asymptotic  | Pseudo-monotone
Tseng 1995             | Last iterate, geometric   | Monotone + error bound (e.g., strongly monotone, affine)
Nemirovski 2004        | Ergodic, O(1/t)           | Monotone
Juditsky et al. 2011   | Ergodic, O(1/√t)          | Stochastic monotone
In Deep Learning

Extra-Gradient (EG) needs two oracle calls per iteration, while gradient computations can be very costly for deep models.

What if we drop one oracle call per iteration?
Single-call Extra-Gradient [Main Focus]
Algorithms

1 Past Extra-Gradient [Popov 1980]
    X_{t+1/2} = Π_X(X_t − γ_t V̂_{t−1/2})
    X_{t+1} = Π_X(X_t − γ_t V̂_{t+1/2})    (PEG)

2 Reflected Gradient [Malitsky 2015]
    X_{t+1/2} = X_t − (X_{t−1} − X_t)
    X_{t+1} = Π_X(X_t − γ_t V̂_{t+1/2})    (RG)

3 Optimistic Gradient [Daskalakis et al. 2018]
    X_{t+1/2} = Π_X(X_t − γ_t V̂_{t−1/2})
    X_{t+1} = X_{t+1/2} + γ_t V̂_{t−1/2} − γ_t V̂_{t+1/2}    (OG)

A minimal sketch of all three updates follows below.
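Here is a minimal NumPy reading of one iteration of each single-call method (a sketch, not the authors' implementation); proj stands in for Π_X and defaults to the identity, and the state carried between iterations (previous oracle value or previous iterate) is returned explicitly.

```python
import numpy as np

def peg_step(x, v_prev_half, V_hat, gamma, proj=lambda y: y):
    """Past Extra-Gradient (PEG): reuse the previous oracle value as the extrapolation direction."""
    x_half = proj(x - gamma * v_prev_half)
    v_half = V_hat(x_half)                      # the single oracle call of this iteration
    return proj(x - gamma * v_half), v_half     # v_half is carried over as V_hat_{t-1/2}

def rg_step(x, x_prev, V_hat, gamma, proj=lambda y: y):
    """Reflected Gradient (RG): extrapolate by reflecting the previous iterate, X_{t+1/2} = 2 X_t - X_{t-1}."""
    x_half = x - (x_prev - x)
    return proj(x - gamma * V_hat(x_half)), x   # current x is carried over as X_{t-1}

def og_step(x, v_prev_half, V_hat, gamma, proj=lambda y: y):
    """Optimistic Gradient (OG): correct the leading state X_{t+1/2} instead of re-projecting from X_t."""
    x_half = proj(x - gamma * v_prev_half)
    v_half = V_hat(x_half)
    return x_half + gamma * v_prev_half - gamma * v_half, v_half

# Example use on the unconstrained bilinear game V(theta, phi) = (phi, -theta):
V = lambda x: np.array([x[1], -x[0]])
x, v_prev = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    x, v_prev = peg_step(x, v_prev, V, gamma=0.2)
print(np.linalg.norm(x))   # decreases toward 0: PEG handles the bilinear game
```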
A First Result

Each method is obtained from the stochastic Extra-Gradient template
    X_{t+1/2} = Π_X(X_t − γ_t V̂_t)
    X_{t+1} = Π_X(X_t − γ_t V̂_{t+1/2})
by replacing the first oracle call (and, where indicated, dropping a projection) with a proxy:

PEG: [Step 1] V̂_t ← V̂_{t−1/2}
RG:  [Step 1] V̂_t ← (X_{t−1} − X_t)/γ_t; no projection
OG:  [Step 1] V̂_t ← V̂_{t−1/2}  [Step 2] X_t ← X_{t+1/2} + γ_t V̂_{t−1/2}; no projection

Proposition
Suppose that the Single-call Extra-Gradient (1-EG) methods presented above share the same initialization X_0 = X_1 ∈ X, V̂_{1/2} = 0, and the same constant step size (γ_t)_{t∈N} ≡ γ. If X = R^d, the generated iterates X_t coincide for all t ≥ 1. (A numerical check follows below.)
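A small self-contained numerical illustration of the proposition under its assumptions (X = R^d, X_0 = X_1, V̂_{1/2} = 0, constant γ, noiseless oracle); the bilinear operator, step size, and horizon are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative monotone operator (bilinear game): V(theta, phi) = (phi, -theta).
V = lambda x: np.array([x[1], -x[0]])
gamma, T = 0.2, 50
x0 = np.array([1.0, 1.0])

x_peg, v_peg = x0.copy(), np.zeros(2)        # PEG state: iterate + previous oracle value
x_og,  v_og  = x0.copy(), np.zeros(2)        # OG state:  iterate + previous oracle value
x_rg,  x_rg_prev = x0.copy(), x0.copy()      # RG state:  iterate + previous iterate (X_0 = X_1)

for _ in range(T):
    h = x_peg - gamma * v_peg; v = V(h); x_peg, v_peg = x_peg - gamma * v, v           # PEG
    h = x_og - gamma * v_og;   v = V(h); x_og, v_og = h + gamma * v_og - gamma * v, v  # OG
    h = 2 * x_rg - x_rg_prev;  x_rg, x_rg_prev = x_rg - gamma * V(h), x_rg             # RG

print(np.allclose(x_peg, x_og), np.allclose(x_peg, x_rg))   # expected: True True
```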
Global Convergence Rate

Always with Lipschitz continuity.
Stochastic strongly monotone: step size in O(1/t).

New results!

                  Monotone                     Strongly monotone
                  Ergodic    Last iterate      Ergodic    Last iterate
Deterministic     1/t        Unknown           1/t        e^{−ρt}
Stochastic        1/√t       Unknown           1/t        1/t
Proof Ingredients

Descent Lemma [Deterministic + Monotone]
There exists (µ_t)_{t∈N} ∈ R_+^N such that for all p ∈ X,
    ∥X_{t+1} − p∥² + µ_{t+1} ≤ ∥X_t − p∥² − 2γ ⟨V(X_{t+1/2}), X_{t+1/2} − p⟩ + µ_t.

Descent Lemma [Stochastic + Strongly Monotone]
Let x⋆ be the unique solution of (SVI). There exist (µ_t)_{t∈N} ∈ R_+^N and M ∈ R_+ such that
    E[∥X_{t+1} − x⋆∥²] + µ_{t+1} ≤ (1 − αγ_t)(E[∥X_t − x⋆∥²] + µ_t) + M γ_t² σ².
Regular Solution

Definition [Regular Solution]
We say that x⋆ is a regular solution of (SVI) if V is C¹-smooth in a neighborhood of x⋆ and the Jacobian Jac_V(x⋆) is positive-definite along rays emanating from x⋆, i.e.,
    z⊺ Jac_V(x⋆) z ≡ ∑_{i,j=1}^{d} z_i (∂V_i/∂x_j)(x⋆) z_j > 0
for all z ∈ R^d ∖ {0} that are tangent to X at x⋆.

To be compared with:
positive definiteness of the Hessian along qualified constraints in minimization;
differential equilibrium in games.

This is a localization of strong monotonicity.
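As a hedged illustration of the definition in the unconstrained case (where every direction z is tangent), positive-definiteness along rays amounts to the symmetric part of the Jacobian being positive definite; the linear operator below is a made-up example, not one from the paper.

```python
import numpy as np

# Hypothetical smooth operator V(x) = A x with non-symmetric A; its unique zero x* = 0 solves (SVI) on R^2.
A = np.array([[1.0, 1.0],
              [-1.0, 1.0]])

# Unconstrained case: z^T Jac_V(x*) z > 0 for all z != 0 iff the symmetric part of Jac_V(x*) = A is positive definite.
sym_part = (A + A.T) / 2
print(np.linalg.eigvalsh(sym_part))   # [1., 1.]: all positive, so x* = 0 is a regular solution
```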
Local Convergence

Theorem [Local convergence for stochastic non-monotone operators]
Let x⋆ be a regular solution of (SVI) and fix a tolerance level δ > 0. Suppose (PEG) is run with step sizes of the form γ_t = γ/(t + b) for large enough γ and b. Then:
(a) There are neighborhoods U and U₁ of x⋆ in X such that, if X_{1/2} ∈ U and X_1 ∈ U₁, the event
        E_∞ = {X_{t+1/2} ∈ U for all t = 1, 2, ...}
    occurs with probability at least 1 − δ.
(b) Conditioned on this event, E[∥X_t − x⋆∥² | E_∞] = O(1/t).
Experiments

Test problem, with parameters ε₁, ε₂ selecting the regime:
    L(θ, φ) = (ε₁/2) θ⊺A₁θ + (ε₂/2)(θ⊺A₂θ)² − (ε₁/2) φ⊺B₁φ − (ε₂/2)(φ⊺B₂φ)² + (1/4) θ⊺Cφ
[Figure: ∥x − x⋆∥² versus number of oracle calls, for EG and 1-EG with several step sizes γ.
(a) Strongly monotone (ε₁ = 1, ε₂ = 0), deterministic, last iterate; γ ∈ {0.1, 0.2, 0.3, 0.4}.
(b) Monotone (ε₁ = 0, ε₂ = 1), deterministic, ergodic; γ ∈ {0.2, 0.4, 0.6, 0.8}.
(c) Non-monotone (ε₁ = 1, ε₂ = −1), stochastic with Z_t iid ∼ N(0, σ² = 0.01), last iterate (b = 15); γ ∈ {0.3, 0.6, 0.9, 1.2}.]
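The following sketch mimics the flavour of panel (a) without the paper's matrices: it compares EG (two oracle calls per iteration) with PEG, a 1-EG method (one call per iteration), on a hypothetical strongly monotone linear operator, tracking the squared distance to the solution for the same oracle-call budget.

```python
import numpy as np

# Hypothetical strongly monotone linear operator V(x) = A x (symmetric part = I); not the paper's test matrices.
A = np.array([[1.0, 2.0],
              [-2.0, 1.0]])
V = lambda x: A @ x                       # unique solution of (SVI) on R^2: x* = 0

gamma, budget = 0.1, 200                  # constant step size and total oracle-call budget
rng = np.random.default_rng(0)
x0 = rng.standard_normal(2)

x_eg, calls = x0.copy(), 0                # EG: two oracle calls per iteration
while calls < budget:
    x_half = x_eg - gamma * V(x_eg)
    x_eg = x_eg - gamma * V(x_half)
    calls += 2

x_peg, v_prev, calls = x0.copy(), np.zeros(2), 0   # PEG (1-EG): one oracle call per iteration
while calls < budget:
    x_half = x_peg - gamma * v_prev
    v_prev = V(x_half)
    x_peg = x_peg - gamma * v_prev
    calls += 1

print("EG :", np.linalg.norm(x_eg) ** 2)   # squared distance to x* = 0
print("PEG:", np.linalg.norm(x_peg) ** 2)  # at least comparable accuracy for the same oracle budget
```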
Conclusion and Perspectives
Conclusion
Single-call rates ∼ two-call rates.
Localization of the stochastic guarantees.
Last-iterate convergence: a first step toward the non-monotone world.
Some research directions: Bregman, universal, . . .
Bibliography
Daskalakis, Constantinos et al. (2018). “Training GANs with optimism”. In: ICLR ’18: Proceedings of the
2018 International Conference on Learning Representations.
Juditsky, Anatoli, Arkadi Semen Nemirovski, and Claire Tauvel (2011). “Solving variational inequalities with
stochastic mirror-prox algorithm”. In: Stochastic Systems 1.1, pp. 17–58.
Korpelevich, G. M. (1976). “The extragradient method for finding saddle points and other problems”. In:
Èkonom. i Mat. Metody 12, pp. 747–756.
Malitsky, Yura (2015). “Projected reflected gradient methods for monotone variational inequalities”. In:
SIAM Journal on Optimization 25.1, pp. 502–520.
Nemirovski, Arkadi Semen (2004). “Prox-method with rate of convergence O(1/t) for variational inequalities
with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems”. In:
SIAM Journal on Optimization 15.1, pp. 229–251.
Nesterov, Yurii (2007). “Dual extrapolation and its applications to solving variational inequalities and related
problems”. In: Mathematical Programming 109.2, pp. 319–344.
Popov, Leonid Denisovich (1980). “A modification of the Arrow–Hurwicz method for search of saddle
points”. In: Mathematical Notes of the Academy of Sciences of the USSR 28.5, pp. 845–848.
Tseng, Paul (June 1995). “On linear convergence of iterative methods for the variational inequality problem”.
In: Journal of Computational and Applied Mathematics 60.1-2, pp. 237–252.