Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite, number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a compact manifold search space. To this end, we show the developments on the Grassmann manifold. The key challenges of averaging, addition, and subtraction of multiple gradients are addressed with notions like the logarithm mapping and parallel translation of vectors on the Grassmann manifold. We present a global convergence analysis of the proposed algorithm with a decaying step-size, and a local convergence rate analysis with a fixed step-size under some natural assumptions. The proposed algorithm is applied to a number of problems on the Grassmann manifold, like principal component analysis, low-rank matrix completion, and Karcher mean computation. In all these cases, the proposed algorithm outperforms the standard Riemannian stochastic gradient descent algorithm.
Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)
1. Riemannian stochastic variance reduced gradient
on Grassmann manifold
Hiroyuki Kasai†, Hiroyuki Sato§, and Bamdev Mishra††
†The University of Electro-Communications, Japan
§Tokyo University of Science, Japan
††Amazon Development Centre India, India
August 10, 2016
Riemannian stochastic variance reduced gradient on Grassmann manifold (all copyrights owned by Kasai, Sato, and Mishra) 1
2. Summary (our contributions)
Address the stochastic gradient descent (SGD) algorithm for the empirical risk minimization problem

min_{w ∈ R^d} (1/n) Σ_{i=1}^{n} f_i(w).

Particularly, structured problems on manifolds, i.e., w ∈ M.
Propose Riemannian SVRG (R-SVRG).
Extend SVRG from the Euclidean space to Riemannian manifolds.
Give two analyses:
Global convergence analysis, and
Local convergence rate analysis.
Show the effectiveness of R-SVRG through numerical comparisons.
3. Stochastic gradient descent (SGD) (1)
Update in SGD:

w_k = w_{k−1} − α_k ∇f_{i_k}(w_{k−1}),

where w_{k−1} is the current point, i_k is a randomly sampled index, and ∇f_{i_k}(w_{k−1}) is the single gradient for the i_k-th sample (= stochastic gradient).
The stochastic gradient is an unbiased estimate of the full gradient:

E[∇f_i(w)] = (1/n) Σ_{i=1}^{n} ∇f_i(w) = ∇f(w).
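On a toy least-squares instance, the update above reads as follows in NumPy (the data, loop length, and step-size schedule are my own illustrative choices, not from the talk):

```python
import numpy as np

# Toy instance: f_i(w) = 0.5 * (a_i^T w - b_i)^2, so f(w) = (1/n) * sum_i f_i(w).
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def grad_fi(w, i):
    # Single (stochastic) gradient for the i-th sample.
    return A[i] * (A[i] @ w - b[i])

w = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(n)                  # random sample i_k
    alpha = 0.1 / (1 + 1e-3 * k)         # decaying step-size alpha_k
    w = w - alpha * grad_fi(w, i)

w_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

Averaging grad_fi over all i recovers the full gradient, which is the unbiasedness property stated above.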
4. Stochastic gradient descent (SGD) (2)
Features compared with full gradient descent (FGD)
Pros: high scalability to large-scale data
Iteration complexity is independent of n.
FGD shows linear complexity in n.
Cons: slow convergence
Decaying step-sizes are needed for convergence, to avoid
big fluctuations around a solution due to a large step-size, and
too slow convergence due to a too small step-size.
⇓
Sub-linear convergence rate: E[f(w_k)] − f(w*) ∈ O(1/k).
FGD shows f(w_k) − f(w*) ∈ O(c^k) with 0 < c < 1.
5. Speeding up of SGD: Variance reduction technique
Accelerate the convergence rate of SGD
[Mairal, 2015, Roux et al., 2012,
Shalev-Shwartz and Zhang, 2012,
Shalev-Shwartz and Zhang, 2013, Defazio et al., 2014,
Zhang and Xiao, 2014].
Stochastic variance reduced gradient (SVRG)
[Johnson and Zhang, 2013]
linear convergence rate for strongly-convex functions.
Various variants
[Garber and Hazan, 2015] analyze the convergence rate for
SVRG when f is a convex function that is a sum of
non-convex (but smooth) terms.
[Shalev-Shwartz, 2015] obtains similar results.
[Allen-Zhu and Yan, 2015] further study the same case with
better convergence rates.
[Shamir, 2015] studies specifically the convergence properties
of the variance reduction PCA algorithm.
Very recently, [Allen-Zhu and Hazan, 2016] propose a variance
reduction method for faster non-convex optimization.
6. Stochastic variance reduced gradient (SVRG) (1)
[Johnson and Zhang, 2013]
Motivations:
Reduce the variance of stochastic gradients.
Unlike SAG, no need to store all past gradients.
Instead, additional calculations of gradients are allowed.
Basic idea: a hybrid algorithm of SGD and FGD.
Periodically, calculate and store a full gradient.
Every iteration, adjust the stochastic gradient v by the latest full gradient to reduce its variance.
⇓
Linear convergence rate:

E[f(w̃^s)] − f(w*) ≤ α^s (E[f(w̃^0)] − f(w*))
7. Stochastic variance reduced gradient (SVRG) (2)
Simplified algorithm of SVRG
1: Initial iterate w^0_0 ∈ R^d.
2: for s = 1, 2, . . . (outer loop) do
3:   Store w̃ = w^{s−1}_{m_{s−1}} (the last inner iterate).
4:   Store the full gradient ∇f(w̃).
5:   for t = 1, 2, . . . , m_s (inner loop) do
6:     Calculate the modified stochastic gradient
         v^s_t = ∇f_{i^s_t}(w^s_{t−1}) − ∇f_{i^s_t}(w̃) + ∇f(w̃),
       i.e., the single gradient at w^s_{t−1}, minus the single gradient at w̃, plus the full gradient at w̃.
7:     Update w^s_t = w^s_{t−1} − α v^s_t.
8:   end for
9: end for
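A minimal NumPy rendition of the loop above, on a noise-free least-squares toy problem (problem setup and constants are my own, not from the talk):

```python
import numpy as np

# f_i(w) = 0.5 * (a_i^T w - b_i)^2, f(w) = (1/n) * sum_i f_i(w), noise-free.
rng = np.random.default_rng(1)
n, d = 500, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def grad_fi(w, i):
    return A[i] * (A[i] @ w - b[i])      # single gradient

def full_grad(w):
    return A.T @ (A @ w - b) / n         # full gradient of f

alpha, m = 0.02, 2 * n                   # fixed step-size, inner-loop length
w = np.zeros(d)
for s in range(30):                      # outer loop
    w_tilde = w.copy()                   # store w~
    g_tilde = full_grad(w_tilde)         # store the full gradient at w~
    for t in range(m):                   # inner loop
        i = rng.integers(n)
        # modified (variance-reduced) stochastic gradient v_t^s
        v = grad_fi(w, i) - grad_fi(w_tilde, i) + g_tilde
        w = w - alpha * v

w_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

Because E_i[grad_fi(w, i) − grad_fi(w_tilde, i)] = ∇f(w) − ∇f(w̃), the correction keeps v unbiased while its variance vanishes as both w and w̃ approach the solution, which is what permits the fixed step-size.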
8. Stochastic variance reduced gradient (SVRG) (3)
[Johnson and Zhang, 2013]
9. Structured problems
Examples
PCA problem: calculate the projection matrix U that minimizes

min_{U ∈ St(r,d)} (1/n) Σ_{i=1}^{n} ||x_i − U U^T x_i||²_2,
where U belongs to the Stiefel manifold St(r, d):
the set of d × r matrices with orthonormal columns, i.e., U^T U = I.
⇓
The cost function remains unchanged under the orthogonal group action U → UO for O ∈ O(r).
⇓
U belongs to the Grassmann manifold Gr(r, d):
the set of r-dimensional linear subspaces in R^d, each represented by a matrix U with orthonormal columns, U^T U = I.
Other examples (not exhaustive):
matrix completion, subspace tracking, spectral clustering, CCA, bi-factor regression, ...
10. Optimization on Riemannian manifolds
[Absil et al., 2008]
If the constraints can be described by a manifold, the constrained problem is viewed as an unconstrained problem on the manifold:

min_{w ∈ R^n} f(w),  s.t. c_i(w) = 0, c_j(w) ≤ 0
⇓
min_{w ∈ M} f(w),  M: Riemannian manifold
11. Riemannian SGD (R-SGD) (1)
[Bonnabel, 2013]
Extension of Euclidean SGD to Riemannian manifolds.
Update in R-SGD:

w_k = Exp_{w_{k−1}}(−α_k grad f_{i_k}(w_{k−1})),

which moves along a geodesic (via the exponential mapping) in the direction of the negative Riemannian stochastic gradient:
1. Calculate a Riemannian stochastic gradient grad f_{i_k}(w_{k−1}) for the sample i_k at w_{k−1}.
2. Then, move along the geodesic from w_{k−1} in the direction of −grad f_{i_k}(w_{k−1}).
A geodesic is the generalization of a straight line in Euclidean space.
The exponential mapping Exp_w(·) specifies the geodesic.
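As a deliberately simple illustration of this update, here is R-SGD on the unit sphere, where the exponential mapping has the closed form Exp_w(v) = cos(‖v‖) w + sin(‖v‖) v/‖v‖. The cost (a stochastic leading-eigenvector problem) and all constants are my own choices; the Grassmann case follows the same pattern with the Grassmann-specific mappings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.standard_normal((n, d))
X[:, 0] *= 3.0                           # dominant direction along e_0

def sphere_exp(w, v):
    # Exponential mapping on the unit sphere S^{d-1}; v is tangent at w.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return w
    return np.cos(nv) * w + np.sin(nv) * (v / nv)

def rgrad_fi(w, i):
    # Riemannian gradient of f_i(w) = -(x_i^T w)^2: Euclidean gradient
    # projected onto the tangent space {v : w^T v = 0}.
    g = -2.0 * X[i] * (X[i] @ w)
    return g - (g @ w) * w

w = np.ones(d) / np.sqrt(d)
for k in range(1, 5001):
    i = rng.integers(n)
    alpha = 0.01 / (1 + 1e-3 * k)        # decaying step-size
    w = sphere_exp(w, -alpha * rgrad_fi(w, i))
```

Every iterate stays exactly on the manifold, because the update moves along geodesics rather than taking a Euclidean step and projecting back.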
12. Riemannian SGD (R-SGD) (2)
[Bonnabel, 2013]
13. Proposal: Riemannian SVRG (R-SVRG)
[Kasai et al., 2016]
Propose a novel extension of SVRG in the Euclidean space to a Riemannian manifold search space.
The extension is not trivial.
Focus on the Grassmann manifold Gr(r, d);
the algorithm can be generalized to other compact Riemannian manifolds.
Notations
                              SVRG                           R-SVRG
Model parameter               w^s_{t−1} ∈ R^d                U^s_{t−1} ∈ Gr(r, d)
End point of outer loop       w̃ ∈ R^d                        Ũ ∈ Gr(r, d)
Stochastic gradient           ∇f_{i^s_t}(w^s_{t−1}) ∈ R^d    grad f_{i^s_t}(U^s_{t−1}) ∈ T_{U^s_{t−1}}Gr(r, d)
Modified stochastic gradient  v^s_t ∈ R^d                    ξ^s_t ∈ T_{U^s_{t−1}}Gr(r, d)
14. Proposal: Riemannian SVRG (R-SVRG)
Algorithm
A straightforward modification of the stochastic gradient, extending the SVRG case v^s_t = ∇f_{i^s_t}(w^s_{t−1}) − ∇f_{i^s_t}(w̃) + ∇f(w̃), would be

ξ^s_t = grad f_{i^s_t}(U^s_{t−1}) − grad f_{i^s_t}(Ũ) + grad f(Ũ).

This is meaningless because manifolds are not vector spaces: the gradients live in different tangent spaces.
⇓
Proposed modification
Transport the vectors at Ũ into the current tangent space at U^s_{t−1} by parallel translation along the geodesic γ, then add them:

ξ^s_t = grad f_{i^s_t}(U^s_{t−1}) + P^γ_{U^s_{t−1}←Ũ}(−grad f_{i^s_t}(Ũ) + grad f(Ũ)),

where P^γ_{U^s_{t−1}←Ũ} is the parallel-translation operator along γ.
The logarithm mapping gives the tangent vector of the geodesic γ.
15. Proposal: Riemannian SVRG (R-SVRG)
Conceptual illustration
16. Tools in Grassmann manifold
Exponential mapping in the direction of ξ ∈ T_{U(0)}Gr(r, d):

U(t) = [U(0)V  W] [cos tΣ; sin tΣ] V^T,

where ξ = WΣV^T is the rank-r singular value decomposition of ξ, and the cos(·) and sin(·) operations act only on the diagonal entries.

Parallel translation of ζ ∈ T_{U(0)}Gr(r, d) along the geodesic γ(t) with direction ξ:

ζ(t) = ([U(0)V  W] [−sin tΣ; cos tΣ] W^T + (I − WW^T)) ζ.

Logarithm mapping of U(t) at U(0):

ξ = Log_{U(0)}(U(t)) = W arctan(Σ)V^T,

where WΣV^T is the rank-r singular value decomposition of (U(t) − U(0)U(0)^T U(t))(U(0)^T U(t))^{−1}.
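These three operations admit a direct NumPy implementation; the function names and the self-check below are my own, not from the talk:

```python
import numpy as np

def grass_exp(U, xi, t=1.0):
    # Exponential mapping: follow the geodesic from U in the direction of xi,
    # using the thin SVD xi = W diag(S) V^T.
    W, S, Vt = np.linalg.svd(xi, full_matrices=False)
    return (U @ Vt.T @ np.diag(np.cos(t * S))
            + W @ np.diag(np.sin(t * S))) @ Vt

def grass_log(U0, U1):
    # Logarithm mapping: the tangent vector at U0 whose geodesic reaches U1 at t = 1.
    M = (U1 - U0 @ (U0.T @ U1)) @ np.linalg.inv(U0.T @ U1)
    W, S, Vt = np.linalg.svd(M, full_matrices=False)
    return W @ np.diag(np.arctan(S)) @ Vt

def grass_ptransp(U, xi, zeta, t=1.0):
    # Parallel translation of zeta along the geodesic gamma(t) with direction xi.
    W, S, Vt = np.linalg.svd(xi, full_matrices=False)
    moved = (-U @ Vt.T @ np.diag(np.sin(t * S))
             + W @ np.diag(np.cos(t * S))) @ (W.T @ zeta)
    return moved + zeta - W @ (W.T @ zeta)

# Self-check: exp and log are inverse for small tangent vectors,
# and parallel translation is an isometry.
rng = np.random.default_rng(4)
d, r = 12, 3
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
Z = rng.standard_normal((d, r))
xi = Z - U @ (U.T @ Z)                   # project into the tangent space at U
xi *= 0.1 / np.linalg.norm(xi)           # keep within the injectivity radius
U1 = grass_exp(U, xi)
xi_rec = grass_log(U, U1)
zeta = rng.standard_normal((d, r))
zeta -= U @ (U.T @ zeta)                 # another tangent vector at U
zeta_t = grass_ptransp(U, xi, zeta)
```

With these pieces, the R-SVRG correction −grad f_i(Ũ) + grad f(Ũ) can be moved to the current tangent space as grass_ptransp(U_tilde, grass_log(U_tilde, U), correction), matching the formula on the previous slide.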
17. Main results: convergence analyses
Global convergence analysis with decaying step-sizes:
guarantees that the iterates globally converge to a critical point from any initialization point.
Local convergence rate analysis under a fixed step-size:
considers the rate in a neighborhood of a local minimum;
assumes that Lipschitz smoothness and the lower bound of the Hessian hold only in this neighborhood;
obtains a local linear convergence rate:

E[(dist(Ũ^s, U*))²] ≤ (4(1 + 8mα²β²) / (αm(σ − 14ηβ²))) E[(dist(Ũ^{s−1}, U*))²].
18. Proof sketch for local convergence rate
1. Assuming that the smallest eigenvalue of the Hessian of f is σ, obtain

f(z) ≥ f(w) + ⟨Exp^{−1}_w(z), grad f(w)⟩_w + (σ/2) ||Exp^{−1}_w(z)||²_w,  w, z ∈ U.  (1)

2. Obtain the variance of ξ^s_t from β-Lipschitz continuity:

E_{i^s_t}[||ξ^s_t||²] ≤ β²(14(dist(U^s_{t−1}, U*))² + 8(dist(Ũ^{s−1}, U*))²).  (2)

3. Obtain the expected decrease of the squared distance to the solution in the inner iteration from the lemma for a geodesic triangle in an Alexandrov space:

E_{i^s_t}[(dist(U^s_t, U*))²] − (dist(U^s_{t−1}, U*))² ≤ E_{i^s_t}[(dist(U^s_{t−1}, U^s_t))² + 2η⟨grad f(U^s_{t−1}), Exp^{−1}_{U^s_{t−1}}(U*)⟩_{U^s_{t−1}}].  (3)

4. Putting (1) and (2) into (3) and summing over the inner loop finally yields the decrease of the distance to the solution in the outer iteration.
19. Numerical comparisons
Experimental conditions
Compare R-SVRG with
1. R-SGD
2. R-SD (steepest descent) with backtracking line search
Step-size algorithms
1. fixed step-size
2. decaying step-sizes
3. hybrid step-sizes:
use decaying step-sizes for the first s_TH (= 5) epochs, then switch to a fixed step-size.
PCA problem
n = 10000, d = 20, and r = 5.
Evaluation metrics
Optimality gap: distance to the minimum loss obtained by Matlab's pca.
Norm of the gradient.
21. Conclusions and more information
Conclusions
Propose Riemannian SVRG (R-SVRG).
R-SVRG shows local linear convergence rate.
Numerical comparisons show the effectiveness of the algorithm.
More information
Full paper
H. Kasai, H. Sato, and B. Mishra, "Riemannian stochastic variance reduced gradient on Grassmann manifold," arXiv:1605.07367, May 2016 [Kasai et al., 2016]
Matlab code
https://bamdevmishra.com/codes/rsvrg/
Thank you for your attention.
22. References I
Absil, P.-A., Mahony, R., and Sepulchre, R. (2008).
Optimization Algorithms on Matrix Manifolds.
Princeton University Press.
Allen-Zhu, Z. and Hazan, E. (2016).
Variance reduction for faster non-convex optimization.
Technical report, arXiv preprint arXiv:1603.05643.
Allen-Zhu, Z. and Yan, Y. (2015).
Improved SVRG for non-strongly-convex or sum-of-non-convex objectives.
Technical report, arXiv preprint arXiv:1506.01972.
Bonnabel, S. (2013).
Stochastic gradient descent on Riemannian manifolds.
IEEE Trans. on Automatic Control, 58(9):2217–2229.
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014).
SAGA: A fast incremental gradient method with support for non-strongly convex
composite objectives.
In NIPS.
Garber, D. and Hazan, E. (2015).
Fast and simple PCA via convex optimization.
Technical report, arXiv preprint arXiv:1509.05647.
23. References II
Johnson, R. and Zhang, T. (2013).
Accelerating stochastic gradient descent using predictive variance reduction.
In NIPS, pages 315–323.
Kasai, H., Sato, H., and Mishra, B. (2016).
Riemannian stochastic variance reduced gradient on Grassmann manifold.
arXiv preprint: arXiv:1605.07367.
Mairal, J. (2015).
Incremental majorization-minimization optimization with application to large-scale machine learning.
SIAM J. Optim., 25(2):829–855.
Roux, N. L., Schmidt, M., and Bach, F. R. (2012).
A stochastic gradient method with an exponential convergence rate for finite
training sets.
In NIPS, pages 2663–2671.
Shalev-Shwartz, S. (2015).
SDCA without duality.
Technical report, arXiv preprint arXiv:1502.06177.
Shalev-Shwartz, S. and Zhang, T. (2012).
Proximal stochastic dual coordinate ascent.
Technical report, arXiv preprint arXiv:1211.2717.
24. References III
Shalev-Shwartz, S. and Zhang, T. (2013).
Stochastic dual coordinate ascent methods for regularized loss minimization.
JMLR, 14:567–599.
Shamir, O. (2015).
Fast stochastic algorithms for SVD and PCA: Convergence properties and
convexity.
Technical report, arXiv preprint arXiv:1507.08788.
Zhang, Y. and Xiao, L. (2014).
Stochastic primal-dual coordinate method for regularized empirical risk
minimization.
SIAM J. Optim., 24(4):2057–2075.