Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite, number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a compact manifold search space. To this end, we show the developments on the Grassmann manifold. The key challenges of averaging, addition, and subtraction of multiple gradients are addressed with notions like the logarithm mapping and parallel translation of vectors on the Grassmann manifold. We present a global convergence analysis of the proposed algorithm with a decaying step-size, and a local convergence rate analysis with a fixed step-size under some natural assumptions. The proposed algorithm is applied to a number of problems on the Grassmann manifold, like principal component analysis, low-rank matrix completion, and Karcher mean computation. In all these cases, the proposed algorithm outperforms the standard Riemannian stochastic gradient descent algorithm.
Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)
1. Riemannian stochastic variance reduced gradient
on Grassmann manifold
Hiroyuki Kasai†, Hiroyuki Sato§, and Bamdev Mishra††
†The University of Electro-Communications, Japan
§Tokyo University of Science, Japan
††Amazon Development Centre India, India
August 10, 2016
Riemannian stochastic variance reduced gradient on Grassmann manifold (all copyrights owned by Kasai, Sato, and Mishra) 1
2. Summary (our contributions)
Address the stochastic gradient descent (SGD) algorithm for the empirical risk minimization problem

min_{w ∈ R^d} (1/n) Σ_{i=1}^{n} f_i(w).

Particularly, structured problems on manifolds, i.e., w ∈ M.
Propose Riemannian SVRG (R-SVRG).
Extend SVRG from the Euclidean space to Riemannian manifolds.
Give two analyses:
Global convergence analysis, and
Local convergence rate analysis.
Show the effectiveness of R-SVRG through numerical comparisons.
3. Stochastic gradient descent (SGD) (1)
Update in SGD:

w_k = w_{k−1} − α_k ∇f_{i_k}(w_{k−1}),

where w_{k−1} is the current point, i_k is a randomly sampled index, and ∇f_{i_k}(w_{k−1}) is the single gradient for the i_k-th sample (= stochastic gradient).
The stochastic gradient is an unbiased estimate of the full gradient:

E[∇f_i(w)] = (1/n) Σ_{i=1}^{n} ∇f_i(w) = ∇f(w).
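On a toy least-squares instance, the update above reads as follows in NumPy (the data, loop length, and step-size schedule are my own illustrative choices, not from the talk):

```python
import numpy as np

# Toy instance: f_i(w) = 0.5 * (a_i^T w - b_i)^2, so f(w) = (1/n) * sum_i f_i(w).
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def grad_fi(w, i):
    # Single (stochastic) gradient for the i-th sample.
    return A[i] * (A[i] @ w - b[i])

w = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(n)                  # random sample i_k
    alpha = 0.1 / (1 + 1e-3 * k)         # decaying step-size alpha_k
    w = w - alpha * grad_fi(w, i)

w_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

Averaging grad_fi over all i recovers the full gradient, which is the unbiasedness property stated above.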
4. Stochastic gradient descent (SGD) (2)
Features compared with full gradient descent (FGD)
Pros: high scalability to large-scale data
Iteration complexity is independent of n.
FGD shows linear complexity in n.
Cons: slow convergence
Decaying step-sizes are needed for convergence, to avoid
big fluctuations around a solution due to a large step-size, and
too slow convergence due to a too small step-size.
⇓
Sub-linear convergence rate: E[f(w_k)] − f(w*) ∈ O(1/k).
FGD shows f(w_k) − f(w*) ∈ O(c^k) with 0 < c < 1.
5. Speeding up of SGD: Variance reduction technique
Accelerate the convergence rate of SGD
[Mairal, 2015, Roux et al., 2012,
Shalev-Shwartz and Zhang, 2012,
Shalev-Shwartz and Zhang, 2013, Defazio et al., 2014,
Zhang and Xiao, 2014].
Stochastic variance reduced gradient (SVRG)
[Johnson and Zhang, 2013]
linear convergence rate for strongly-convex functions.
Various variants
[Garber and Hazan, 2015] analyze the convergence rate for
SVRG when f is a convex function that is a sum of
non-convex (but smooth) terms.
[Shalev-Shwartz, 2015] obtains similar results.
[Allen-Zhu and Yan, 2015] further study the same case with
better convergence rates.
[Shamir, 2015] studies specifically the convergence properties
of the variance reduction PCA algorithm.
Very recently, [Allen-Zhu and Hazan, 2016] propose a variance
reduction method for faster non-convex optimization.
6. Stochastic variance reduced gradient (SVRG) (1)
[Johnson and Zhang, 2013]
Motivations:
Reduce the variance of stochastic gradients.
Unlike SAG, no need to store all past gradients.
Instead, additional calculations of gradients are allowed.
Basic idea: a hybrid algorithm of SGD and FGD.
Periodically, calculate and store a full gradient.
Every iteration, adjust the stochastic gradient v by the latest full gradient to reduce its variance.
⇓
Linear convergence rate:

E[f(w̃^s)] − f(w*) ≤ α^s (E[f(w̃^0)] − f(w*))
7. Stochastic variance reduced gradient (SVRG) (2)
Simplified algorithm of SVRG
1: Initial iterate w^0_0 ∈ R^d.
2: for s = 1, 2, . . . (outer loop) do
3:   Store w̃ = w^{s−1}_{m_{s−1}} (the last inner iterate).
4:   Store the full gradient ∇f(w̃).
5:   for t = 1, 2, . . . , m_s (inner loop) do
6:     Calculate the modified stochastic gradient
         v^s_t = ∇f_{i^s_t}(w^s_{t−1}) − ∇f_{i^s_t}(w̃) + ∇f(w̃),
       i.e., the single gradient at w^s_{t−1}, minus the single gradient at w̃, plus the full gradient at w̃.
7:     Update w^s_t = w^s_{t−1} − α v^s_t.
8:   end for
9: end for
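A minimal NumPy rendition of the loop above, on a noise-free least-squares toy problem (problem setup and constants are my own, not from the talk):

```python
import numpy as np

# f_i(w) = 0.5 * (a_i^T w - b_i)^2, f(w) = (1/n) * sum_i f_i(w), noise-free.
rng = np.random.default_rng(1)
n, d = 500, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)

def grad_fi(w, i):
    return A[i] * (A[i] @ w - b[i])      # single gradient

def full_grad(w):
    return A.T @ (A @ w - b) / n         # full gradient of f

alpha, m = 0.02, 2 * n                   # fixed step-size, inner-loop length
w = np.zeros(d)
for s in range(30):                      # outer loop
    w_tilde = w.copy()                   # store w~
    g_tilde = full_grad(w_tilde)         # store the full gradient at w~
    for t in range(m):                   # inner loop
        i = rng.integers(n)
        # modified (variance-reduced) stochastic gradient v_t^s
        v = grad_fi(w, i) - grad_fi(w_tilde, i) + g_tilde
        w = w - alpha * v

w_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

Because E_i[grad_fi(w, i) − grad_fi(w_tilde, i)] = ∇f(w) − ∇f(w̃), the correction keeps v unbiased while its variance vanishes as both w and w̃ approach the solution, which is what permits the fixed step-size.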
8. Stochastic variance reduced gradient (SVRG) (3)
[Johnson and Zhang, 2013]
9. Structured problems
Examples
PCA problem: calculate the projection matrix U that minimizes

min_{U ∈ St(r,d)} (1/n) Σ_{i=1}^{n} ||x_i − U U^T x_i||²_2,
where U belongs to the Stiefel manifold St(r, d):
the set of d × r matrices with orthonormal columns, i.e., U^T U = I.
⇓
The cost function remains unchanged under the orthogonal group action U → UO for O ∈ O(r).
⇓
U belongs to the Grassmann manifold Gr(r, d):
the set of r-dimensional linear subspaces in R^d, each represented by a matrix U with orthonormal columns, U^T U = I.
Other examples (not exhaustive):
matrix completion, subspace tracking, spectral clustering, CCA, bi-factor regression, ...
10. Optimization on Riemannian manifolds
[Absil et al., 2008]
If the constraints can be described by a manifold, the constrained problem is viewed as an unconstrained problem on the manifold:

min_{w ∈ R^n} f(w),  s.t. c_i(w) = 0, c_j(w) ≤ 0
⇓
min_{w ∈ M} f(w),  M: Riemannian manifold
11. Riemannian SGD (R-SGD) (1)
[Bonnabel, 2013]
Extension of Euclidean SGD to Riemannian manifolds.
Update in R-SGD:

w_k = Exp_{w_{k−1}}(−α_k grad f_{i_k}(w_{k−1})),

which moves along a geodesic (via the exponential mapping) in the direction of the negative Riemannian stochastic gradient:
1. Calculate a Riemannian stochastic gradient grad f_{i_k}(w_{k−1}) for the sample i_k at w_{k−1}.
2. Then, move along the geodesic from w_{k−1} in the direction of −grad f_{i_k}(w_{k−1}).
A geodesic is the generalization of a straight line in Euclidean space.
The exponential mapping Exp_w(·) specifies the geodesic.
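As a deliberately simple illustration of this update, here is R-SGD on the unit sphere, where the exponential mapping has the closed form Exp_w(v) = cos(‖v‖) w + sin(‖v‖) v/‖v‖. The cost (a stochastic leading-eigenvector problem) and all constants are my own choices; the Grassmann case follows the same pattern with the Grassmann-specific mappings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
X = rng.standard_normal((n, d))
X[:, 0] *= 3.0                           # dominant direction along e_0

def sphere_exp(w, v):
    # Exponential mapping on the unit sphere S^{d-1}; v is tangent at w.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return w
    return np.cos(nv) * w + np.sin(nv) * (v / nv)

def rgrad_fi(w, i):
    # Riemannian gradient of f_i(w) = -(x_i^T w)^2: Euclidean gradient
    # projected onto the tangent space {v : w^T v = 0}.
    g = -2.0 * X[i] * (X[i] @ w)
    return g - (g @ w) * w

w = np.ones(d) / np.sqrt(d)
for k in range(1, 5001):
    i = rng.integers(n)
    alpha = 0.01 / (1 + 1e-3 * k)        # decaying step-size
    w = sphere_exp(w, -alpha * rgrad_fi(w, i))
```

Every iterate stays exactly on the manifold, because the update moves along geodesics rather than taking a Euclidean step and projecting back.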
12. Riemannian SGD (R-SGD) (2)
[Bonnabel, 2013]
13. Proposal: Riemannian SVRG (R-SVRG)
[Kasai et al., 2016]
Propose a novel extension of SVRG in the Euclidean space to a Riemannian manifold search space.
The extension is not trivial.
Focus on the Grassmann manifold Gr(r, d);
the algorithm can be generalized to other compact Riemannian manifolds.
Notations
                              SVRG                           R-SVRG
Model parameter               w^s_{t−1} ∈ R^d                U^s_{t−1} ∈ Gr(r, d)
End point of outer loop       w̃ ∈ R^d                        Ũ ∈ Gr(r, d)
Stochastic gradient           ∇f_{i^s_t}(w^s_{t−1}) ∈ R^d    grad f_{i^s_t}(U^s_{t−1}) ∈ T_{U^s_{t−1}}Gr(r, d)
Modified stochastic gradient  v^s_t ∈ R^d                    ξ^s_t ∈ T_{U^s_{t−1}}Gr(r, d)
14. Proposal: Riemannian SVRG (R-SVRG)
Algorithm
A straightforward modification of the stochastic gradient, extending the SVRG case v^s_t = ∇f_{i^s_t}(w^s_{t−1}) − ∇f_{i^s_t}(w̃) + ∇f(w̃), would be

ξ^s_t = grad f_{i^s_t}(U^s_{t−1}) − grad f_{i^s_t}(Ũ) + grad f(Ũ).

This is meaningless because manifolds are not vector spaces: the gradients live in different tangent spaces.
⇓
Proposed modification
Transport the vectors at Ũ into the current tangent space at U^s_{t−1} by parallel translation along the geodesic γ, then add them:

ξ^s_t = grad f_{i^s_t}(U^s_{t−1}) + P^γ_{U^s_{t−1}←Ũ}(−grad f_{i^s_t}(Ũ) + grad f(Ũ)),

where P^γ_{U^s_{t−1}←Ũ} is the parallel-translation operator along γ.
The logarithm mapping gives the tangent vector of the geodesic γ.
15. Proposal: Riemannian SVRG (R-SVRG)
Conceptual illustration
16. Tools in Grassmann manifold
Exponential mapping in the direction of ξ ∈ T_{U(0)}Gr(r, d):

U(t) = [U(0)V  W] [cos tΣ; sin tΣ] V^T,

where ξ = WΣV^T is the rank-r singular value decomposition of ξ, and the cos(·) and sin(·) operations act only on the diagonal entries.

Parallel translation of ζ ∈ T_{U(0)}Gr(r, d) along the geodesic γ(t) with direction ξ:

ζ(t) = ([U(0)V  W] [−sin tΣ; cos tΣ] W^T + (I − WW^T)) ζ.

Logarithm mapping of U(t) at U(0):

ξ = Log_{U(0)}(U(t)) = W arctan(Σ)V^T,

where WΣV^T is the rank-r singular value decomposition of (U(t) − U(0)U(0)^T U(t))(U(0)^T U(t))^{−1}.
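These three operations admit a direct NumPy implementation; the function names and the self-check below are my own, not from the talk:

```python
import numpy as np

def grass_exp(U, xi, t=1.0):
    # Exponential mapping: follow the geodesic from U in the direction of xi,
    # using the thin SVD xi = W diag(S) V^T.
    W, S, Vt = np.linalg.svd(xi, full_matrices=False)
    return (U @ Vt.T @ np.diag(np.cos(t * S))
            + W @ np.diag(np.sin(t * S))) @ Vt

def grass_log(U0, U1):
    # Logarithm mapping: the tangent vector at U0 whose geodesic reaches U1 at t = 1.
    M = (U1 - U0 @ (U0.T @ U1)) @ np.linalg.inv(U0.T @ U1)
    W, S, Vt = np.linalg.svd(M, full_matrices=False)
    return W @ np.diag(np.arctan(S)) @ Vt

def grass_ptransp(U, xi, zeta, t=1.0):
    # Parallel translation of zeta along the geodesic gamma(t) with direction xi.
    W, S, Vt = np.linalg.svd(xi, full_matrices=False)
    moved = (-U @ Vt.T @ np.diag(np.sin(t * S))
             + W @ np.diag(np.cos(t * S))) @ (W.T @ zeta)
    return moved + zeta - W @ (W.T @ zeta)

# Self-check: exp and log are inverse for small tangent vectors,
# and parallel translation is an isometry.
rng = np.random.default_rng(4)
d, r = 12, 3
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
Z = rng.standard_normal((d, r))
xi = Z - U @ (U.T @ Z)                   # project into the tangent space at U
xi *= 0.1 / np.linalg.norm(xi)           # keep within the injectivity radius
U1 = grass_exp(U, xi)
xi_rec = grass_log(U, U1)
zeta = rng.standard_normal((d, r))
zeta -= U @ (U.T @ zeta)                 # another tangent vector at U
zeta_t = grass_ptransp(U, xi, zeta)
```

With these pieces, the R-SVRG correction −grad f_i(Ũ) + grad f(Ũ) can be moved to the current tangent space as grass_ptransp(U_tilde, grass_log(U_tilde, U), correction), matching the formula on the previous slide.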
17. Main results: convergence analyses
Global convergence analysis with decaying step-sizes:
guarantees that the iterates globally converge to a critical point from any initialization point.
Local convergence rate analysis under a fixed step-size:
considers the rate in a neighborhood of a local minimum;
assumes that Lipschitz smoothness and the lower bound of the Hessian hold only in this neighborhood;
obtains a local linear convergence rate:

E[(dist(Ũ^s, U*))²] ≤ (4(1 + 8mα²β²) / (αm(σ − 14ηβ²))) E[(dist(Ũ^{s−1}, U*))²].
18. Proof sketch for local convergence rate
1. Assuming that the smallest eigenvalue of the Hessian of f is σ, obtain

f(z) ≥ f(w) + ⟨Exp^{−1}_w(z), grad f(w)⟩_w + (σ/2) ||Exp^{−1}_w(z)||²_w,  w, z ∈ U.  (1)

2. Obtain the variance of ξ^s_t from β-Lipschitz continuity:

E_{i^s_t}[||ξ^s_t||²] ≤ β²(14(dist(U^s_{t−1}, U*))² + 8(dist(Ũ^{s−1}, U*))²).  (2)

3. Obtain the expected decrease of the squared distance to the solution in the inner iteration from the lemma for a geodesic triangle in an Alexandrov space:

E_{i^s_t}[(dist(U^s_t, U*))²] − (dist(U^s_{t−1}, U*))² ≤ E_{i^s_t}[(dist(U^s_{t−1}, U^s_t))² + 2η⟨grad f(U^s_{t−1}), Exp^{−1}_{U^s_{t−1}}(U*)⟩_{U^s_{t−1}}].  (3)

4. Putting (1) and (2) into (3) and summing over the inner loop finally yields the decrease of the distance to the solution in the outer iteration.
19. Numerical comparisons
Experimental conditions
Compare R-SVRG with
1. R-SGD
2. R-SD (steepest descent) with backtracking line search
Step-size algorithms
1. fixed step-size
2. decaying step-sizes
3. hybrid step-sizes:
use decaying step-sizes for the first s_TH (= 5) epochs, then switch to a fixed step-size.
PCA problem
n = 10000, d = 20, and r = 5.
Evaluation metrics
Optimality gap: distance to the minimum loss obtained by Matlab's pca.
Norm of the gradient.
21. Conclusions and more information
Conclusions
Propose Riemannian SVRG (R-SVRG).
R-SVRG shows local linear convergence rate.
Numerical comparisons show the effectiveness of the algorithm.
More information
Full paper
H. Kasai, H. Sato, and B. Mishra, "Riemannian stochastic variance reduced gradient on Grassmann manifold," arXiv:1605.07367, May 2016 [Kasai et al., 2016]
Matlab code
https://bamdevmishra.com/codes/rsvrg/
Thank you for your attention.
22. References I
Absil, P.-A., Mahony, R., and Sepulchre, R. (2008).
Optimization Algorithms on Matrix Manifolds.
Princeton University Press.
Allen-Zhu, Z. and Hazan, E. (2016).
Variance reduction for faster non-convex optimization.
Technical report, arXiv preprint arXiv:1603.05643.
Allen-Zhu, Z. and Yan, Y. (2015).
Improved SVRG for non-strongly-convex or sum-of-non-convex objectives.
Technical report, arXiv preprint arXiv:1506.01972.
Bonnabel, S. (2013).
Stochastic gradient descent on Riemannian manifolds.
IEEE Trans. on Automatic Control, 58(9):2217–2229.
Defazio, A., Bach, F., and Lacoste-Julien, S. (2014).
SAGA: A fast incremental gradient method with support for non-strongly convex
composite objectives.
In NIPS.
Garber, D. and Hazan, E. (2015).
Fast and simple PCA via convex optimization.
Technical report, arXiv preprint arXiv:1509.05647.
23. References II
Johnson, R. and Zhang, T. (2013).
Accelerating stochastic gradient descent using predictive variance reduction.
In NIPS, pages 315–323.
Kasai, H., Sato, H., and Mishra, B. (2016).
Riemannian stochastic variance reduced gradient on Grassmann manifold.
arXiv preprint: arXiv:1605.07367.
Mairal, J. (2015).
Incremental majorization-minimization optimization with application to large-scale machine learning.
SIAM J. Optim., 25(2):829–855.
Roux, N. L., Schmidt, M., and Bach, F. R. (2012).
A stochastic gradient method with an exponential convergence rate for finite
training sets.
In NIPS, pages 2663–2671.
Shalev-Shwartz, S. (2015).
SDCA without duality.
Technical report, arXiv preprint arXiv:1502.06177.
Shalev-Shwartz, S. and Zhang, T. (2012).
Proximal stochastic dual coordinate ascent.
Technical report, arXiv preprint arXiv:1211.2717.
24. References III
Shalev-Shwartz, S. and Zhang, T. (2013).
Stochastic dual coordinate ascent methods for regularized loss minimization.
JMLR, 14:567–599.
Shamir, O. (2015).
Fast stochastic algorithms for SVD and PCA: Convergence properties and
convexity.
Technical report, arXiv preprint arXiv:1507.08788.
Zhang, Y. and Xiao, L. (2014).
Stochastic primal-dual coordinate method for regularized empirical risk
minimization.
SIAM J. Optim., 24(4):2057–2075.