This document summarizes a new method for solving regularized empirical risk minimization problems in mini-batch settings. The proposed method, called Doubly Accelerated Stochastic Variance Reduced Gradient, combines inner and outer acceleration to improve the mini-batch efficiency of previous methods like SVRG and AccProxSVRG. It achieves this by applying Nesterov's acceleration both within and across iterations of the AccProxSVRG algorithm. Numerical experiments demonstrate that the new method requires a smaller mini-batch size to achieve a given optimization error compared to prior methods.
1. Doubly Accelerated
Stochastic Variance Reduced Gradient Methods
for Regularized Empirical Risk Minimization
Tomoya Murata†, Taiji Suzuki‡§¶
†NTT DATA Mathematical Systems Inc., ‡The University of Tokyo, §RIKEN, ¶PRESTO
Jan. 13, 2018
2. This Presentation
Murata and Suzuki:
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
Method for Regularized Empirical Risk Minimization, NIPS 2017
+ some extensions
3. Overview
What:
New methods for solving convex composite optimization in
mini-batch settings
Main result:
Improvement of the mini-batch efficiency of previous methods
− Mini-batch efficiency: we say that A is more mini-batch efficient than B if the mini-batch size A needs to achieve a given iteration complexity is smaller than the one B needs.
− Iteration complexity: the number of parameter updates needed to achieve a desired optimization error.
5. Smoothness
Definition:
We say that f : ℝ^d → ℝ is (L, ℓ)-smooth (L > 0) if, for all x, y,
−(ℓ/2)∥x − y∥² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)∥x − y∥².
− Lower smoothness ℓ ≤ 0 implies (strong) convexity of f
− Lower smoothness ℓ > 0 allows non-convexity of f
6. Convex Composite Optimization
Focus of this presentation:
min_{x∈ℝ^d} { P(x) := F(x) + R(x) := (1/n) ∑_{i=1}^n f_i(x) + R(x) }
F: (L, −µ)-smooth (L > 0, µ > 0) (i.e., µ-strongly convex)
f_i: (L, ℓ)-smooth (L > 0, ℓ ≥ 0) (i.e., generally non-convex)
R: simple and (possibly) non-differentiable convex
7. Examples (ℓ = 0)
(a_1, b_1), . . . , (a_n, b_n) ∈ ℝ^d × ℝ: training set.
Lasso:
  f_i(x) := (1/2)(a_i⊤x − b_i)², R(x) := λ∥x∥₁
Elastic Net logistic regression:
  f_i(x) := log(1 + exp(−b_i a_i⊤x)) + (λ₂/2)∥x∥₂², R(x) := λ₁∥x∥₁
Support vector machines:
  f_i(x) := h̄^ν_i(a_i⊤x) + (λ/2)∥x∥₂², R(x) := 0
− h̄^ν_i: smoothed variant of the hinge loss h_i(u) := max{0, 1 − b_i u}
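As a concrete instance of the composite form above, here is a minimal NumPy sketch (illustrative, not from the slides) of the Elastic Net logistic regression pieces: the mini-batch gradient of the smooth part f_i and the proximal operator of R(x) = λ₁∥x∥₁, which is soft-thresholding. The function names and the data arrays A, y are assumptions of this sketch.

```python
import numpy as np

def logistic_grad(x, A, y, lam2, idx):
    """Mini-batch gradient of f_i(x) = log(1 + exp(-b_i a_i^T x)) + (lam2/2)||x||^2,
    averaged over the rows of A (features) and y (labels in {-1, +1}) given by idx."""
    Ai, yi = A[idx], y[idx]
    margins = yi * (Ai @ x)
    coeff = -yi / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-m)) w.r.t. m
    return Ai.T @ coeff / len(idx) + lam2 * x

def prox_l1(x, t):
    """prox of t * ||.||_1: soft-thresholding, used for R(x) = lam1 * ||x||_1
    with t = step_size * lam1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

All the variance-reduced methods below only touch the objective through such mini-batch gradient oracles and a prox step of this kind.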
8. Examples (ℓ > 0)
Recently, Carmon et al. (2016), Allen-Zhu and Li (2017) and Yu
et al. (2017) have proposed algorithms for finding second-order
stationary points of smooth non-convex objectives.
− x is an (ε, δ)-second-order stationary point of f
  :⇔ ∥∇f(x)∥₂ ≤ ε and ∇²f(x) ⪰ −δI
These algorithms are essentially based on two building blocks:
  finding a first-order stationary point
  finding a direction of the objective that has negative curvature
For exploiting negative curvature, these algorithms compute the minimum eigenvector of the Hessian.
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/
9. Fast eigenvector computation:
Recently, Garber et al. (2016) proposed a novel method for finding approximate eigenvectors using convex optimization.
Essential subproblem:
min_{z∈ℝ^d} { g(z) := (1/n) ∑_{i=1}^n g_i(z) := (1/n) ∑_{i=1}^n (1/2) z⊤(λ + ∇²f_i(x_0))z − ⟨y, z⟩ }
− λ > λ_min(∇²F(x_0)) is assumed
− z* = (λ + ∇²F(x_0))^{−1} y
g is (λ + λ_max(∇²F(x_0)), −(λ − λ_min(∇²F(x_0))))-smooth
g_i is (λ + λ_max(∇²f_i(x_0)), −(λ − λ_min(∇²f_i(x_0))))-smooth
Note that generally −(λ − λ_min(∇²f_i(x_0))) > 0, even though −(λ − λ_min(∇²F(x_0))) < 0.
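Note that the subproblem only touches each g_i through its gradient, which needs a Hessian-vector product rather than an explicit Hessian. A minimal sketch under that assumption (the oracle hvp_batch is hypothetical, not from the slides):

```python
import numpy as np

def g_grad(z, y, lam, hvp_batch, idx):
    """Mini-batch gradient of g(z) = (1/n) Σ [ (1/2) zᵀ(λ + ∇²f_i(x0))z − ⟨y, z⟩ ].
    hvp_batch(z, idx) is an assumed oracle returning (1/|idx|) Σ_{i∈idx} ∇²f_i(x0) z,
    e.g. computed with one extra backward pass, without ever forming the Hessian."""
    return lam * z + hvp_batch(z, idx) - y
```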
10. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
11. Relationships between Previous Work
[Diagram relating previous work: GD, SGD, AGD, SVRG, Katyusha, AccProxSVRG, UC + SVRG, Inexact PPA, Inexact APPA, and This Work, connected via the techniques of randomization, outer acceleration, variance reduction, inner acceleration, Universal Catalyst, and Katyusha momentum; today's focus is highlighted.]
12. Relationships between Previous Work
[Same relationship diagram as slide 11, repeated as a section transition.]
13. SVRG [Johnson and Zhang (2013); Xiao and Zhang (2014)]
(Proximal) Stochastic Variance Reduced Gradient
= SGD + Variance Reduction
SVRG(x_0, η, m, b, S)
  Iterate the following for s = 1, 2, . . . , S:
    x_s = One Stage SVRG(x_{s−1}, η, m, b)
  Output: x_S.
One Stage SVRG(x_0, η, m, b)
  Iterate the following for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} with size b uniformly.
    v_k = (1/b) ∑_{i∈I_k} (∇f_i(x_{k−1}) − ∇f_i(x_0)) + ∇F(x_0).
    x_k = prox_{ηR}(x_{k−1} − ηv_k).
  Output: (1/m) ∑_{k=1}^m x_k.
14. v_k = ∇f_{I_k}(x_{k−1}) − ∇f_{I_k}(x_0) + ∇F(x_0)
Main Idea: Usage of v_k as an unbiased estimator of ∇F(x_{k−1})
− V[v_k] → 0 as x_{k−1}, x_0 → x*
− Computational cost per inner iteration is the same as SGD's
[Figure: the variance-reduced gradient v_k built from ∇F(x_0), ∇f_{I_k}(x_0), and ∇f_{I_k}(x_{k−1}) at the initial point x_0, the current point x_{k−1}, and the next point x_k.]
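A minimal, self-contained NumPy sketch of the One Stage SVRG update above, specialized to R(x) = λ∥x∥₁ so that prox_{ηR} is soft-thresholding; the gradient oracles grad_batch and full_grad are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def one_stage_svrg(x0, grad_batch, full_grad, n, eta, m, b, lam, rng):
    """One Stage SVRG (slide 13) with R = lam * ||.||_1.
    grad_batch(x, idx): averaged gradient of f_i over idx; full_grad(x): ∇F(x)."""
    g_snap = full_grad(x0)                 # ∇F(x0): one full pass per stage
    x = x0.copy()
    x_avg = np.zeros_like(x0)
    for _ in range(m):
        idx = rng.choice(n, size=b, replace=False)             # mini-batch I_k
        v = grad_batch(x, idx) - grad_batch(x0, idx) + g_snap  # variance-reduced gradient
        x = soft_threshold(x - eta * v, eta * lam)             # prox_{ηR} gradient step
        x_avg += x
    return x_avg / m
```

Running s = 1, . . . , S stages, each started from the previous stage's output, reproduces the outer loop of SVRG.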
15. Comparisons of Iteration Complexities:

          ℓ = 0                            ℓ ≥ 0
SGD       O(L/ε + 1/(bµε))                 O(L/ε + 1/(bµε))
SVRG      O((n/b + L/µ) log(1/ε))          O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i, b: mini-batch size, ε: optimization error
− Linear convergence
− Limit in mini-batch settings: SVRG requires at least O((L/µ) log(1/ε)) iterations for any mini-batch size b
Questions:
  Is the mini-batch efficiency of SVRG improvable?
  Can SVRG be accelerated by Nesterov's method?
16. Relationships between Previous Work
[Same relationship diagram as slide 11, repeated as a section transition.]
17. AccProxSVRG [Nitanda (2014)]
Accelerated Proximal SVRG = SVRG + Inner Acceleration
AccProxSVRG(x_0, η, β, m, b, S)
  Iterate the following for s = 1, 2, . . . , S:
    x_s = One Stage AccProxSVRG(x_{s−1}, η, β, m, b).
  Output: x_S.
One Stage AccProxSVRG(x_0, η, β, m, b)
  Iterate the following for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} with size b uniformly.
    y_k = x_{k−1} + β(x_{k−1} − x_{k−2}).
    v_k = (1/b) ∑_{i∈I_k} (∇f_i(y_k) − ∇f_i(x_0)) + ∇F(x_0).
    x_k = prox_{ηR}(y_k − ηv_k).
  Output: x_m.
18. y_k = x_{k−1} + β(x_{k−1} − x_{k−2})
Main Idea: Usage of Nesterov's momentum in each inner iteration
[Figure: the momentum step from the previous points x_{k−2}, y_{k−1} through the current x_{k−1}, y_k to the next point x_k.]
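The only change relative to the SVRG sketch above is the momentum point y_k at which the variance-reduced gradient is evaluated. A minimal sketch under the same assumptions (ℓ1 regularizer, assumed gradient oracles):

```python
import numpy as np

def one_stage_acc_prox_svrg(x0, grad_batch, full_grad, n, eta, beta, m, b, lam, rng):
    """One Stage AccProxSVRG (slide 17): the SVRG inner loop + Nesterov momentum."""
    soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)  # prox of t*||.||_1
    g_snap = full_grad(x0)
    x_prev, x = x0.copy(), x0.copy()          # x_{k-2}, x_{k-1}
    for _ in range(m):
        y = x + beta * (x - x_prev)           # inner momentum step
        idx = rng.choice(n, size=b, replace=False)
        v = grad_batch(y, idx) - grad_batch(x0, idx) + g_snap  # VR gradient at y
        x_prev, x = x, soft(y - eta * v, eta * lam)
    return x
```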
19. Comparisons of Iteration Complexities:

              ℓ = 0                                    ℓ ≥ 0
SVRG          O((n/b + L/µ) log(1/ε))                  O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
AccProxSVRG   O((n/b + L/(bµ) + √(L/µ)) log(1/ε))      No analysis

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i, b: mini-batch size, ε: optimization error
− Linear speed-up w.r.t. mini-batch size b: L/µ (SVRG) → L/(bµ) (AccProxSVRG)
− No acceleration in non-mini-batch settings: the rate of AccProxSVRG is the same as that of SVRG when b = 1
Question:
  Is the identical rate of AccProxSVRG and SVRG in non-mini-batch settings improvable?
20. Relationships between Previous Work
[Same relationship diagram as slide 11, repeated as a section transition.]
21. Universal Catalyst [Lin et al. (2015)]
Universal Catalyst: a generic acceleration framework
Given a non-accelerated algorithm M (for example, SVRG),
UC(x̌_0, κ, {β_t}, {ε_t}, T)
  Iterate the following for t = 1, 2, . . . , T:
    y̌_t = x̌_{t−1} + β_t(x̌_{t−1} − x̌_{t−2}).
    Define G_t(x) = P(x) + (κ/2)∥x − y̌_t∥₂².
    x̌_t ≈ argmin_{x∈ℝ^d} G_t(x) s.t. G_t(x̌_t) − G_t* ≤ ε_t, solved by M.
  Output: x̌_T.
Main Idea: Running IAPPA and solving each subproblem by M
− UC can be regarded as an application of Inexact Accelerated PPA (PPA: Proximal Point Algorithm).
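A minimal sketch of the Catalyst outer loop above; the inner solver is abstracted as a function that approximately minimizes G_t, and the momentum schedule betas is left to the caller (both are assumptions of this sketch):

```python
def universal_catalyst(x0, solve_subproblem, betas):
    """Generic Catalyst / inexact accelerated PPA outer loop (slide 21).
    solve_subproblem(y): approximate argmin_x P(x) + (κ/2)||x − y||², run by
    the inner method M (e.g. SVRG) up to the required accuracy ε_t."""
    x_prev, x = x0, x0
    for beta in betas:                     # t = 1, 2, ..., T
        y = x + beta * (x - x_prev)        # outer momentum on the prox-point path
        x_prev, x = x, solve_subproblem(y)
    return x
```

The practicality issues listed on the next slide (stopping criteria for each subproblem, many tuning parameters) live inside solve_subproblem.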
22. Comparisons of Iteration Complexities:

           ℓ = 0                              ℓ ≥ 0
SVRG       O((n/b + L/µ) log(1/ε))            O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
UC+SVRG    O((n/b + √(nL/(bµ))) log(1/ε))     O((n/b + √(nL/(bµ)) + (n^{3/4}/b)·√((Lℓ)^{1/2}/µ)) log(1/ε))

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i, b: mini-batch size, ε: optimization error, O hides extra log-factors
− Accelerated rate: L/µ (SVRG) → √(nL/(bµ)) (UC+SVRG)
− Sublinear speed-up w.r.t. mini-batch size b: not sufficient
− Katyusha also achieves the same rate
Practicality:
  Hardness of tuning the stopping criteria of the subproblems
  Many tuning parameters
Question:
  Is the dependency on mini-batch size b improvable?
23. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
24. Core Ideas
Double acceleration:
Combining Inner Acceleration and Outer Acceleration
Two approaches:
Applying UC to AccProxSVRG
Directly applying Nesterov’s acceleration to the outer iterations
of AccProxSVRG
The latter algorithm is more direct and practical.
25. Proposed Algorithm
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
= (SVRDA + Inner Acceleration) + Outer Acceleration
DASVRDA^sc(x̌_0, η, m, b, S, T)
  Iterate the following for t = 1, 2, . . . , T:
    x̌_t = DASVRDA^ns(x̌_{t−1}, η, m, b, S).
  Output: x̌_T.
DASVRDA^ns(x_0, η, m, b, S)
  Iterate the following for s = 1, 2, . . . , S:
    y_s = x_{s−1} + ((s−1)/(s+2))(x_{s−1} − x_{s−2}) + ((s+1)/(s+2))(z_{s−1} − x_{s−1}).
    (x_s, z_s) = One Stage AccSVRDA(y_s, x_{s−1}, η, m, b).
  Output: x_S.
26. One Stage AccSVRDA(x_0, x, η, m, b)
  Iterate the following for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} with size b uniformly.
    y_k = x_{k−1} + ((k−1)/(k+1))(x_{k−1} − x_{k−2}).
    v_k = (1/b) ∑_{i∈I_k} (∇f_i(y_k) − ∇f_i(x)) + ∇F(x).
    v̄_k = (1 − 2/(k+1)) v̄_{k−1} + (2/(k+1)) v_k.
    z_k = prox_{(ηk(k+1)/4)R}(x_0 − (ηk(k+1)/4) v̄_k).
    x_k = (1 − 2/(k+1)) x_{k−1} + (2/(k+1)) z_k.
  Output: (x_m, z_m).
Main Idea: Combining Inner Acceleration and Outer Acceleration
− For outer acceleration, adding the new momentum term ((s+1)/(s+2))(z_{s−1} − x_{s−1})
− AccSVRDA = AccSDA [Xiao (2009)] + Variance Reduction
Why SVRDA rather than SVRG?
− Only because lazy updates can be constructed for AccSVRDA.
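A minimal NumPy transcription of the two displayed loops, again specialized to R = λ∥·∥₁ with assumed gradient oracles; the initializations x_{−1} = z_0 = x_0 are also assumptions. This is an illustrative sketch of the pseudocode, not the authors' implementation (which would additionally use the lazy updates mentioned above):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)   # prox of t*||.||_1

def one_stage_acc_svrda(x0, x_snap, grad_batch, full_grad, n, eta, m, b, lam, rng):
    """One Stage AccSVRDA (slide 26): inner momentum + dual averaging + VR."""
    g_snap = full_grad(x_snap)
    x_prev, x = x0.copy(), x0.copy()                  # x_{k-2}, x_{k-1}
    v_bar = np.zeros_like(x0)
    z = x0.copy()
    for k in range(1, m + 1):
        idx = rng.choice(n, size=b, replace=False)    # mini-batch I_k
        y = x + (k - 1) / (k + 1) * (x - x_prev)      # inner momentum
        v = grad_batch(y, idx) - grad_batch(x_snap, idx) + g_snap
        v_bar = (1 - 2 / (k + 1)) * v_bar + 2 / (k + 1) * v   # averaged dual variable
        step = eta * k * (k + 1) / 4
        z = soft_threshold(x0 - step * v_bar, step * lam)     # prox step anchored at x0
        x_prev, x = x, (1 - 2 / (k + 1)) * x + 2 / (k + 1) * z
    return x, z

def dasvrda_ns(x0, grad_batch, full_grad, n, eta, m, b, lam, S, rng):
    """DASVRDA^ns (slide 25): outer momentum wrapped around One Stage AccSVRDA."""
    x_prev, x, z = x0.copy(), x0.copy(), x0.copy()
    for s in range(1, S + 1):
        y = (x + (s - 1) / (s + 2) * (x - x_prev)
               + (s + 1) / (s + 2) * (z - x))         # doubly-accelerated momentum
        snapshot, x_prev = x, x
        x, z = one_stage_acc_svrda(y, snapshot, grad_batch, full_grad,
                                   n, eta, m, b, lam, rng)
    return x
```

DASVRDA^sc then simply restarts dasvrda_ns T times, feeding each output back in as the next starting point.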
28. Convergence Analysis (ℓ = 0)
Theorem (ℓ = 0)
Assume that F is (L, −µ)-smooth and f_i is (L, 0)-smooth. If we appropriately choose η = O(1/((1 + n/b²)L)), S = O(1 + (b/n)√(L/µ) + √(L/(nµ))), and T = O(log(1/ε)), then DASVRDA^sc achieves an iteration complexity of
O((n/b + (1/b)√(nL/µ) + √(L/µ)) log(1/ε))
for E[P(x̌_T) − P(x*)] ≤ ε.
− In contrast, AccProxSVRG: O((n/b + L/(bµ) + √(L/µ)) log(1/ε)), and UC + SVRG: O((n/b + √(nL/(bµ))) log(1/ε)).
29. Extension to ℓ ≥ 0
For generalizing our results to the case ℓ ≥ 0, we adopt the UC + AccProxSVRG approach.
− For a theoretical guarantee, non-trivial modifications to the AccProxSVRG algorithm are needed.
UC + AccProxSVRG achieves
O((n/b + (1/b)√(nL/µ) + (n^{3/4}/b)·√((Lℓ)^{1/2}/µ)) log(1/ε)).
− In contrast, UC + SVRG only achieves
O((n/b + √(nL/(bµ)) + (n^{3/4}/b)·√((Lℓ)^{1/2}/µ)) log(1/ε)).
30. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
31. Experimental Settings
Model: Elastic Net logistic regression
− Regularization parameters: (λ₁, λ₂) = (10^{−4}, 10^{−6}), (0, 10^{−6})
− µ = 10^{−6}, ℓ = 0
Data sets and mini-batch sizes:

Data set   n        d        b
a9a        32,561   123      180
rcv1       20,242   47,236   140
sido0      12,678   4,932    100

Implemented algorithms: SVRG, UC+SVRG, AccProxSVRG, UC+AccProxSVRG, APCG (dual), Katyusha, DASVRDA, and DASVRDA with heuristic adaptive restart
35. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
36. Summary
Conclusion:
New methods for solving convex composite optimization in mini-batch settings
− Improvement of the mini-batch efficiency of previous methods
− Extension to sum-of-non-convex objectives
− Numerically outperforms the state-of-the-art methods
37. Reference I
Allen-Zhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In 48th Annual ACM Symposium on the Theory of Computing, pages 19–23.
Allen-Zhu, Z. and Li, Y. (2017). Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2016). Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756.
Garber, D., Hazan, E., Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., and Sidford, A. (2016). Faster eigenvector computation via shift-and-invert preconditioning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2626–2634.
38. Reference II
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.
Lin, H., Mairal, J., and Harchaoui, Z. (2015). A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28, pages 3384–3392.
Nitanda, A. (2014). Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems 27, pages 1574–1582.
Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 22, pages 2116–2124.
39. Reference III
Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057–2075.
Yu, Y., Zou, D., and Gu, Q. (2017). Saving gradient and negative curvature computations: Finding local minima more efficiently. arXiv preprint arXiv:1712.03950.