Optimization Algorithms in Machine Learning

Stephen Wright
University of Wisconsin-Madison

NIPS Tutorial, 6 Dec 2010
Optimization

     Optimization is going through a period of growth and revitalization,
     driven largely by new applications in many areas.
     Standard paradigms (LP, QP, NLP, MIP) are still important, along
     with general-purpose software, enabled by modeling languages that
     make the software easier to use.
     However, there is a growing emphasis on “picking and choosing”
     algorithmic elements to fit the characteristics of a given application
     — building up a suitable algorithm from a “toolkit” of components.
     It’s more important than ever to understand the fundamentals of
     algorithms as well as the demands of the application, so that good
     choices are made in matching algorithms to applications.

We present a selection of algorithmic fundamentals in this tutorial, with an
emphasis on those of current and potential interest in machine learning.

Topics




  I. First-order Methods
 II. Stochastic and Incremental Gradient Methods
III. Shrinking/Thresholding for Regularized Formulations
IV. Optimal Manifold Identification and Higher-Order Methods.
 V. Decomposition and Coordinate Relaxation




I. First-Order Methods

min f(x), with smooth convex f. Usually assume

    μI ⪯ ∇²f(x) ⪯ LI    for all x,

with 0 ≤ μ ≤ L. (L is thus a Lipschitz constant on the gradient ∇f.)
μ > 0 ⇒ strongly convex. Have

    f(y) − f(x) − ∇f(x)ᵀ(y − x) ≥ (1/2) μ ‖y − x‖².

(Mostly assume ‖·‖ := ‖·‖₂.) Define conditioning κ := L/μ.
Sometimes discuss convex quadratic f:

    f(x) = (1/2) xᵀAx,    where μI ⪯ A ⪯ LI.
What’s the Setup?



Assume in this part of the talk that we can evaluate f and ∇f at each iterate
x_i. But we are interested in extending to a broader class of problems:
     nonsmooth f ;
     f not available;
     only an estimate of the gradient (or subgradient) is available;
     impose a constraint x ∈ Ω for some simple set Ω (e.g. ball, box,
     simplex);
     a nonsmooth regularization term may be added to the objective f .
Focus on algorithms that can be adapted to these circumstances.




Gradient

    x_{k+1} = x_k − α_k ∇f(x_k),    for some α_k > 0.

Different ways to identify an appropriate α_k.
  1   Hard: Interpolating scheme with safeguarding to identify an
      approximate minimizing α_k.
  2   Easy: Backtracking. ᾱ, ½ᾱ, ¼ᾱ, ⅛ᾱ, ... until a sufficient decrease in
      f is obtained.
  3   Trivial: Don't test for function decrease. Use rules based on L and μ.
Traditional analysis for 1 and 2: Usually yields global convergence at
unspecified rate. The "greedy" strategy of getting good decrease from the
current search direction is appealing, and may lead to better practical
results.
Analysis for 3: Focuses on convergence rate, and leads to accelerated
multistep methods.
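As a concrete illustration of option 2, here is a minimal sketch of gradient descent with a halving backtracking line search. The Armijo constant, the quadratic test function, and the starting point are assumptions for illustration, not the tutorial's code.

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, alpha_bar=1.0, c=1e-4,
                                  max_iters=500, tol=1e-8):
    """Gradient descent with halving backtracking (option 2 above)."""
    x = x0.copy()
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha_bar
        # Try alpha_bar, alpha_bar/2, alpha_bar/4, ... until the
        # sufficient-decrease (Armijo) condition holds.
        while f(x - alpha * g) > f(x) - c * alpha * g.dot(g):
            alpha *= 0.5
        x = x - alpha * g
    return x

# Assumed test problem: strictly convex quadratic f(x) = 0.5 x^T A x.
A = np.diag([1.0, 10.0, 100.0])          # condition number kappa = 100
f = lambda x: 0.5 * x.dot(A.dot(x))
grad = lambda x: A.dot(x)
print(gradient_descent_backtracking(f, grad, x0=np.ones(3)))  # near x* = 0
```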
Constant (Short) Steplength
By elementary use of Taylor's theorem, obtain

    f(x_{k+1}) ≤ f(x_k) − α_k (1 − (α_k/2) L) ‖∇f(x_k)‖².

For α_k ≡ 1/L, have

    f(x_{k+1}) ≤ f(x_k) − (1/(2L)) ‖∇f(x_k)‖².

It follows by elementary arguments (see e.g. Nesterov 2004) that

    f(x_{k+1}) − f(x*) ≤ 2L ‖x_0 − x*‖² / (k + 1).

The classic 1/k convergence rate!
By assuming μ > 0, can set α_k ≡ 2/(μ + L) and get a linear (geometric)
rate: much better than sublinear, in the long run.

    ‖x_k − x*‖² ≤ ((L − μ)/(L + μ))^{2k} ‖x_0 − x*‖² = (1 − 2/(κ + 1))^{2k} ‖x_0 − x*‖².
The 1/k² Speed Limit
Nesterov (2004) gives a simple example of a smooth function for which no
method that generates iterates of the form x_{k+1} = x_k − α_k ∇f(x_k) can
converge at a rate faster than 1/k², at least for its first n/2 iterations.
Note that x_{k+1} ∈ x_0 + span(∇f(x_0), ∇f(x_1), . . . , ∇f(x_k)).

    A = the n × n tridiagonal matrix with 2 on the diagonal and −1 on the
        first sub- and super-diagonals,        e1 = (1, 0, 0, . . . , 0)ᵀ,

and set f(x) = (1/2) xᵀAx − e1ᵀx. The solution has x*(i) = 1 − i/(n + 1).

If we start at x_0 = 0, each ∇f(x_k) has nonzeros only in its first k + 2
entries. Hence, x_{k+1}(i) = 0 for i = k + 3, k + 4, . . . , n. Can show

    f(x_k) − f* ≥ 3L ‖x_0 − x*‖² / (32 (k + 1)²).
Exact minimizing αk : Faster rate?

Take α_k to be the exact minimizer of f along −∇f(x_k). Does this yield a
better rate of linear convergence?
Consider the convex quadratic f(x) = (1/2) xᵀAx. (Thus x* = 0 and
f(x*) = 0.) Here κ is the condition number of A.
We have ∇f(x_k) = Ax_k. Exact minimizing α_k:

    α_k = (x_kᵀ A² x_k)/(x_kᵀ A³ x_k) = arg min_α (1/2) (x_k − αAx_k)ᵀ A (x_k − αAx_k),

which is in the interval [1/L, 1/μ]. Can show that

    f(x_k) − f(x*) ≤ (1 − 2/(κ + 1))^{2k} [f(x_0) − f(x*)].

No improvement in the linear rate over constant steplength.
Multistep Methods: Heavy-Ball

Enhance the search direction by including a contribution from the previous
step.
Consider first constant step lengths:

    x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1}).

Analyze by defining a composite iterate vector:

    w_k := (x_k − x*, x_{k−1} − x*).

Thus w_{k+1} = B w_k + o(‖w_k‖), where B is the block 2 × 2 matrix

    B := [ −α ∇²f(x*) + (1 + β)I    −βI ]
         [              I             0  ].
B has the same eigenvalues as

    [ −αΛ + (1 + β)I    −βI ]
    [          I          0  ],        Λ = diag(λ_1, λ_2, . . . , λ_n),

where λ_i are the eigenvalues of ∇²f(x*). Choose α, β to explicitly
minimize the max eigenvalue of B, obtain

    α = 4 / (L (1 + 1/√κ)²),        β = (1 − 2/(√κ + 1))².

Leads to linear convergence for ‖x_k − x*‖ with rate approximately

    1 − 2/(√κ + 1).
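A minimal sketch of the constant-step heavy-ball iteration on a convex quadratic, using the α and β above; the test matrix, starting point, and iteration count are assumptions for illustration.

```python
import numpy as np

def heavy_ball_quadratic(A, x0, num_iters=200):
    """Heavy-ball: x_{k+1} = x_k - alpha*grad f(x_k) + beta*(x_k - x_{k-1})
    for f(x) = 0.5 x^T A x, with the alpha, beta suggested by the analysis above."""
    eigs = np.linalg.eigvalsh(A)
    L, mu = eigs.max(), eigs.min()
    kappa = L / mu
    alpha = 4.0 / (L * (1.0 + 1.0 / np.sqrt(kappa)) ** 2)
    beta = (1.0 - 2.0 / (np.sqrt(kappa) + 1.0)) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        grad = A.dot(x)                       # gradient of 0.5 x^T A x
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x

A = np.diag([1.0, 10.0, 100.0])               # assumed test problem, kappa = 100
print(heavy_ball_quadratic(A, np.ones(3)))    # approaches the minimizer x* = 0
```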
Summary: Linear Convergence, Strictly Convex f


Best steepest descent: Linear rate approx (1 − 2/κ);
Heavy-ball: linear rate approx (1 − 2/√κ).
Big difference! To reduce ‖x_k − x*‖ by a factor ε, need k large enough that

    (1 − 2/κ)^k ≤ ε     ⇐    k ≥ (κ/2) |log ε|      (steepest descent)

    (1 − 2/√κ)^k ≤ ε    ⇐    k ≥ (√κ/2) |log ε|     (heavy-ball)

A factor of √κ difference. e.g. if κ = 100, need 10 times fewer steps.
Conjugate Gradient
Basic step is

    x_{k+1} = x_k + α_k p_k,        p_k = −∇f(x_k) + γ_k p_{k−1}.

We can identify it with heavy-ball by setting β_k = α_k γ_k / α_{k−1}. However,
CG can be implemented in a way that doesn't require knowledge (or
estimation) of L and μ.
     Choose α_k to (approximately) minimize f along p_k;
     Choose γ_k by a variety of formulae (Fletcher-Reeves, Polak-Ribière,
     etc), all of which are equivalent if f is convex quadratic. e.g.

         γ_k = ‖∇f(x_k)‖² / ‖∇f(x_{k−1})‖².

There is a rich convergence theory for f quadratic, including asymptotic
linear convergence with rate approx 1 − 2/√κ. (Like heavy-ball.)
See e.g. Chap. 5 of Nocedal & Wright (2006) and refs therein.
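For the convex quadratic case, a minimal sketch of linear conjugate gradient with the γ_k formula above; the test matrix and right-hand side are assumptions for illustration.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iters=100):
    """Linear CG for f(x) = 0.5 x^T A x - b^T x, i.e. solving A x = b."""
    x = x0.copy()
    r = A.dot(x) - b                            # gradient of f at x
    p = -r                                      # initial search direction
    for _ in range(max_iters):
        if np.linalg.norm(r) < tol:
            break
        Ap = A.dot(p)
        alpha = r.dot(r) / p.dot(Ap)            # exact minimizer of f along p
        x = x + alpha * p
        r_new = r + alpha * Ap                  # updated gradient
        gamma = r_new.dot(r_new) / r.dot(r)     # Fletcher-Reeves formula
        p = -r_new + gamma * p
        r = r_new
    return x

A = np.diag([1.0, 10.0, 100.0])                 # assumed SPD test matrix
b = np.array([1.0, 2.0, 3.0])
print(conjugate_gradient(A, b, np.zeros(3)))    # approx A^{-1} b
```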
Accelerated First-Order Methods
Accelerate the rate to 1/k² for weakly convex, while retaining the linear
rate (related to √κ) for the strongly convex case.
Nesterov (1983, 2004) describes a method that requires knowledge of κ.
 0: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0.
 k: x_{k+1} ← y_k − (1/L) ∇f(y_k);   (*short-step gradient*)
    solve for α_{k+1} ∈ (0, 1):  α_{k+1}² = (1 − α_{k+1}) α_k² + α_{k+1}/κ;
    set β_k = α_k (1 − α_k)/(α_k² + α_{k+1});
    set y_{k+1} ← x_{k+1} + β_k (x_{k+1} − x_k).
Still works for weakly convex (κ = ∞).
FISTA (Beck & Teboulle 2007):
 0: Choose x_0; set y_1 = x_0, t_1 = 1;
 k: x_k ← y_k − (1/L) ∇f(y_k);
    t_{k+1} ← (1/2) (1 + sqrt(1 + 4 t_k²));
    y_{k+1} ← x_k + ((t_k − 1)/t_{k+1}) (x_k − x_{k−1}).
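A minimal sketch of the FISTA recursion above in the smooth case (ψ ≡ 0), where the inner step is just a short gradient step; the quadratic test problem and the value of L are assumptions for illustration.

```python
import numpy as np

def fista_smooth(grad, L, x0, num_iters=200):
    """Accelerated gradient / FISTA for smooth convex f (psi = 0):
    x_k = y_k - (1/L) grad f(y_k), with the t_k / y_k extrapolation above."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(num_iters):
        x = y - grad(y) / L                               # short gradient step from y_k
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)       # extrapolation
        x_prev, t = x, t_next
    return x_prev

A = np.diag([1.0, 10.0, 100.0])                  # assumed quadratic test problem
grad = lambda x: A.dot(x)                        # f(x) = 0.5 x^T A x
L = 100.0                                        # largest eigenvalue of A
print(fista_smooth(grad, L, np.ones(3)))         # approaches x* = 0
```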
[Figure: geometry of the accelerated iterates. Each y_{k+1} lies on the ray from x_k through x_{k+1}, beyond x_{k+1}; x_{k+2} is then a short gradient step from y_{k+1}, and so on.]
Convergence Results: Nesterov


If α_0 ≥ 1/√κ, have

    f(x_k) − f(x*) ≤ c_1 min( (1 − 1/√κ)^k , 4L/(√L + c_2 k)² ),

where constants c_1 and c_2 depend on x_0, α_0, L.
Linear convergence at "heavy-ball" rate in the strongly convex case, otherwise
1/k². FISTA also achieves the 1/k² rate.
Analysis: Not intuitive. Based on bounding the difference between f and a
quadratic approximation to it, at x*. FISTA analysis is 2-3 pages.
A Non-Monotone Gradient Method: Barzilai-Borwein
(Barzilai & Borwein 1988) BB is a gradient method, but with an unusual
choice of αk . Allows f to increase (sometimes dramatically) on some steps.
                                                                                         2
            xk+1 = xk − αk f (xk ),               αk := arg min sk − αzk                     ,
                                                                    α

where
                 sk := xk − xk−1 ,          zk :=        f (xk ) −      f (xk−1 ).
Explicitly, we have
                                                  T
                                                 sk zk
                                       αk =       T
                                                       .
                                                 zk zk
Note that for convex quadratic f = (1/2)x T Ax, we have
                                        T
                                       sk Ask
                               αk =     T
                                               ∈ [L−1 , µ−1 ].
                                      sk A2 sk
Hence, can view BB as a kind of quasi-Newton method, with the Hessian
                 −1
approximated by αk I .
 Stephen Wright (UW-Madison)     Optimization in Machine Learning       NIPS Tutorial, 6 Dec 2010   17 / 82
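A minimal sketch of the plain (nonmonotone) BB iteration with the s_kᵀz_k/z_kᵀz_k steplength; the quadratic test problem, first step, and iteration count are assumptions, and no safeguarding is included.

```python
import numpy as np

def barzilai_borwein(grad, x0, alpha0=1.0, num_iters=100):
    """Plain BB gradient method: x_{k+1} = x_k - alpha_k grad f(x_k),
    alpha_k = s^T z / z^T z with s = x_k - x_{k-1}, z = grad_k - grad_{k-1}."""
    x = x0.copy()
    g = grad(x)
    x_new = x - alpha0 * g                # first step uses a fixed alpha0
    for _ in range(num_iters):
        g_new = grad(x_new)
        s = x_new - x
        z = g_new - g
        if z.dot(z) == 0:                 # gradient unchanged: stop
            break
        alpha = s.dot(z) / z.dot(z)       # BB steplength (no monotonicity check)
        x, g = x_new, g_new
        x_new = x - alpha * g
    return x_new

A = np.diag([1.0, 10.0, 100.0])           # assumed convex quadratic test problem
grad = lambda x: A.dot(x)
print(barzilai_borwein(grad, np.ones(3)))  # approaches x* = 0
```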
Comparison: BB vs Greedy Steepest Descent

[Figure: per-iteration comparison of BB and exact ("greedy") steepest descent; not reproduced here.]
Many BB Variants
     can use α_k = s_kᵀs_k / s_kᵀz_k in place of α_k = s_kᵀz_k / z_kᵀz_k;
     alternate between these two formulae;
     calculate αk as above and hold it constant for 2, 3, or 5 successive
     steps;
     take αk to be the exact steepest descent step from the previous
     iteration.

Nonmonotonicity appears essential to performance. Some variants get
global convergence by requiring a sufficient decrease in f over the worst of
the last 10 iterates.
The original 1988 analysis in BB’s paper is nonstandard and illuminating
(just for a 2-variable quadratic).
In fact, most analyses of BB and related methods are nonstandard, and
consider only special cases. The precursor of such analyses is Akaike
(1959). More recently, see Ascher, Dai, Fletcher, Hager and others.
Primal-Dual Averaging

(see Nesterov 2009) Basic step:
    x_{k+1} = arg min_x  (1/(k+1)) Σ_{i=0}^{k} [ f(x_i) + ∇f(x_i)ᵀ(x − x_i) ] + (γ/√k) ‖x − x_0‖²

            = arg min_x  ḡ_kᵀ x + (γ/√k) ‖x − x_0‖²,

where ḡ_k := Σ_{i=0}^{k} ∇f(x_i)/(k + 1) — the averaged gradient.

     The last term is always centered at the first iterate x_0.
     Gradient information is averaged over all steps, with equal weights.
     γ is constant - results can be sensitive to this value.
     The approach still works for convex nondifferentiable f, where ∇f(x_i)
     is replaced by a vector from the subgradient ∂f(x_i).
Convergence Properties
Nesterov proves convergence for averaged iterates:
    x̄_{k+1} = (1/(k+1)) Σ_{i=0}^{k} x_i.

Provided the iterates and the solution x* lie within some ball of radius D
around x_0, we have

    f(x̄_{k+1}) − f(x*) ≤ C/√k,

where C depends on D, a uniform bound on ‖∇f(x)‖, and γ (coefficient
of the stabilizing term).
Note: There's averaging in both primal (x_i) and dual (∇f(x_i)) spaces.
Generalizes easily and robustly to the case in which only estimated
gradients or subgradients are available.
(Averaging smooths the errors in the individual gradient estimates.)
Extending to the Constrained Case: x ∈ Ω


How do these methods change when we require x ∈ Ω, with Ω closed and
convex?
Some algorithms and theory stay much the same, provided we can involve
Ω explicitly in the subproblems.
Example: Primal-Dual Averaging for min_{x∈Ω} f(x).

    x_{k+1} = arg min_{x∈Ω}  ḡ_kᵀ x + (γ/√k) ‖x − x_0‖²,

where ḡ_k := Σ_{i=0}^{k} ∇f(x_i)/(k + 1). When Ω is a box, this subproblem is
easy to solve.
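When Ω is a box the subproblem above separates by coordinate, so its solution is the unconstrained minimizer clipped to the box. A minimal sketch of one primal-dual averaging step under that assumption; the closed form is my derivation, and the data values are illustrative.

```python
import numpy as np

def dual_averaging_box_step(g_sum, k, x0, lo, hi, gamma=1.0):
    """One primal-dual averaging step over the box Omega = [lo, hi]^n.

    g_sum: sum of the gradients (or subgradients) at iterates x_0, ..., x_k.
    Solves  min_{x in Omega}  gbar^T x + (gamma/sqrt(k)) ||x - x0||^2
    in closed form: unconstrained minimizer, clipped componentwise."""
    gbar = g_sum / (k + 1)                       # averaged gradient
    c = gamma / np.sqrt(k)                       # coefficient of the prox term
    x_unconstrained = x0 - gbar / (2.0 * c)
    return np.clip(x_unconstrained, lo, hi)

# Tiny usage example (assumed data): two gradients accumulated, box [0,1]^3.
x0 = np.zeros(3)
g_sum = np.array([1.0, -2.0, 0.5]) + np.array([0.5, -1.0, 0.0])
print(dual_averaging_box_step(g_sum, k=1, x0=x0, lo=0.0, hi=1.0))
```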
Example: Nesterov's Constant Step Scheme for min_{x∈Ω} f(x). Requires
just one calculation to be changed from the unconstrained version.
 0: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0, q ← 1/κ = μ/L.
 k: x_{k+1} ← arg min_{y∈Ω} (1/2) ‖y − [y_k − (1/L) ∇f(y_k)]‖₂²;
    solve for α_{k+1} ∈ (0, 1):  α_{k+1}² = (1 − α_{k+1}) α_k² + q α_{k+1};
    set β_k = α_k (1 − α_k)/(α_k² + α_{k+1});
    set y_{k+1} ← x_{k+1} + β_k (x_{k+1} − x_k).

Convergence theory is unchanged.
Regularized Optimization (More Later)

FISTA can be applied with minimal changes to the regularized problem

    min_x  f(x) + τ ψ(x),

where f is convex and smooth, ψ convex and "simple" but usually
nonsmooth, and τ is a positive parameter.
Simply replace the gradient step by

    x_k = arg min_x  (L/2) ‖x − [y_k − (1/L) ∇f(y_k)]‖² + τ ψ(x).

(This is the "shrinkage" step; when ψ ≡ 0 or ψ = ‖·‖₁, it can be solved
cheaply.)
More on this later.
Further Reading



  1   Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course,
      Kluwer Academic Publishers, 2004.
  2   A. Beck and M. Teboulle, “Gradient-based methods with application to signal
      recovery problems,” in press, 2010. (See Teboulle’s web site).
  3   B. T. Polyak, Introduction to Optimization, Optimization Software Inc, 1987.
  4   J. Barzilai and J. M. Borwein, “Two-point step size gradient methods,” IMA
      Journal of Numerical Analysis, 8, pp. 141-148, 1988.
  5   Y. Nesterov, “Primal-dual subgradient methods for convex programs,”
      Mathematical Programming, Series B, 120, pp. 221-259, 2009.




II. Stochastic and Incremental Gradient Methods

Still deal with (weakly or strongly) convex f . But change the rules:
     Allow f nonsmooth.
     Can’t get function values f (x).
     At any feasible x, have access only to an unbiased estimate of an
     element of the subgradient ∂f .
Common settings are:

    f(x) = E_ξ F(x, ξ),

where ξ is a random vector with distribution P over a set Ξ. Also the
special case:

    f(x) = Σ_{i=1}^{m} f_i(x),

where each f_i is convex and nonsmooth.
Applications


This setting is useful for machine learning formulations. Given data
x_i ∈ Rⁿ and labels y_i = ±1, i = 1, 2, . . . , m, find w that minimizes

    τ ψ(w) + Σ_{i=1}^{m} ℓ(w; x_i, y_i),

where ψ is a regularizer, τ > 0 is a parameter, and ℓ is a loss. For linear
classifiers/regressors, have the specific form ℓ(wᵀx_i, y_i).
Example: SVM with hinge loss ℓ(wᵀx_i, y_i) = max(1 − y_i (wᵀx_i), 0) and
ψ = ‖·‖₁ or ψ = ‖·‖₂².

Example: Logistic regression: ℓ(wᵀx_i, y_i) = log(1 + exp(y_i wᵀx_i)). In the
regularized version may have ψ(w) = ‖w‖₁.
Subgradients


For each x in domain of f , g is a subgradient of f at x if

                  f (z) ≥ f (x) + g T (z − x),              for all z ∈ domf .

Right-hand side is a supporting hyperplane.
The set of subgradients is called the subdifferential, denoted by ∂f (x).
When f is differentiable at x, have ∂f(x) = {∇f(x)}.
We have strong convexity with modulus μ > 0 if

    f(z) ≥ f(x) + gᵀ(z − x) + (1/2) μ ‖z − x‖²,    for all x, z ∈ dom f and g ∈ ∂f(x).

Generalizes the assumption ∇²f(x) ⪰ μI made earlier for smooth
functions.
[Figure: a convex function f with several supporting hyperplanes drawn at a point x where f is not differentiable.]
“Classical” Stochastic Approximation


Denote by G(x, ξ) the subgradient estimate generated at x. For
unbiasedness need E_ξ G(x, ξ) ∈ ∂f(x).
Basic SA Scheme: At iteration k, choose ξ_k i.i.d. according to distribution
P, choose some α_k > 0, and set

    x_{k+1} = x_k − α_k G(x_k, ξ_k).

Note that x_{k+1} depends on all random variables up to iteration k, i.e.
ξ_[k] := {ξ_1, ξ_2, . . . , ξ_k}.
When f is strongly convex, the analysis of convergence of E(‖x_k − x*‖²) is
fairly elementary - see Nemirovski et al (2009).
Rate: 1/k
Define a_k = (1/2) E(‖x_k − x*‖²). Assume there is M > 0 such that
E(‖G(x, ξ)‖²) ≤ M² for all x of interest. Thus

    (1/2) ‖x_{k+1} − x*‖²
        = (1/2) ‖x_k − α_k G(x_k, ξ_k) − x*‖²
        = (1/2) ‖x_k − x*‖² − α_k (x_k − x*)ᵀ G(x_k, ξ_k) + (1/2) α_k² ‖G(x_k, ξ_k)‖².

Taking expectations, get

    a_{k+1} ≤ a_k − α_k E[(x_k − x*)ᵀ G(x_k, ξ_k)] + (1/2) α_k² M².

For the middle term, have

    E[(x_k − x*)ᵀ G(x_k, ξ_k)] = E_{ξ[k−1]} E_{ξ_k}[(x_k − x*)ᵀ G(x_k, ξ_k) | ξ_[k−1]]
                               = E_{ξ[k−1]} (x_k − x*)ᵀ g_k ,
... where
    g_k := E_{ξ_k}[G(x_k, ξ_k) | ξ_[k−1]] ∈ ∂f(x_k).

By strong convexity, have

    (x_k − x*)ᵀ g_k ≥ f(x_k) − f(x*) + (1/2) μ ‖x_k − x*‖² ≥ μ ‖x_k − x*‖².

Hence by taking expectations, we get E[(x_k − x*)ᵀ g_k] ≥ 2μ a_k. Then,
substituting above, we obtain

    a_{k+1} ≤ (1 − 2μ α_k) a_k + (1/2) α_k² M².

When
    α_k ≡ 1/(kμ),

a neat inductive argument (exercise!) reveals the 1/k rate:

    a_k ≤ Q/(2k),    for Q := max( ‖x_1 − x*‖², M²/μ² ).
But... What if we don’t know µ? Or if µ = 0?



The choice αk = 1/(kµ) requires strong convexity, with knowledge of the
modulus µ. An underestimate of µ can greatly degrade the performance of
the method (see example in Nemirovski et al. 2009).
Now describe a Robust Stochastic Approximation approach, which has a
rate 1/√k (in function value convergence), works for weakly convex
nonsmooth functions, and is not sensitive to choice of parameters in the
step length.
This is the approach that generalizes to mirror descent.
Robust SA


At iteration k:
     set xk+1 = xk − αk G (xk , ξk ) as before;
     set
         x̄_k = ( Σ_{i=1}^{k} α_i x_i ) / ( Σ_{i=1}^{k} α_i ).

For any θ > 0 (not critical), choose step lengths to be

         α_k = θ / (M √k).

Then f(x̄_k) converges to f(x*) in expectation with rate approximately
(log k)/k^{1/2}. The choice of θ is not critical.
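A minimal sketch of robust SA with the θ/(M√k) steps and weighted averaging above. The test objective (a sum of ℓ1 distances, whose minimizer is the coordinatewise median), the bound M, and θ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_sa(subgrad_estimate, x0, M, theta=1.0, num_iters=5000):
    """Robust SA: x_{k+1} = x_k - alpha_k G(x_k, xi_k), alpha_k = theta/(M sqrt(k)),
    returning the weighted average xbar = sum(alpha_i x_i) / sum(alpha_i)."""
    x = x0.copy()
    xbar_num = np.zeros_like(x0)
    alpha_sum = 0.0
    for k in range(1, num_iters + 1):
        alpha = theta / (M * np.sqrt(k))
        xbar_num += alpha * x              # accumulate alpha_i * x_i
        alpha_sum += alpha
        x = x - alpha * subgrad_estimate(x)
    return xbar_num / alpha_sum

# Assumed test problem: f(x) = (1/m) sum_i ||x - a_i||_1  (minimizer: coordinatewise median).
a = rng.normal(size=(200, 3))
def subgrad_estimate(x):
    i = rng.integers(len(a))               # unbiased: pick one term at random
    return np.sign(x - a[i])
M = np.sqrt(3.0)                           # ||sign(.)||_2 <= sqrt(n)
print(robust_sa(subgrad_estimate, x0=np.zeros(3), M=M))
print(np.median(a, axis=0))                # compare with the exact minimizer
```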
Analysis of Robust SA
The analysis is again elementary. As above (using i instead of k), have:

    α_i E[(x_i − x*)ᵀ g_i] ≤ a_i − a_{i+1} + (1/2) α_i² M².

By convexity of f, and g_i ∈ ∂f(x_i):

    f(x*) ≥ f(x_i) + g_iᵀ(x* − x_i),

thus
    α_i E[f(x_i) − f(x*)] ≤ a_i − a_{i+1} + (1/2) α_i² M²,

so by summing over i = 1, 2, . . . , k, telescoping, and using a_{k+1} > 0:

    Σ_{i=1}^{k} α_i E[f(x_i) − f(x*)] ≤ a_1 + (1/2) M² Σ_{i=1}^{k} α_i².
Thus dividing by Σ_{i=1}^{k} α_i:

    E[ ( Σ_{i=1}^{k} α_i f(x_i) ) / ( Σ_{i=1}^{k} α_i ) − f(x*) ] ≤ ( a_1 + (1/2) M² Σ_{i=1}^{k} α_i² ) / ( Σ_{i=1}^{k} α_i ).

By convexity, we have

    f(x̄_k) ≤ ( Σ_{i=1}^{k} α_i f(x_i) ) / ( Σ_{i=1}^{k} α_i ),

so obtain the fundamental bound:

    E[f(x̄_k) − f(x*)] ≤ ( a_1 + (1/2) M² Σ_{i=1}^{k} α_i² ) / ( Σ_{i=1}^{k} α_i ).
By substituting α_i = θ/(M√i), we obtain

    E[f(x̄_k) − f(x*)] ≤ ( a_1 + (1/2) θ² Σ_{i=1}^{k} 1/i ) / ( (θ/M) Σ_{i=1}^{k} 1/√i )

                      ≤ ( a_1 + θ² log(k + 1) ) / ( (θ/M) √k )

                      = M ( a_1/θ + θ log(k + 1) ) k^{−1/2}.

That's it!
Other variants: constant stepsizes α_k for a fixed "budget" of iterations;
periodic restarting; averaging just over the recent iterates. All can be
analyzed with the basic bound above.
Mirror Descent

The step from xk to xk+1 can be viewed as the solution of a subproblem:
                                                                       1
              xk+1 = arg min G (xk , ξk )T (z − xk ) +                    z − xk        2
                                                                                        2,
                               z                                      2αk
a linear estimate of f plus a prox-term. This provides a route to handling
constrained problems, regularized problems, alternative prox-functions.
For the constrained problem minx∈Ω f (x), simply add the restriction z ∈ Ω
to the subproblem above. In some cases (e.g. when Ω is a box), the
subproblem is still easy to solve.
We may use other prox-functions in place of (1/2) z − x 2 above. Such
                                                             2
alternatives may be particularly well suited to particular constraint sets Ω.
Mirror Descent is the term used for such generalizations of the SA
approaches above.

 Stephen Wright (UW-Madison)       Optimization in Machine Learning      NIPS Tutorial, 6 Dec 2010   38 / 82
Mirror Descent cont’d
Given constraint set Ω, choose a norm ‖·‖ (not necessarily Euclidean).
Define the distance-generating function ω to be a strongly convex function
on Ω with modulus 1 with respect to ‖·‖, that is,

    (ω'(x) − ω'(z))ᵀ (x − z) ≥ ‖x − z‖²,    for all x, z ∈ Ω,

where ω'(·) denotes an element of the subdifferential.
Now define the prox-function V(x, z) as follows:

    V(x, z) = ω(z) − ω(x) − ω'(x)ᵀ(z − x).

This is also known as the Bregman distance. We can use it in the
subproblem in place of (1/2) ‖·‖²:

    x_{k+1} = arg min_{z∈Ω}  G(x_k, ξ_k)ᵀ (z − x_k) + (1/α_k) V(x_k, z).
Bregman distance is the deviation from linearity:

[Figure: the function ω and its tangent at x; V(x, z) is the gap between ω(z) and that tangent, evaluated at z.]
Bregman Distances: Examples
For any Ω, we can use ω(x) := (1/2) ‖x − x̄‖₂², leading to prox-function
V(x, z) = (1/2) ‖x − z‖₂².
For the simplex Ω = {x ∈ Rⁿ : x ≥ 0, Σ_{i=1}^{n} x_i = 1}, we can use instead
the 1-norm ‖·‖₁ and choose ω to be the entropy function

    ω(x) = Σ_{i=1}^{n} x_i log x_i,

leading to Bregman distance

    V(x, z) = Σ_{i=1}^{n} z_i log(z_i / x_i).

These are the two most useful cases.
Convergence results for SA can be generalized to mirror descent.
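For the entropy prox-function on the simplex, the mirror-descent subproblem reduces to a multiplicative (exponentiated-gradient) update followed by renormalization. A minimal sketch under an assumed quadratic objective and step rule:

```python
import numpy as np

def mirror_descent_simplex_step(x, g, alpha):
    """One entropic mirror-descent step on the simplex:
    argmin_z  g^T (z - x) + (1/alpha) V(x, z),  V(x, z) = sum z_i log(z_i / x_i).
    Closed form: multiplicative update, then renormalize."""
    z = x * np.exp(-alpha * g)
    return z / z.sum()

# Assumed example: minimize f(x) = 0.5 ||x - c||^2 over the probability simplex,
# where c already lies on the simplex, so the minimizer is c itself.
c = np.array([0.1, 0.6, 0.2, 0.1])
x = np.full(4, 0.25)                      # start at the simplex center
for k in range(1, 1001):
    grad = x - c                          # exact gradient used as the estimate G
    x = mirror_descent_simplex_step(x, grad, alpha=1.0 / np.sqrt(k))
print(x)                                  # gradually approaches c
```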
Incremental Gradient
(See e.g. Bertsekas (2011) and references therein.) Finite sums:

    f(x) = Σ_{i=1}^{m} f_i(x).

Step k typically requires choice of one index i_k ∈ {1, 2, . . . , m} and
evaluation of ∇f_{i_k}(x_k). Components i_k are selected sometimes randomly or
cyclically. (The latter option does not exist in the setting f(x) := E_ξ F(x; ξ).)

     There are incremental versions of the heavy-ball method:

         x_{k+1} = x_k − α_k ∇f_{i_k}(x_k) + β (x_k − x_{k−1}).

     Approach like dual averaging: assume a cyclic choice of i_k, and
     approximate ∇f(x_k) by the average of ∇f_i(x) over the last m iterates:

         x_{k+1} = x_k − (α_k/m) Σ_{l=1}^{m} ∇f_{i_{k−l+1}}(x_{k−l+1}).
Achievable Accuracy

Consider the basic incremental method:

    x_{k+1} = x_k − α_k ∇f_{i_k}(x_k).

How close can f(x_k) come to f(x*) — deterministically (not just in
expectation)?
Bertsekas (2011) obtains results for constant steps α_k ≡ α:

    cyclic choice of i_k:     lim inf_{k→∞} f(x_k) ≤ f(x*) + α β m² c²,

    random choice of i_k:     lim inf_{k→∞} f(x_k) ≤ f(x*) + α β m c²,

where β is close to 1 and c is a bound on the Lipschitz constants for the ∇f_i.
(Bertsekas actually proves these results in the more general context of
regularized optimization - see below.)
Applications to SVM

SA techniques have an obvious application to linear SVM classification. In
fact, they were proposed in this context and analyzed independently by
researchers in the ML community for some years.
Codes: SGD (Bottou), PEGASOS (Shalev-Shwartz et al, 2007).
Tutorial: Stochastic Optimization for Machine Learning, tutorial by N.
Srebro and A. Tewari, ICML 2010, for many more details on the
connections between stochastic optimization and machine learning.
Related Work: Zinkevich (ICML, 2003) on online convex programming.
Aims to minimize the average of a sequence of convex functions, presented
sequentially. No i.i.d. assumption, regret-based analysis. Take steplengths
of size O(k^{−1/2}) in the gradient ∇f_k(x_k) of the latest convex function.
Average regret is O(k^{−1/2}).
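A simplified PEGASOS-style stochastic subgradient sketch for the hinge-loss SVM with ℓ2² regularization from the earlier slide, using the averaged-loss variant of the objective; the 1/(τk) step rule, toy data, and iteration count are assumptions for illustration, not the tutorial's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def svm_sgd(X, y, tau=0.1, num_iters=20000):
    """Stochastic subgradient method for
        (tau/2)||w||^2 + (1/m) sum_i max(1 - y_i w^T x_i, 0),
    with the 1/(tau*k) step used for strongly convex problems."""
    m, n = X.shape
    w = np.zeros(n)
    for k in range(1, num_iters + 1):
        i = rng.integers(m)                       # sample one term of the sum
        alpha = 1.0 / (tau * k)
        g = tau * w                               # subgradient of the regularizer
        if y[i] * X[i].dot(w) < 1.0:
            g = g - y[i] * X[i]                   # add hinge-loss subgradient
        w = w - alpha * g
    return w

# Assumed toy data: two Gaussian blobs with labels +1 / -1.
X = np.vstack([rng.normal(+1.0, 1.0, size=(100, 2)),
               rng.normal(-1.0, 1.0, size=(100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])
w = svm_sgd(X, y)
print(np.mean(np.sign(X.dot(w)) == y))            # training accuracy of the linear classifier
```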
Further Reading
  1   A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic
      approximation approach to stochastic programming,” SIAM Journal on
      Optimization, 19, pp. 1574-1609, 2009.
  2   D. P. Bertsekas, “Incremental gradient, subgradient, and proximal methods for
      convex optimization: A Survey,” Chapter 4 in Optimization and Machine Learning,
      upcoming volume edited by S. Nowozin, S. Sra, and S. J. Wright (2011).
  3   A. Juditsky and A. Nemirovski, “ First-order methods for nonsmooth convex
      large-scale optimization. I: General-purpose methods,” Chapter 5 in Optimization
      and Machine Learning (2011).
  4   A. Juditsky and A. Nemirovski, “First-order methods for nonsmooth convex
      large-scale optimization. II: Utilizing problem structure,” Chapter 6 in Optimization
      and Machine Learning (2011).
  5   O. L. Mangasarian and M. Solodov, “Serial and parallel backpropagation
      convergence via nonmonotone perturbed minimization,” Optimization Methods and
      Software 4 (1994), pp. 103–116.
  6   D. Blatt, A. O. Hero, and H. Gauchman, “A convergent incremental gradient
      method with constant step size,” SIAM Journal on Optimization 18 (2008), pp.
      29–51.
III. Shrinking/Thresholding for Regularized Optimization
In many applications, we seek not an exact minimizer of the underlying
objective, but rather an approximate minimizer that satisfies certain
desirable properties:
     sparsity (few nonzeros);
     low-rank (if a matrix);
     low “total-variation”;
     generalizability. (Vapnik: “...tradeoff between the quality of the
     approximation of the given data and the complexity of the
     approximating function.”)
“Desirable” properties depend on context and application.
A common way to obtain structured solutions is to modify the objective f
by adding a regularizer τ ψ(x), for some parameter τ > 0:

    min_x  f(x) + τ ψ(x).

Often want to solve for a range of τ values, not just one value in isolation.
Basics of Shrinking
Regularizer ψ is often nonsmooth but “simple.” Shrinking / thresholding
approach (a.k.a. forward-backward splitting) is useful if the problem is
easy to solve when f is replaced by a quadratic with diagonal Hessian:

    min_z  gᵀ(z − x) + (1/(2α)) ‖z − x‖₂² + τ ψ(z).

Equivalently,

    min_z  (1/(2α)) ‖z − (x − αg)‖₂² + τ ψ(z).

Define the shrinking operator as the arg min:

    S_τ(y, α) := arg min_z  (1/(2α)) ‖z − y‖₂² + τ ψ(z).

Typical algorithm:

    x_{k+1} = S_τ(x_k − α_k g_k, α_k),

with for example g_k = ∇f(x_k).
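For ψ = ‖·‖₁ the shrinking operator is the componentwise soft-threshold given on the next slide. A minimal sketch of that operator and of one shrinking (ISTA-type) step for min_x f(x) + τ‖x‖₁, under an assumed least-squares f and sparse ground truth:

```python
import numpy as np

def soft_threshold(y, t):
    """S_tau(y, alpha) for psi = ||.||_1 with threshold t = tau * alpha:
    componentwise sign(y) * max(|y| - t, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def shrink_step(x, grad_f, alpha, tau):
    """One forward-backward (shrinking) step: gradient step on f, then prox of tau*psi."""
    return soft_threshold(x - alpha * grad_f(x), tau * alpha)

# Assumed example: f(x) = 0.5 ||A x - b||^2, so grad f(x) = A^T (A x - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
x_true = np.zeros(50); x_true[:5] = 1.0            # sparse ground truth
b = A.dot(x_true)
grad_f = lambda x: A.T.dot(A.dot(x) - b)
L = np.linalg.norm(A, 2) ** 2                      # Lipschitz constant of grad f
x = np.zeros(50)
for _ in range(500):
    x = shrink_step(x, grad_f, alpha=1.0 / L, tau=0.1)
print(np.count_nonzero(np.abs(x) > 1e-8))          # typically 5: the sparse support
```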
“Practical” Instances of ψ


Cases for which the subproblem is simple:
     ψ(z) = ‖z‖₁. Thus S_τ(y, α) = sign(y) max(|y| − ατ, 0). When y is
     complex, have

         S_τ(y, α) = [ max(|y| − τα, 0) / (max(|y| − τα, 0) + τα) ] y.

     ψ(z) = Σ_{g∈G} ‖z_[g]‖₂ or ψ(z) = Σ_{g∈G} ‖z_[g]‖_∞, where the z_[g], g ∈ G,
     are non-overlapping subvectors of z. Here

         S_τ(y, α)_[g] = [ max(‖y_[g]‖ − τα, 0) / (max(‖y_[g]‖ − τα, 0) + τα) ] y_[g].
     Z is a matrix and ψ(Z) = ‖Z‖_* is the nuclear norm of Z: the sum of
     singular values. The threshold operator is

         S_τ(Y, α) := arg min_Z  (1/(2α)) ‖Z − Y‖_F² + τ ‖Z‖_*,

     with solution obtained from the SVD Y = UΣVᵀ with U, V
     orthonormal and Σ = diag(σ_i)_{i=1,2,...,m}. Setting
     Σ̃ = diag(max(σ_i − τα, 0))_{i=1,2,...,m}, the solution is

         S_τ(Y, α) = U Σ̃ Vᵀ.

     (Actually not cheap to compute, but in some cases (e.g. σ_i − τα < 0
     for most i) approximate solutions can be found in reasonable time.)
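A minimal sketch of the singular-value thresholding operator above, using a dense SVD (fine for small matrices; as noted, large instances call for approximate or partial SVDs). The noisy low-rank test matrix is an assumption.

```python
import numpy as np

def nuclear_norm_shrink(Y, tau, alpha):
    """S_tau(Y, alpha) for psi = nuclear norm: soft-threshold the singular values."""
    U, sigma, Vt = np.linalg.svd(Y, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - tau * alpha, 0.0)
    return (U * sigma_shrunk) @ Vt               # U diag(sigma_shrunk) V^T

# Assumed example: a noisy rank-2 matrix; shrinking suppresses the noise directions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20)) + 0.01 * rng.normal(size=(30, 20))
Z = nuclear_norm_shrink(Y, tau=0.5, alpha=1.0)
print(np.linalg.matrix_rank(Z, tol=1e-6))        # typically 2 after thresholding
```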
Connections



The thresholding operator generalizes:
     Gradient methods for unconstrained minimization. Here ψ ≡ 0 and
     Sτ (y , α) = y .
     Projected gradient for minx∈Ω f (x) with Ω closed and convex. Here ψ
     is the indicator function for Ω (zero on Ω, ∞ elsewhere), and

                                    Sτ (y , α) = PΩ (y ),

     where PΩ is projection onto Ω.




Applications
LASSO for variable selection. Originally stated as
    min_x  (1/2) ‖Ax − b‖₂²    such that ‖x‖₁ ≤ T,

for parameter T > 0. Equivalent to an "ℓ2-ℓ1" formulation:

    min_x  (1/2) ‖Ax − b‖₂² + τ ‖x‖₁,    for some τ > 0.

Group LASSO for selection of variable "groups."

    min_x  (1/2) ‖Ax − b‖₂² + τ Σ_{g∈G} ‖x_[g]‖₂,

with each [g] a subset of indices {1, 2, . . . , n}.
     When groups [g] are disjoint, easy to solve the subproblem.
     Still true if ‖·‖₂ is replaced by ‖·‖_∞.
     When groups overlap, can replicate variables, to have one copy of
     each variable in each group — thus reformulate as non-overlapping.
Compressed Sensing. Sparse signal recovery from noisy measurements.
Given matrix A (with more columns than rows) and observation vector y ,
seek a sparse x (i.e. few nonzeros) such that Ax ≈ y. Solve

    min_x  (1/2) ‖Ax − y‖₂² + τ ‖x‖₁.

     Under "restricted isometry" properties on A ("tall, thin" column
     submatrices are nearly orthonormal), ‖x‖₁ is a good surrogate for
     card(x).
     Assume that A is not stored explicitly, but matrix-vector
     multiplications are available. Hence can compute f and ∇f.
     Often need the solution for a range of τ values.
ℓ1-Regularized Logistic Regression. Feature vectors x_i, i = 1, 2, . . . , m,
with labels ±1. Seek an odds function parametrized by w ∈ Rⁿ:

    p_+(x; w) := (1 + e^{wᵀx})⁻¹,    p_−(x; w) := 1 − p_+(x; w).

The scaled, negative log likelihood function L(w) is

    L(w) = −(1/m) [ Σ_{y_i=−1} log p_−(x_i; w) + Σ_{y_i=1} log p_+(x_i; w) ]

         = −(1/m) [ Σ_{y_i=−1} wᵀx_i − Σ_{i=1}^{m} log(1 + e^{wᵀx_i}) ].

To get a sparse w (i.e. classify on the basis of a few features) solve:

    min_w  L(w) + λ ‖w‖₁.
Matrix Completion. Seek a matrix X ∈ Rm×n with low rank that
matches certain observations, possibly noisy.
    min_X  (1/2) ‖A(X) − b‖₂² + τ ψ(X),

where A(X) is a linear mapping of the components of X (e.g.
element-wise observations).
Can have ψ as the nuclear norm — see discussion above for solution of
subproblems via SVD.
At NIPS 2010: "Practical Large-Scale Optimization for Max-Norm
Regularization" by Lee et al. discusses ψ(X) = ‖X‖_max:

    ‖X‖_max := inf{ ‖U‖_{2,∞} ‖V‖_{2,∞} : X = UVᵀ },

where ‖U‖_{2,∞} is the maximum ℓ2 norm of a row of U. The shrinking
operation can be solved efficiently using the "squash" operator.
Basic Algorithm


(Fukushima and Mine, 1981) for solving min_x f(x) + τ ψ(x).
 0: Choose x_0.
 k: Choose α_k > 0 and set

    x_{k+1} = S_τ(x_k − α_k ∇f(x_k); α_k)
            = arg min_z  ∇f(x_k)ᵀ(z − x_k) + (1/(2α_k)) ‖z − x_k‖₂² + τ ψ(z).

Straightforward, but can be fast when the regularization is strong (i.e. the
solution is "highly constrained").
Can show convergence for steps α_k ∈ (0, 2/L), where L is the bound on
∇²f. (Like a short-step gradient method.)
Enhancements

Alternatively, since αk plays the role of a steplength, can adjust it to get
better performance and guaranteed convergence.

     “Backtracking:” decrease αk until sufficient decrease condition holds.
     Use Barzilai-Borwein strategies to get nonmonotonic methods. By
     enforcing sufficient decrease every 10 iterations (say), still get global
     convergence.

The approach can be accelerated using optimal gradient techniques. See
earlier discussion of FISTA, where we solve the shrinking problem with
α_k = 1/L in place of a step along −∇f with this steplength.
Note that these methods reduce ultimately to gradient methods on a
reduced space: the optimal manifold defined by the regularizer ψ.
Acceleration or higher-order information can help improve performance.
Continuation in τ


Performance of basic shrinking methods is quite sensitive to τ.
Typically higher τ ⇒ stronger regularization ⇒ optimal manifold has lower
dimension. Hence, it's easier to identify the optimal manifold, and basic
shrinking methods can sometimes do so quickly.
For smaller τ, a simple "continuation" strategy can help:
 0: Given target value τ_f, choose initial τ_0 > τ_f, starting point x̄, and
    factor σ ∈ (0, 1).
 k: Find approx solution x(τ_k) of min_x f(x) + τ_k ψ(x), starting from x̄;
    if τ_k = τ_f then STOP;
    Set τ_{k+1} ← max(τ_f, στ_k) and x̄ ← x(τ_k);
Solution x(τ ) is often desired on a range of τ values anyway, so
    efforts for larger τ are not wasted.
    Accelerated methods such as FISTA are less sensitive to the “small τ ”
    issue.
    Not much analysis of this approach has been done. Better heuristics
    and theoretical support are needed.




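A minimal sketch of the continuation strategy above, using a few ISTA (shrinking) iterations as the inner solver; the problem data, inner iteration budget, and parameter choices are assumptions for illustration.

```python
import numpy as np

def continuation(inner_solver, tau0, tau_final, x_start, sigma=0.5):
    """Continuation in tau: solve a sequence of problems with decreasing tau,
    warm-starting each solve from the previous solution (the loop above)."""
    tau, x_bar = tau0, x_start
    while True:
        x_bar = inner_solver(tau, x_bar)           # approx solution x(tau_k)
        if tau == tau_final:
            return x_bar
        tau = max(tau_final, sigma * tau)

# Assumed inner solver: ISTA steps for f(x) = 0.5||Ax - b||^2 + tau ||x||_1.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
x_true = np.zeros(50); x_true[:5] = 1.0
b = A.dot(x_true)
L = np.linalg.norm(A, 2) ** 2

def ista(tau, x, iters=200):
    for _ in range(iters):
        y = x - A.T.dot(A.dot(x) - b) / L                           # gradient step, alpha = 1/L
        x = np.sign(y) * np.maximum(np.abs(y) - tau / L, 0.0)       # shrink
    return x

x = continuation(ista, tau0=10.0, tau_final=0.1, x_start=np.zeros(50))
print(np.count_nonzero(np.abs(x) > 1e-8))          # typically 5: the sparse support
```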
Stochastic Gradient + Regularization
Solve the regularized problem, but have only estimates of ∇f(x_k).
We can combine dual averaging, stochastic gradient, and shrinking: see
Xiao (2010).

    min_x  φ_τ(x) := E_ξ f(x; ξ) + τ ψ(x).

At iteration k choose ξ_k randomly and i.i.d. from the ξ distribution, and
choose g_k ∈ ∂f(x_k; ξ_k). Use these to define the averaged subgradient
ḡ_k = Σ_{i=1}^{k} g_i/(k + 1), and solve the subproblem

    x_{k+1} = arg min_x  ḡ_kᵀ x + τ ψ(x) + (γ/√k) ‖x − x_0‖².

Same as earlier, but with the regularizer ψ included explicitly.
Can prove convergence results for the averaged iterates x̄_k: roughly

    E[φ_τ(x̄_k)] − φ_τ* ≤ C/√k,

where the expectation of φ is taken over the random number stream
ξ_0, ξ_1, . . . , ξ_{k−1}.
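For ψ = ‖·‖₁ the regularized dual-averaging subproblem above separates by coordinate and reduces to a soft-threshold around x_0. A minimal sketch of one such step; the closed form is my derivation under that assumption, and the numerical values are illustrative.

```python
import numpy as np

def rda_l1_step(g_sum, k, x0, tau, gamma=1.0):
    """One regularized dual-averaging step for psi = ||.||_1:
    argmin_x  gbar^T x + tau ||x||_1 + (gamma/sqrt(k)) ||x - x0||^2,
    solved coordinatewise by a soft-threshold."""
    gbar = g_sum / (k + 1)                     # averaged subgradient
    c = gamma / np.sqrt(k)                     # prox coefficient
    u = x0 - gbar / (2.0 * c)                  # minimizer ignoring the l1 term
    t = tau / (2.0 * c)                        # effective threshold
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# Tiny usage example with assumed values (two subgradients accumulated, k = 2).
g_sum = np.array([0.3, -2.0, 0.05]) + np.array([0.1, -1.5, -0.02])
print(rda_l1_step(g_sum, k=2, x0=np.zeros(3), tau=0.5))
```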
Further Reading

  1   F. Bach. “Sparse methods for machine learning: Theory and algorithms” Tutorial
      at NIPS 2009. (Slides on web.)
  2   M. Fukushima and H. Mine. “A generalized proximal point algorithm for certain
      non-convex minimization problems.” International Journal of Systems Science, 12,
      pp. 989–1000, 1981.
  3   P. L. Combettes and V. R. Wajs. “Signal recovery by proximal forward-backward
      splitting.” Multiscale Modeling and Simulation, 4, pp. 1168–1200, 2005.
  4   S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. “Sparse reconstruction by
      separable approximation.” IEEE Transactions on Signal Processing, 57, pp.
      2479–2493, 2009.
  5   E. Candès, J. Romberg, and T. Tao. “Stable signal recovery from incomplete and
      inaccurate measurements.” Communications on Pure and Applied Mathematics,
      59, pp. 1207–1223, 2006.
  6   L. Xiao. “Dual averaging methods for regularized stochastic learning and online
      optimization.” TechReport MSR-TR-2010-23, Microsoft Research, March 2010.



IV. Optimal Manifold Identification


When constraints x ∈ Ω or a nonsmooth regularizer ψ(x) are present,
identification of the manifold on which x* lies can improve algorithm
performance, by focusing attention on a reduced space. We can thus
evaluate partial gradients and Hessians, restricted to just this space.
For a nonsmooth regularizer ψ, the active manifold is a smooth surface
passing through x* along which ψ is smooth.
Example: for ψ(x) = ‖x‖₁, have the manifold consisting of z with

    z_i ≥ 0 if x*_i > 0,    z_i ≤ 0 if x*_i < 0,    z_i = 0 if x*_i = 0.
For a polyhedral Ω, the active manifold is the face on which x ∗ lies.
Example: For Ω = [0, 1]ⁿ, the active manifold consists of z with

    z_i = 1 if x*_i = 1,    z_i = 0 if x*_i = 0,    z_i ∈ [0, 1] if x*_i ∈ (0, 1).

[Figure: a polyhedral set Ω with the solution x* lying on a one-dimensional face M.]

Can parametrize M with a single variable.
Identification
Algorithms of shrinking / gradient projection type can identify the optimal
manifold M, or a good approximation to it, without knowing x ∗ .
HOW? Optimality conditions for the problem min_{x∈Ω} f(x) are

    x* − ∇f(x*) ∈ x* + N_Ω(x*),

where N_Ω(x) is the normal cone to Ω at x:

    N_Ω(x) = {s | sᵀ(z − x) ≤ 0 for all z ∈ Ω}.

When M is defined appropriately, we find that

    R(Ω, M) := {x + s | x ∈ M and s ∈ N_Ω(x)}

has an interior in Rⁿ. If x* satisfies a nondegeneracy condition — that is,
−∇f(x*) ∈ ri N_Ω(x*) (relative interior) — we have

    x − ∇f(x) ∈ R(Ω, M)

for all x sufficiently close to x*. From such points, projection recovers M:

    P_Ω(x − ∇f(x)) ∈ M.
[Figure: the constraint set near x*, with the normal cone N at x* and the point x* − ∇f(x*) lying inside x* + N.]
The non-averaged iterates from gradient projection methods eventually lie
on the correct manifold M. The same is true for dual-averaging methods
(where the gradient term is averaged over all steps).
When we have a nonsmooth regularizer ψ, instead of Ω, the analogous
property is that the solution of the shrink subproblem, for some fixed
positive α, lies on the optimal manifold M.
Under reasonable conditions on αk , the “basic” shrink method eventually
has all its iterates on M. Also true for dual-averaged methods.
In practice, heuristics are often used to decide when M (or a small superset)
has been reached (one simple heuristic is sketched below). If a distance-to-solution
bound is available, and a Lipschitz constant for ∇f is known, this decision can be
made more rigorous.
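One heuristic of the kind referred to above, sketched under my own assumptions (the patience parameter and tolerance are arbitrary): for an ℓ1-regularized problem, declare M identified once the sign/support pattern of the iterates has stopped changing.

```python
import numpy as np

def support_settled(iterates, patience=10, tol=1e-10):
    """Heuristic: return True if the sign/support pattern of the last `patience`
    iterates is identical, suggesting the optimal manifold has been reached."""
    if len(iterates) < patience:
        return False
    patterns = [tuple(np.sign(np.where(np.abs(x) > tol, x, 0.0)))
                for x in iterates[-patience:]]
    return all(p == patterns[0] for p in patterns)
```

Once such a test fires, subsequent iterations can switch attention to the reduced space (partial gradients and reduced Hessians), as discussed on the following slides.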




 Stephen Wright (UW-Madison)   Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   65 / 82
How Might This Help?
Consider again logistic regression with regularizer ψ(w ) = ‖w ‖1 , where

      L(w ) = −(1/m) [ Σ_{i: yi = −1} w T xi  −  Σ_{i=1}^m log(1 + e^{w T xi }) ].


Requires calculation of Xw , where X = [xiT ]_{i=1}^m . (Can be cheap if w has
few nonzeros.) For the gradient, have

     ∇L(w ) = (1/m) X T u,   where  ui = −(1 + e^{w T xi })^{−1}  if yi = −1,
                                    ui = (1 + e^{−w T xi })^{−1}  if yi = +1.

Evaluating the gradient requires m exponentials and a matrix-vector multiply
(with a full vector u).
If just a subset G of gradient components is needed, only the columns of X
indexed by G (the submatrix X·G ) are involved, which is much cheaper than the
full gradient when |G| ≪ n.
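As a concrete sketch of that saving (function and variable names are mine; X is assumed to be a dense m × n array and y a ±1 label vector), the partial gradient over a working set G costs one pass to form Xw plus a multiply by the |G| selected columns:

```python
import numpy as np

def partial_grad_logistic(X, y, w, G):
    """Components G of the gradient of L(w) above: (1/m) * X[:, G].T @ u, with
    u_i = sigma(x_i'w) for y_i = +1 and u_i = sigma(x_i'w) - 1 for y_i = -1,
    which matches the two cases displayed on the slide."""
    m = X.shape[0]
    z = X @ w                        # can be cheap if w has few nonzeros
    u = 1.0 / (1.0 + np.exp(-z))     # m exponentials
    u = np.where(y == -1, u - 1.0, u)
    return X[:, G].T @ u / m         # touches only |G| columns of X
```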
 Stephen Wright (UW-Madison)   Optimization in Machine Learning       NIPS Tutorial, 6 Dec 2010   66 / 82
Reduced Hessian



        ∇2 L(w ) = (1/m) X T diag(v )X ,   where  vi = e^{w T xi } / (1 + e^{w T xi })^2 .

Often much cheaper to calculate the |G| × |G| reduced Hessian than the full
Hessian.
Can use sampling (Byrd et al., 2010) to approximate the projected Hessian:
take a subset S ⊂ {1, 2, . . . , m} and use XSG in place of X·G . Reduces the
evaluation cost by a factor of |S|/m.
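Continuing the sketch above (names and the 1/|S| scaling are my own choices), the sampled reduced Hessian uses only the rows in S and the columns in G:

```python
import numpy as np

def reduced_hessian_logistic(X, w, G, S=None):
    """|G| x |G| block of the Hessian of L(w), optionally subsampled over the
    examples in S (S=None gives the exact reduced Hessian)."""
    Xs = X if S is None else X[S]
    z = Xs @ w
    sig = 1.0 / (1.0 + np.exp(-z))
    v = sig * (1.0 - sig)                 # equals e^{w'x_i} / (1 + e^{w'x_i})^2
    XsG = Xs[:, G]
    return XsG.T @ (v[:, None] * XsG) / Xs.shape[0]
```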




 Stephen Wright (UW-Madison)      Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   67 / 82
Higher-Order Shrinking Method
for the problem minx f (x) + τ ‖x‖1 . Step k:
     Choose a subset Gk ⊇ {i | xk (i) ≠ 0};
     Evaluate ∇Gk f (xk ) and solve (in closed form):
         min_d  ∇f (xk )T d + (1/(2αk )) d T d + τ ‖xk + d‖1 ,   s.t. d(i) = 0 for i ∉ Gk ;
     Repeat with a decreasing sequence of αk , until sufficient decrease is obtained;
     Set xk+ = xk + d ;
     Define Ck ⊂ Gk by Ck := {i | xk+ (i) ≠ 0};
     Calculate a reduced Newton step in the Ck components with (approximate)
     reduced Hessian HCk Ck and reduced gradient ∇Ck f , evaluated at xk+ . Take this
     reduced step if it gives an improvement in the objective, yielding xk+1 ;
     otherwise settle for xk+1 ← xk+ .
Can be generalized to other separable regularizers ψ(x). The choice of subset
must conform to the separability structure of the regularizer, and all components
must be checked periodically.
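A schematic version of one step, under simplifications of my own (the shrink subproblem restricted to Gk is solved by componentwise soft-thresholding; `grad` and `hess_block` are user-supplied callables returning the partial gradient and reduced Hessian, e.g. the logistic-regression sketches above; the sufficient-decrease test is a simple one of my choosing):

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def shrink_newton_step(x, f, grad, hess_block, tau, G, alpha0=1.0, sigma=1e-4):
    """One step of the scheme sketched above. G must contain the support of x."""
    obj = lambda z: f(z) + tau * np.abs(z).sum()

    # Shrink step restricted to G, with alpha decreased until sufficient decrease.
    g = grad(x, G)                                        # partial gradient on G
    alpha = alpha0
    while True:
        x_plus = x.copy()
        x_plus[G] = soft(x[G] - alpha * g, alpha * tau)   # closed-form subproblem solution
        step = x_plus - x
        if obj(x_plus) <= obj(x) - (sigma / alpha) * step.dot(step):
            break
        alpha *= 0.5

    # Reduced Newton step on the surviving nonzeros C_k, accepted only if it helps.
    C = np.flatnonzero(x_plus)
    if C.size > 0:
        rhs = -(grad(x_plus, C) + tau * np.sign(x_plus[C]))
        d = np.linalg.solve(hess_block(x_plus, C), rhs)
        x_try = x_plus.copy()
        x_try[C] += d
        if obj(x_try) < obj(x_plus):
            return x_try
    return x_plus
```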
 Stephen Wright (UW-Madison)      Optimization in Machine Learning      NIPS Tutorial, 6 Dec 2010   68 / 82
Newton-Like Methods

Newton-like methods are ubiquitous in smooth optimization — motivated
by second-order Taylor series.
(Basic) Newton’s Method steps obtained from

   xk+1 = arg min_z  f (xk ) + ∇f (xk )T (z − xk ) + (1/2)(z − xk )T Hk (z − xk ),

where Hk = ∇2 f (xk ). Near a local minimizer with second-order sufficient
conditions, converges superlinearly: ‖xk+1 − x ∗ ‖ = o(‖xk − x ∗ ‖).
Can modify by
     adding a prox-term, e.g. a multiple of ‖z − xk ‖² ;
     adding a trust-region constraint ‖z − xk ‖ ≤ ∆k (equivalent);
     doing a line search along the Newton direction d (a minimal sketch follows this list).
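A minimal sketch of the line-search variant from the last item (my own skeleton; it assumes hess(x) is positive definite, so that d is a descent direction):

```python
import numpy as np

def damped_newton_step(x, f, grad, hess, beta=0.5, c=1e-4, max_backtracks=50):
    """Newton direction d = -H^{-1} grad f(x), shortened by backtracking until
    the Armijo sufficient-decrease condition holds."""
    g = grad(x)
    d = np.linalg.solve(hess(x), -g)
    t = 1.0
    for _ in range(max_backtracks):
        if f(x + t * d) <= f(x) + c * t * g.dot(d):
            break
        t *= beta
    return x + t * d
```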


 Stephen Wright (UW-Madison)   Optimization in Machine Learning        NIPS Tutorial, 6 Dec 2010   69 / 82
Choices of Hk
The Hessian ∇2 f is often expensive to evaluate, so approximations can be used, e.g.
     Re-use ∇2 f from a previous iterate.
     Use a sampled approximation to ∇2 f (xk ) (see above).
     Use a diagonal approximation, which at least gets the scaling right; see
     e.g. Barzilai-Borwein above.
     Quasi-Newton methods (BFGS, L-BFGS), which define Hk to be a
     matrix that mimics the behavior of the true Hessian over previous
     steps. Requires only gradients ∇f .
     Other approximations that exploit the structure of the problem, e.g. for
     nonlinear least squares f (x) = (1/2)‖r (x)‖² with r : Rn → Rm , the
     Hessian is
                 ∇2 f (x) = J(x)T J(x) + Σ_{i=1}^m ri (x) ∇2 ri (x),
     where J is the m × n Jacobian of r . In the Gauss-Newton method, use
     H(x) = J(x)T J(x) (a minimal sketch follows this list).
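For the last item, a small sketch of one Gauss-Newton step (the residual and Jacobian callables are assumptions; a practical code would add damping, as in Levenberg-Marquardt, or a line search):

```python
import numpy as np

def gauss_newton_step(x, residual, jacobian):
    """One step for f(x) = 0.5*||r(x)||^2 with H(x) = J(x)'J(x): solve the
    linear least-squares problem min_d ||J d + r||, i.e. J'J d = -J'r."""
    r = residual(x)                              # r(x) in R^m
    J = jacobian(x)                              # m x n Jacobian of r
    d, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return x + d
```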
 Stephen Wright (UW-Madison)        Optimization in Machine Learning       NIPS Tutorial, 6 Dec 2010   70 / 82
Higher-Order Information and Constraints

Higher-order methods can be extended to presence of constraints x ∈ Ω or
regularizers ψ(x) provided these elements can be incorporated explicitly
into the subproblems. e.g. for constraints
     xk+1 = arg min_{z∈Ω}  f (xk ) + ∇f (xk )T (z − xk ) + (1/2)(z − xk )T Hk (z − xk ),

and for regularizers

     xk+1 = arg min_z  f (xk ) + ∇f (xk )T (z − xk ) + (1/2)(z − xk )T Hk (z − xk ) + τ ψ(z).
These subproblems are typically harder to solve than the “shrink”
subproblem, unless Hk is simple (e.g. diagonal).
In practice, can do manifold identification and reduction (see above), or
form simpler approximations to Ω (but then may need to incorporate
curvature information about Ω into Hk to ensure fast convergence).
 Stephen Wright (UW-Madison)   Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   71 / 82
Solving for xk+1


When Hk is positive definite, can solve the Newton equations directly:

                               Hk (z − xk ) = −∇f (xk ).

Alternatively, apply conjugate gradient (CG) to this system to get an inexact
solution. Each CG iteration requires one multiplication by Hk .
Can precondition CG, e.g. by using sample approximations, structured
approximations, or Krylov subspace information gathered at previous
evaluations of ∇2 f .
L-BFGS stores Hk in implicit form, by means of 2m vectors in Rn , for a
small parameter m (e.g. 5). Recovers the solution of the equation above via
roughly 2m inner products.
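A bare-bones conjugate gradient loop for the system Hk d = −∇f (xk ) mentioned above; note that only Hessian-vector products are required (the stopping rule and iteration cap are my own choices):

```python
import numpy as np

def cg_newton_direction(hessvec, g, max_iter=50, rtol=1e-6):
    """Approximately solve H d = -g by conjugate gradient, given a routine
    hessvec(v) that returns H @ v (H assumed symmetric positive definite)."""
    d = np.zeros_like(g)
    r = -g.copy()                    # residual of H d = -g at d = 0
    p = r.copy()
    rs = r.dot(r)
    for _ in range(max_iter):
        Hp = hessvec(p)
        alpha = rs / p.dot(Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r.dot(r)
        if np.sqrt(rs_new) <= rtol * np.linalg.norm(g):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```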



 Stephen Wright (UW-Madison)     Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   72 / 82
“Higher-Order” Methods in ML
Several approaches tried (in addition to sampling and reduced-space
techniques discussed above).
     Bordes et al. (2009, corrected 2010) for SVM
                              τ w T w + Σ_{i=1}^m ℓ(w ; xi , yi ),

     scales the stochastic gradient step with a diagonal Hk , obtained from
     finite differences of the last estimated gradient over the last step.
     Schraudolph et al. (AISTATS 2007) “online BFGS” uses conventional
     quasi-Newton update formulae (e.g. L-BFGS) based on estimated
     gradient differences over previous steps.

Since the gradients are so inexact (based on just one data point), both in
update and right-hand side of the step equations, these methods are really
stochastic gradient with interesting scaling, rather than quasi-Newton in
the conventional sense.
 Stephen Wright (UW-Madison)   Optimization in Machine Learning         NIPS Tutorial, 6 Dec 2010   73 / 82
Further Reading

  1   L. Bottou and A. Moore, “Learning with large datasets,” Tutorial at NIPS 2007
      (slides on web).
  2   J. Nocedal and S. J. Wright, Numerical Optimization, 2nd edition, Springer, 2006.
  3   R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. “On the use of stochastic
      Hessian information in unconstrained optimization.” Technical Report,
      Northwestern University, June 2010.
  4   M. Fisher, J. Nocedal, Y. Tremolet, and S. J. Wright. “Data assimilation in
      weather forecasting: A case study in PDE-constrained optimization.” Optimization
      and Engineering, 10, pp. 409–426, 2009.
  5   A. S. Lewis and S. J. Wright. “Identifying activity.” Technical report, ORIE,
      Cornell University. Revised April 2010.
  6   S. J. Wright, “Accelerated block-coordinate relaxation for regularized
      optimization,” Technical Report, August 2010.
  7   W. Hare and A. Lewis. “Identifying active constraints via partial smoothness and
      prox-regularity.” Journal of Convex Analysis, 11, pp. 251–266, 2004.


 Stephen Wright (UW-Madison)     Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   74 / 82
V. Decomposition / Coordinate Relaxation



For min f (x), at iteration k, choose a subset Gk ⊂ {1, 2, . . . , n} and take
a step dk only in these components, i.e. fix dk (i) = 0 for i ∉ Gk .
Gives more manageable subproblem sizes, in practice.
Can
      take a reduced gradient step in the Gk components;
      take multiple “inner iterations” in these components; or
      actually solve the reduced subproblem in the space defined by Gk .




 Stephen Wright (UW-Madison)   Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   75 / 82
Constraints and Regularizers Complicate Things


For minx∈Ω f (x), need to put enough components into Gk to stay feasible,
as well as make progress.
Example: min f (x1 , x2 ) subject to x1 + x2 = 1. Relaxation with Gk = {1} or
Gk = {2} won't work: changing a single coordinate always violates the constraint,
so the only feasible step is the zero step.
For a separable regularizer (e.g. Group LASSO) with

                               ψ(x) = Σ_{g ∈ G} ψg (x[g ] ),

need to ensure that Gk is a union of some of the index subsets [g ], i.e. the
relaxation components must be consonant with the partitioning.
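A tiny helper illustrating this consonance requirement (the group layout and names are hypothetical): any tentative choice of components is expanded to the union of the groups it touches.

```python
def expand_to_groups(tentative, groups):
    """Return the union of all groups [g] that intersect the tentative index set,
    so the working set G_k respects the regularizer's partition."""
    tentative = set(tentative)
    G = set()
    for g in groups:                 # groups: list of index lists, e.g. [[0,1,2],[3,4],[5]]
        if tentative & set(g):
            G.update(g)
    return sorted(G)

# Example: expand_to_groups({1, 4}, [[0, 1, 2], [3, 4], [5]]) -> [0, 1, 2, 3, 4]
```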




 Stephen Wright (UW-Madison)   Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   76 / 82
Decomposition and Dual SVM


Decomposition has long been popular for solving the dual (QP)
formulation of SVM, since the number of variables (= number of training
examples) is sometimes very large.

     SMO: Each Gk has two components.
     LIBSVM: SMO approach (still |Gk | = 2), with different heuristic for
     choosing Gk .
     LASVM: Again |Gk | = 2, with focus on online setting.
     SVM-light: Small |Gk | (default 10).
     GPDT: Larger |Gk | (default 400) with gradient projection solver as
     inner loop.




 Stephen Wright (UW-Madison)   Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   77 / 82
Choice of Gk and Convergence Results
Some methods (e.g. Tseng and Yun, 2010) require Gk to be chosen so
that the improvement in subproblem objective obtained over the subset Gk
is at least a fixed fraction of the improvement available over the whole
space. This is undesirable, since checking it usually requires evaluating the full
gradient ∇f (xk ).
Alternative is a generalized Gauss-Seidel requirement, where each
coordinate is “touched” at least once every T iterations:
                      Gk ∪ Gk+1 ∪ . . . ∪ Gk+T −1 = {1, 2, . . . , n}.
Can show global convergence (e.g. Tseng and Yun, 2009; Wright, 2010).
There are also results on
     global linear convergence rates
     optimal manifold identification
     fast local convergence for an algorithm that takes reduced steps on
     the estimated optimal manifold.
All deterministic analyses.
 Stephen Wright (UW-Madison)      Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   78 / 82
Stochastic Coordinate Descent
Analysis tools of stochastic gradient may be useful. If steps have the form
xk+1 = xk − αk gk , where

          gk (i) = [∇f (xk )]i  if i ∈ Gk ,      gk (i) = 0  otherwise,

then with suitable random selection of Gk we can ensure that gk (appropriately
scaled) is an unbiased estimate of ∇f (xk ). Hence can apply the SGD
techniques discussed earlier to choose αk and obtain convergence.
Nesterov (2010) proposes another randomized approach for the
unconstrained problem with known separate Lipschitz constants Li :

          | ∂f /∂xi (x + hei ) − ∂f /∂xi (x) | ≤ Li |h|,    i = 1, 2, . . . , n.

(Works with blocks too, instead of individual components.)
 Stephen Wright (UW-Madison)         Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   79 / 82
At step k:
     Choose index ik ∈ {1, 2, . . . , n} with probability pi := Li /( Σ_{j=1}^n Lj );
     Take a gradient step in the ik component:

                        xk+1 = xk − (1/Lik ) [∂f /∂xik (xk )] eik .

Basic convergence result:
                              E [f (xk )] − f ∗ ≤ C /k.

As for SA (earlier), but without any strong convexity assumption.
Can also get linear convergence results (in expectation) by assuming
strong convexity of f , according to different norms.
Can also accelerate in the usual fashion (see above), to improve the expected
convergence rate to O(1/k 2 ).
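A minimal sketch of this randomized rule (the partial-derivative callable and all parameter values are my own). For a convex quadratic f (x) = (1/2)x T Ax − b T x, for instance, one may take partial_grad(x, i) = A[i] @ x − b[i] and Li = A[i, i].

```python
import numpy as np

def randomized_coordinate_descent(x0, partial_grad, L, num_iters=1000, seed=0):
    """Nesterov-style randomized coordinate descent: sample coordinate i with
    probability L_i / sum_j L_j, then set x_i <- x_i - (1/L_i) * df/dx_i(x)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float)
    p = np.asarray(L, dtype=float)
    p = p / p.sum()
    for _ in range(num_iters):
        i = rng.choice(x.size, p=p)
        x[i] -= partial_grad(x, i) / L[i]
    return x
```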

 Stephen Wright (UW-Madison)   Optimization in Machine Learning    NIPS Tutorial, 6 Dec 2010   80 / 82
Further Reading



  1   P. Tseng and S. Yun, “A coordinate gradient descent method for linearly
      constrained smooth optimization and support vector machines training.”
      Computational Optimization and Applications, 47, pp. 179–206, 2010.
  2   P. Tseng and S. Yun, “A coordinate gradient descent method for nonsmooth
      separable minimization.” Mathematical Programming, Series B, 117, pp.
      387–423, 2009.
  3   S. J. Wright, “Accelerated block-coordinate relaxation for regularized
      optimization.” Technical Report, UW-Madison, August 2010.
  4   Y. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization
      problems.” CORE Discussion Paper 2010/2, CORE, UCL, January 2010.




 Stephen Wright (UW-Madison)     Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   81 / 82
Conclusions


We’ve surveyed a number of topics in algorithmic fundamentals, with an
eye on recent developments, and on topics of relevance (current or future)
to machine learning.
The talk was far from exhaustive; the literature on optimization in ML is
huge and growing.
There is much more to be gained from the interaction between the two
areas.


                                           FIN




 Stephen Wright (UW-Madison)   Optimization in Machine Learning   NIPS Tutorial, 6 Dec 2010   82 / 82

More Related Content

What's hot

Tensor Decomposition and its Applications
Tensor Decomposition and its ApplicationsTensor Decomposition and its Applications
Tensor Decomposition and its ApplicationsKeisuke OTAKI
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes FamilyKota Matsui
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayesPhong Vo
 
Intro probability 4
Intro probability 4Intro probability 4
Intro probability 4Phong Vo
 
On estimating the integrated co volatility using
On estimating the integrated co volatility usingOn estimating the integrated co volatility using
On estimating the integrated co volatility usingkkislas
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processesKazuki Fujikawa
 
MCMC and likelihood-free methods
MCMC and likelihood-free methodsMCMC and likelihood-free methods
MCMC and likelihood-free methodsChristian Robert
 
Neural Processes
Neural ProcessesNeural Processes
Neural ProcessesSangwoo Mo
 
Integration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methodsIntegration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methodsMercier Jean-Marc
 
Principal Component Analysis for Tensor Analysis and EEG classification
Principal Component Analysis for Tensor Analysis and EEG classificationPrincipal Component Analysis for Tensor Analysis and EEG classification
Principal Component Analysis for Tensor Analysis and EEG classificationTatsuya Yokota
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical MethodsChristian Robert
 
Matrix Computations in Machine Learning
Matrix Computations in Machine LearningMatrix Computations in Machine Learning
Matrix Computations in Machine Learningbutest
 
2012 mdsp pr06  hmm
2012 mdsp pr06  hmm2012 mdsp pr06  hmm
2012 mdsp pr06  hmmnozomuhamada
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer visionzukun
 
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithmsRao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithmsChristian Robert
 

What's hot (20)

Tensor Decomposition and its Applications
Tensor Decomposition and its ApplicationsTensor Decomposition and its Applications
Tensor Decomposition and its Applications
 
Neural Processes Family
Neural Processes FamilyNeural Processes Family
Neural Processes Family
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayes
 
Intro probability 4
Intro probability 4Intro probability 4
Intro probability 4
 
On estimating the integrated co volatility using
On estimating the integrated co volatility usingOn estimating the integrated co volatility using
On estimating the integrated co volatility using
 
Conditional neural processes
Conditional neural processesConditional neural processes
Conditional neural processes
 
MCMC and likelihood-free methods
MCMC and likelihood-free methodsMCMC and likelihood-free methods
MCMC and likelihood-free methods
 
Neural Processes
Neural ProcessesNeural Processes
Neural Processes
 
Integration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methodsIntegration with kernel methods, Transported meshfree methods
Integration with kernel methods, Transported meshfree methods
 
Principal Component Analysis for Tensor Analysis and EEG classification
Principal Component Analysis for Tensor Analysis and EEG classificationPrincipal Component Analysis for Tensor Analysis and EEG classification
Principal Component Analysis for Tensor Analysis and EEG classification
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
 
Matrix Computations in Machine Learning
Matrix Computations in Machine LearningMatrix Computations in Machine Learning
Matrix Computations in Machine Learning
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 
2012 mdsp pr06  hmm
2012 mdsp pr06  hmm2012 mdsp pr06  hmm
2012 mdsp pr06  hmm
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer vision
 
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithmsRao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
Rao-Blackwellisation schemes for accelerating Metropolis-Hastings algorithms
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 

Viewers also liked

Optimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive AnalyticsOptimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive AnalyticsNYC Predictive Analytics
 
Trust Models for RDF Data: Semantics and Complexity - AAAI2015
Trust Models for RDF Data: Semantics and Complexity - AAAI2015Trust Models for RDF Data: Semantics and Complexity - AAAI2015
Trust Models for RDF Data: Semantics and Complexity - AAAI2015Valeria Fionda
 
Formal Maps and their Algebra
Formal Maps and their AlgebraFormal Maps and their Algebra
Formal Maps and their AlgebraValeria Fionda
 
Ancestral Causal Inference - NIPS 2016 poster
Ancestral Causal Inference - NIPS 2016 posterAncestral Causal Inference - NIPS 2016 poster
Ancestral Causal Inference - NIPS 2016 posterSara Magliacane
 
ISWC 2012 "Efficient execution of top-k SPARQL queries"
ISWC 2012 "Efficient execution of top-k SPARQL queries"ISWC 2012 "Efficient execution of top-k SPARQL queries"
ISWC 2012 "Efficient execution of top-k SPARQL queries"Sara Magliacane
 
Ancestral Causal Inference - WIML 2016 @ NIPS
Ancestral Causal Inference - WIML 2016 @ NIPSAncestral Causal Inference - WIML 2016 @ NIPS
Ancestral Causal Inference - WIML 2016 @ NIPSSara Magliacane
 
Alexis Ohanian talks about the early days of reddit at Mass Challenge
Alexis Ohanian talks about the early days of reddit at Mass ChallengeAlexis Ohanian talks about the early days of reddit at Mass Challenge
Alexis Ohanian talks about the early days of reddit at Mass ChallengeAlexis Ohanian
 
Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inferenceKemal İnciroğlu
 
Talk: Joint causal inference on observational and experimental data - NIPS 20...
Talk: Joint causal inference on observational and experimental data - NIPS 20...Talk: Joint causal inference on observational and experimental data - NIPS 20...
Talk: Joint causal inference on observational and experimental data - NIPS 20...Sara Magliacane
 
Regression: A skin-deep dive
Regression: A skin-deep diveRegression: A skin-deep dive
Regression: A skin-deep diveabulyomon
 
Simple Linear Regression
Simple Linear RegressionSimple Linear Regression
Simple Linear RegressionSharlaine Ruth
 
Simple linear regression and correlation
Simple linear regression and correlationSimple linear regression and correlation
Simple linear regression and correlationShakeel Nouman
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regressionMaria Theresa
 
Simple Linear Regression (simplified)
Simple Linear Regression (simplified)Simple Linear Regression (simplified)
Simple Linear Regression (simplified)Haoran Zhang
 
NG BB 36 Simple Linear Regression
NG BB 36 Simple Linear RegressionNG BB 36 Simple Linear Regression
NG BB 36 Simple Linear RegressionLeanleaders.org
 
Methods Of Cooling Of Electrical Machines
Methods Of Cooling Of Electrical MachinesMethods Of Cooling Of Electrical Machines
Methods Of Cooling Of Electrical MachinesRahuldey1991
 

Viewers also liked (20)

Optimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive AnalyticsOptimization: A Framework for Predictive Analytics
Optimization: A Framework for Predictive Analytics
 
Trust Models for RDF Data: Semantics and Complexity - AAAI2015
Trust Models for RDF Data: Semantics and Complexity - AAAI2015Trust Models for RDF Data: Semantics and Complexity - AAAI2015
Trust Models for RDF Data: Semantics and Complexity - AAAI2015
 
Formal Maps and their Algebra
Formal Maps and their AlgebraFormal Maps and their Algebra
Formal Maps and their Algebra
 
Ancestral Causal Inference - NIPS 2016 poster
Ancestral Causal Inference - NIPS 2016 posterAncestral Causal Inference - NIPS 2016 poster
Ancestral Causal Inference - NIPS 2016 poster
 
ISWC 2012 "Efficient execution of top-k SPARQL queries"
ISWC 2012 "Efficient execution of top-k SPARQL queries"ISWC 2012 "Efficient execution of top-k SPARQL queries"
ISWC 2012 "Efficient execution of top-k SPARQL queries"
 
Ancestral Causal Inference - WIML 2016 @ NIPS
Ancestral Causal Inference - WIML 2016 @ NIPSAncestral Causal Inference - WIML 2016 @ NIPS
Ancestral Causal Inference - WIML 2016 @ NIPS
 
Alexis Ohanian talks about the early days of reddit at Mass Challenge
Alexis Ohanian talks about the early days of reddit at Mass ChallengeAlexis Ohanian talks about the early days of reddit at Mass Challenge
Alexis Ohanian talks about the early days of reddit at Mass Challenge
 
Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inference
 
Talk: Joint causal inference on observational and experimental data - NIPS 20...
Talk: Joint causal inference on observational and experimental data - NIPS 20...Talk: Joint causal inference on observational and experimental data - NIPS 20...
Talk: Joint causal inference on observational and experimental data - NIPS 20...
 
Regression: A skin-deep dive
Regression: A skin-deep diveRegression: A skin-deep dive
Regression: A skin-deep dive
 
Chapter13
Chapter13Chapter13
Chapter13
 
C2.1 intro
C2.1 introC2.1 intro
C2.1 intro
 
Simple Linear Regression
Simple Linear RegressionSimple Linear Regression
Simple Linear Regression
 
Ch14
Ch14Ch14
Ch14
 
Simple linear regression and correlation
Simple linear regression and correlationSimple linear regression and correlation
Simple linear regression and correlation
 
Simple linear regression
Simple linear regressionSimple linear regression
Simple linear regression
 
Chapter13
Chapter13Chapter13
Chapter13
 
Simple Linear Regression (simplified)
Simple Linear Regression (simplified)Simple Linear Regression (simplified)
Simple Linear Regression (simplified)
 
NG BB 36 Simple Linear Regression
NG BB 36 Simple Linear RegressionNG BB 36 Simple Linear Regression
NG BB 36 Simple Linear Regression
 
Methods Of Cooling Of Electrical Machines
Methods Of Cooling Of Electrical MachinesMethods Of Cooling Of Electrical Machines
Methods Of Cooling Of Electrical Machines
 

Similar to NIPS2010: optimization algorithms in machine learning

The Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability DistributionThe Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability DistributionPedro222284
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsFrank Nielsen
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector MachinesEdgar Marca
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionCharles Deledalle
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Alexander Litvinenko
 
Optimal multi-configuration approximation of an N-fermion wave function
 Optimal multi-configuration approximation of an N-fermion wave function Optimal multi-configuration approximation of an N-fermion wave function
Optimal multi-configuration approximation of an N-fermion wave functionjiang-min zhang
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Tomoya Murata
 
On fixed point theorem in fuzzy metric spaces
On fixed point theorem in fuzzy metric spacesOn fixed point theorem in fuzzy metric spaces
On fixed point theorem in fuzzy metric spacesAlexander Decker
 
Expert Lecture on GPS at UIET, CSJM, Kanpur
Expert Lecture on GPS at UIET, CSJM, KanpurExpert Lecture on GPS at UIET, CSJM, Kanpur
Expert Lecture on GPS at UIET, CSJM, KanpurSuddhasheel GHOSH, PhD
 
Least squares support Vector Machine Classifier
Least squares support Vector Machine ClassifierLeast squares support Vector Machine Classifier
Least squares support Vector Machine ClassifierRaj Sikarwar
 
Mit2 092 f09_lec23
Mit2 092 f09_lec23Mit2 092 f09_lec23
Mit2 092 f09_lec23Rahman Hakim
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Pierre Jacob
 
A series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyA series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyFrank Nielsen
 
Seismic data processing lecture 3
Seismic data processing lecture 3Seismic data processing lecture 3
Seismic data processing lecture 3Amin khalil
 

Similar to NIPS2010: optimization algorithms in machine learning (20)

The Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability DistributionThe Multivariate Gaussian Probability Distribution
The Multivariate Gaussian Probability Distribution
 
A new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributionsA new implementation of k-MLE for mixture modelling of Wishart distributions
A new implementation of k-MLE for mixture modelling of Wishart distributions
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...Hierarchical matrices for approximating large covariance matries and computin...
Hierarchical matrices for approximating large covariance matries and computin...
 
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of I...
 
Optimal multi-configuration approximation of an N-fermion wave function
 Optimal multi-configuration approximation of an N-fermion wave function Optimal multi-configuration approximation of an N-fermion wave function
Optimal multi-configuration approximation of an N-fermion wave function
 
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regulariz...
 
On fixed point theorem in fuzzy metric spaces
On fixed point theorem in fuzzy metric spacesOn fixed point theorem in fuzzy metric spaces
On fixed point theorem in fuzzy metric spaces
 
Expert Lecture on GPS at UIET, CSJM, Kanpur
Expert Lecture on GPS at UIET, CSJM, KanpurExpert Lecture on GPS at UIET, CSJM, Kanpur
Expert Lecture on GPS at UIET, CSJM, Kanpur
 
Statistical Physics Assignment Help
Statistical Physics Assignment HelpStatistical Physics Assignment Help
Statistical Physics Assignment Help
 
The Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal FunctionThe Gaussian Hardy-Littlewood Maximal Function
The Gaussian Hardy-Littlewood Maximal Function
 
Least squares support Vector Machine Classifier
Least squares support Vector Machine ClassifierLeast squares support Vector Machine Classifier
Least squares support Vector Machine Classifier
 
Mit2 092 f09_lec23
Mit2 092 f09_lec23Mit2 092 f09_lec23
Mit2 092 f09_lec23
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
1010n3a
1010n3a1010n3a
1010n3a
 
Lecture 9 f17
Lecture 9 f17Lecture 9 f17
Lecture 9 f17
 
A series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropyA series of maximum entropy upper bounds of the differential entropy
A series of maximum entropy upper bounds of the differential entropy
 
Seismic data processing lecture 3
Seismic data processing lecture 3Seismic data processing lecture 3
Seismic data processing lecture 3
 
chapter3.ppt
chapter3.pptchapter3.ppt
chapter3.ppt
 

More from zukun

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVzukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Informationzukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statisticszukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionzukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluationzukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-softwarezukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptorszukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectorszukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-introzukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video searchzukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video searchzukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video searchzukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learningzukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionzukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick startzukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysiszukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structureszukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities zukun
 

More from zukun (20)

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 

Recently uploaded

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitolTechU
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 

Recently uploaded (20)

Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Capitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptxCapitol Tech U Doctoral Presentation - April 2024.pptx
Capitol Tech U Doctoral Presentation - April 2024.pptx
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptx
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 

NIPS2010: optimization algorithms in machine learning

  • 1. Optimization Algorithms in Machine Learning Stephen Wright University of Wisconsin-Madison NIPS Tutorial, 6 Dec 2010 Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 1 / 82
  • 2. Optimization Optimization is going through a period of growth and revitalization, driven largely by new applications in many areas. Standard paradigms (LP, QP, NLP, MIP) are still important, along with general-purpose software, enabled by modeling languages that make the software easier to use. However, there is a growing emphasis on “picking and choosing” algorithmic elements to fit the characteristics of a given application — building up a suitable algorithm from a “toolkit” of components. It’s more important than ever to understand the fundamentals of algorithms as well as the demands of the application, so that good choices are made in matching algorithms to applications. We present a selection of algorithmic fundamentals in this tutorial, with an emphasis on those of current and potential interest in machine learning. Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 2 / 82
  • 3. Topics I. First-order Methods II. Stochastic and Incremental Gradient Methods III. Shrinking/Thresholding for Regularized Formulations IV. Optimal Manifold Identification and Higher-Order Methods. V. Decomposition and Coordinate Relaxation Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 3 / 82
  • 4. I. First-Order Methods min f (x), with smooth convex f . Usually assume 2 µI f (x) LI for all x, with 0 ≤ µ ≤ L. (L is thus a Lipschitz constant on the gradient f .) µ > 0 ⇒ strongly convex. Have 1 f (y ) − f (x) − f (x)T (y − x) ≥ µ y − x 2 . 2 (Mostly assume · := · 2 .) Define conditioning κ := L/µ. Sometimes discuss convex quadratic f : 1 f (x) = x T Ax, where µI A LI . 2 Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 4 / 82
  • 5. What’s the Setup? Assume in this part of talk that we can evaluate f and f at each iterate xi . But we are interested in extending to broader class of problems: nonsmooth f ; f not available; only an estimate of the gradient (or subgradient) is available; impose a constraint x ∈ Ω for some simple set Ω (e.g. ball, box, simplex); a nonsmooth regularization term may be added to the objective f . Focus on algorithms that can be adapted to these circumstances. Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 5 / 82
  • 6. Gradient xk+1 = xk − αk f (xk ), for some αk > 0. Different ways to identify an appropriate αk . 1 Hard: Interpolating scheme with safeguarding to identify an approximate minimizing αk . 2 Easy: Backtracking. α, 1 α, 1 α, 1 α, ... until a sufficient decrease in ¯ 2¯ 4¯ 8¯ f is obtained. 3 Trivial: Don’t test for function decrease. Use rules based on L and µ. Traditional analysis for 1 and 2: Usually yields global convergence at unspecified rate. The “greedy” strategy of getting good decrease from the current search direction is appealing, and may lead to better practical results. Analysis for 3: Focuses on convergence rate, and leads to accelerated multistep methods. Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 6 / 82
  • 7. Constant (Short) Steplength By elementary use of Taylor’s theorem, obtain αk f (xk+1 ) ≤ f (xk ) − αk 1 − L f (xk ) 2 . 2 2 For αk ≡ 1/L, have 1 f (xk ) 2 . f (xk+1 ) ≤ f (xk ) − 2 2L It follows by elementary arguments (see e.g. Nesterov 2004) that 2L x0 − x 2 f (xk+1 ) − f (x ∗ ) ≤ . k +1 The classic 1/k convergence rate! By assuming µ > 0, can set αk ≡ 2/(µ + L) and get a linear (geometric) rate: Much better than sublinear, in the long run 2k 2k L−µ 2 xk − x ∗ 2 ≤ x0 − x ∗ 2 = 1− x0 − x ∗ 2 . L+µ κ+1 Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 7 / 82
  • 8. The 1/k 2 Speed Limit Nesterov (2004) gives a simple example of a smooth function for which no method that generates iterates of the form xk+1 = xk − αk f (xk ) can converge at a rate faster than 1/k 2 , at least for its first n/2 iterations. Note that xk+1 ∈ x0 + span( f (x0 ), f (x1 ), . . . , f (xk )).     2 −1 0 0 ... ... 0 1 −1 2 −1 0 . . . . . . 0 0     A =  0 −1 2 −1 0 . . . 0 , e1 = 0     .. .. .. . .    . . .  . 0 ... 0 −1 2 0 and set f (x) = (1/2)x T Ax − e1 x. The solution has x ∗ (i) = 1 − i/(n + 1). T If we start at x0 = 0, each f (xk ) has nonzeros only in its first k + 2 entries. Hence, xk+1 (i) = 0 for i = k + 3, k + 4, . . . , n. Can show 3L x0 − x ∗ 2 f (xk ) − f ∗ ≥ . 32(k + 1)2 Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 8 / 82
  • 9. Exact minimizing αk : Faster rate? Take αk to be the exact minimizer of f along − f (xk ). Does this yield a better rate of linear convergence? Consider the convex quadratic f (x) = (1/2)x T Ax. (Thus x ∗ = 0 and f (x ∗ ) = 0.) Here κ is the condition number of A. We have f (xk ) = Axk . Exact minimizing αk : xk A2 xk T 1 αk = T A3 x = arg min (xk − αAxk )T A(xk − αAxk ), xk k α 2 1 1 which is in the interval L, µ . Can show that 2k ∗ 2 f (xk ) − f (x ) ≤ 1− [f (x0 ) − f (x ∗ )]. κ+1 No improvement in the linear rate over constant steplength. Stephen Wright (UW-Madison) Optimization in Machine Learning NIPS Tutorial, 6 Dec 2010 9 / 82
• 10. Multistep Methods: Heavy-Ball
Enhance the search direction by including a contribution from the previous step. Consider first constant step lengths:
  $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta(x_k - x_{k-1})$.
Analyze by defining a composite iterate vector:
  $w_k := \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix}$.
Thus $w_{k+1} = B w_k + o(\|w_k\|)$, where
  $B := \begin{bmatrix} -\alpha \nabla^2 f(x^*) + (1+\beta)I & -\beta I \\ I & 0 \end{bmatrix}$.
• 11. $B$ has the same eigenvalues as
  $\begin{bmatrix} -\alpha\Lambda + (1+\beta)I & -\beta I \\ I & 0 \end{bmatrix}$, $\quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$,
where the $\lambda_i$ are the eigenvalues of $\nabla^2 f(x^*)$. Choose $\alpha, \beta$ to explicitly minimize the max eigenvalue of $B$, obtaining
  $\alpha = \frac{4}{L}\,\frac{1}{(1 + 1/\sqrt{\kappa})^2}$, $\quad \beta = \left(1 - \frac{2}{\sqrt{\kappa}+1}\right)^2$.
Leads to linear convergence for $\|x_k - x^*\|$ with rate approximately $1 - \frac{2}{\sqrt{\kappa}+1}$.
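A minimal sketch of the constant-step heavy-ball iteration with the $\alpha, \beta$ choices above, assuming $L$ and $\mu$ are known; the function name and interface are illustrative, and `grad` is any gradient oracle such as the quadratic example sketched earlier.

    import numpy as np

    def heavy_ball(grad, x0, L, mu, iters):
        # x_{k+1} = x_k - alpha*grad(x_k) + beta*(x_k - x_{k-1}),
        # with alpha, beta chosen from L and mu as on the slide.
        kappa = L / mu
        alpha = (4.0 / L) / (1.0 + 1.0 / np.sqrt(kappa)) ** 2
        beta = (1.0 - 2.0 / (np.sqrt(kappa) + 1.0)) ** 2
        x_prev, x = x0.copy(), x0.copy()
        for _ in range(iters):
            x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
        return x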
• 12. Summary: Linear Convergence, Strictly Convex f
Best steepest descent: linear rate approx $(1 - 2/\kappa)$; heavy-ball: linear rate approx $(1 - 2/\sqrt{\kappa})$. Big difference! To reduce $\|x_k - x^*\|$ by a factor $\epsilon$, need $k$ large enough that
  $\left(1 - \frac{2}{\kappa}\right)^k \le \epsilon \;\Leftarrow\; k \ge \frac{\kappa}{2}|\log\epsilon|$  (steepest descent),
  $\left(1 - \frac{2}{\sqrt{\kappa}}\right)^k \le \epsilon \;\Leftarrow\; k \ge \frac{\sqrt{\kappa}}{2}|\log\epsilon|$  (heavy-ball).
A factor of $\sqrt{\kappa}$ difference: e.g. if $\kappa = 100$, need 10 times fewer steps.
• 13. Conjugate Gradient
Basic step is
  $x_{k+1} = x_k + \alpha_k p_k$, $\quad p_k = -\nabla f(x_k) + \gamma_k p_{k-1}$.
We can identify it with heavy-ball by setting $\beta_k = \alpha_k\gamma_k/\alpha_{k-1}$. However, CG can be implemented in a way that doesn't require knowledge (or estimation) of $L$ and $\mu$:
- Choose $\alpha_k$ to (approximately) minimize $f$ along $p_k$;
- Choose $\gamma_k$ by a variety of formulae (Fletcher-Reeves, Polak-Ribiere, etc.), all of which are equivalent if $f$ is convex quadratic, e.g. $\gamma_k = \|\nabla f(x_k)\|^2 / \|\nabla f(x_{k-1})\|^2$.
There is a rich convergence theory for $f$ quadratic, including asymptotic linear convergence with rate approx $1 - 2/\sqrt{\kappa}$. (Like heavy-ball.) See e.g. Chap. 5 of Nocedal & Wright (2006) and refs therein.
• 14. Accelerated First-Order Methods
Accelerate the rate to $1/k^2$ for weakly convex, while retaining the linear rate (related to $\sqrt{\kappa}$) for the strongly convex case. Nesterov (1983, 2004) describes a method that requires $\kappa$:
0: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$.
k: $x_{k+1} \leftarrow y_k - \frac1L\nabla f(y_k)$  (short-step gradient);
   solve for $\alpha_{k+1} \in (0,1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \alpha_{k+1}/\kappa$;
   set $\beta_k = \alpha_k(1 - \alpha_k)/(\alpha_k^2 + \alpha_{k+1})$; set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$.
Still works for weakly convex ($\kappa = \infty$).
FISTA (Beck & Teboulle 2007):
0: Choose $x_0$; set $y_1 = x_0$, $t_1 = 1$;
k: $x_k \leftarrow y_k - \frac1L\nabla f(y_k)$;
   $t_{k+1} \leftarrow \frac12\left(1 + \sqrt{1 + 4t_k^2}\right)$;
   $y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$.
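A minimal sketch of the FISTA recursion above applied to a smooth $f$ with no regularizer, assuming $L$ is known; the function name and interface are illustrative.

    import numpy as np

    def fista_smooth(grad, x0, L, iters):
        # FISTA momentum applied to a smooth f (regularizer absent):
        # x_k = y_k - (1/L) grad(y_k); t and y updated as on the slide.
        x_prev = x0.copy()
        y = x0.copy()
        t = 1.0
        for _ in range(iters):
            x = y - grad(y) / L
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x + ((t - 1.0) / t_next) * (x - x_prev)
            x_prev, t = x, t_next
        return x_prev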
• 15. [Figure: geometry of the accelerated iterates, showing $x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}$.]
• 16. Convergence Results: Nesterov
If $\alpha_0 \ge 1/\sqrt{\kappa}$, have
  $f(x_k) - f(x^*) \le c_1 \min\left( \left(1 - \frac{1}{\sqrt{\kappa}}\right)^k, \; \frac{4L}{(\sqrt{L} + c_2 k)^2} \right)$,
where the constants $c_1$ and $c_2$ depend on $x_0$, $\alpha_0$, $L$. Linear convergence at the "heavy-ball" rate in the strongly convex case, otherwise $1/k^2$. FISTA also achieves the $1/k^2$ rate.
Analysis: not intuitive. Based on bounding the difference between $f$ and a quadratic approximation to it, at $x^*$. The FISTA analysis is 2-3 pages.
• 17. A Non-Monotone Gradient Method: Barzilai-Borwein
(Barzilai & Borwein 1988) BB is a gradient method, but with an unusual choice of $\alpha_k$. Allows $f$ to increase (sometimes dramatically) on some steps.
  $x_{k+1} = x_k - \alpha_k\nabla f(x_k)$, $\quad \alpha_k := \arg\min_\alpha \|s_k - \alpha z_k\|^2$,
where $s_k := x_k - x_{k-1}$, $z_k := \nabla f(x_k) - \nabla f(x_{k-1})$. Explicitly, we have
  $\alpha_k = \frac{s_k^T z_k}{z_k^T z_k}$.
Note that for convex quadratic $f = \tfrac12 x^T A x$, we have
  $\alpha_k = \frac{s_k^T A s_k}{s_k^T A^2 s_k} \in [L^{-1}, \mu^{-1}]$.
Hence, can view BB as a kind of quasi-Newton method, with the Hessian approximated by $\alpha_k^{-1} I$.
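A minimal sketch of the BB iteration with the $s_k^T z_k / z_k^T z_k$ step, without any safeguards; the initial step `alpha0` and the interface are illustrative assumptions.

    import numpy as np

    def barzilai_borwein(grad, x0, alpha0, iters):
        # Nonmonotone BB iteration: alpha_k = s_k^T z_k / z_k^T z_k,
        # with s_k = x_k - x_{k-1}, z_k = grad(x_k) - grad(x_{k-1}).
        x = x0.copy()
        g = grad(x)
        alpha = alpha0
        for _ in range(iters):
            x_new = x - alpha * g
            g_new = grad(x_new)
            s, z = x_new - x, g_new - g
            denom = z @ z
            if denom > 0:
                alpha = (s @ z) / denom
            x, g = x_new, g_new
        return x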
• 18. Comparison: BB vs Greedy Steepest Descent
[Figure only: iterate trajectories of the two methods.]
• 19. Many BB Variants
- Can use $\alpha_k = s_k^T s_k / s_k^T z_k$ in place of $\alpha_k = s_k^T z_k / z_k^T z_k$;
- alternate between these two formulae;
- calculate $\alpha_k$ as above and hold it constant for 2, 3, or 5 successive steps;
- take $\alpha_k$ to be the exact steepest descent step from the previous iteration.
Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in $f$ over the worst of the last 10 iterates.
The original 1988 analysis in BB's paper is nonstandard and illuminating (just for a 2-variable quadratic). In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager, and others.
• 20. Primal-Dual Averaging (see Nesterov 2009)
Basic step:
  $x_{k+1} = \arg\min_x \frac{1}{k+1}\sum_{i=0}^{k}\left[ f(x_i) + \nabla f(x_i)^T(x - x_i) \right] + \frac{\gamma}{\sqrt{k}}\|x - x_0\|^2
           = \arg\min_x \bar g_k^T x + \frac{\gamma}{\sqrt{k}}\|x - x_0\|^2$,
where $\bar g_k := \sum_{i=0}^{k}\nabla f(x_i)/(k+1)$ is the averaged gradient.
- The last term is always centered at the first iterate $x_0$.
- Gradient information is averaged over all steps, with equal weights.
- $\gamma$ is constant; results can be sensitive to this value.
- The approach still works for convex nondifferentiable $f$, where $\nabla f(x_i)$ is replaced by a vector from the subgradient $\partial f(x_i)$.
• 21. Convergence Properties
Nesterov proves convergence for the averaged iterates
  $\bar x_{k+1} = \frac{1}{k+1}\sum_{i=0}^{k} x_i$.
Provided the iterates and the solution $x^*$ lie within some ball of radius $D$ around $x_0$, we have
  $f(\bar x_{k+1}) - f(x^*) \le \frac{C}{\sqrt{k}}$,
where $C$ depends on $D$, a uniform bound on $\|\nabla f(x)\|$, and $\gamma$ (the coefficient of the stabilizing term).
Note: there is averaging in both the primal ($x_i$) and dual ($\nabla f(x_i)$) spaces. Generalizes easily and robustly to the case in which only estimated gradients or subgradients are available. (Averaging smooths the errors in the individual gradient estimates.)
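A minimal sketch of the unconstrained primal-dual averaging step with the Euclidean prox-term. The closed-form solution of the subproblem used here, $x_{k+1} = x_0 - \frac{\sqrt{k}}{2\gamma}\bar g_k$, follows from setting the gradient of the subproblem objective to zero; the function name and interface are illustrative.

    import numpy as np

    def dual_averaging(grad, x0, gamma, iters):
        # Average all (sub)gradients seen so far, then re-solve
        #   min_x  gbar^T x + (gamma/sqrt(k)) ||x - x0||^2,
        # whose closed-form minimizer is  x0 - sqrt(k)/(2*gamma) * gbar.
        x = x0.copy()
        gsum = np.zeros_like(x0)
        xsum = np.zeros_like(x0)
        for k in range(1, iters + 1):
            gsum += grad(x)
            gbar = gsum / k
            x = x0 - (np.sqrt(k) / (2.0 * gamma)) * gbar
            xsum += x
        return xsum / iters      # averaged iterate, as in the convergence result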
• 22. Extending to the Constrained Case: $x \in \Omega$
How do these methods change when we require $x \in \Omega$, with $\Omega$ closed and convex? Some algorithms and theory stay much the same, provided we can involve $\Omega$ explicitly in the subproblems.
Example: primal-dual averaging for $\min_{x\in\Omega} f(x)$:
  $x_{k+1} = \arg\min_{x\in\Omega} \bar g_k^T x + \frac{\gamma}{\sqrt{k}}\|x - x_0\|^2$,
where $\bar g_k := \sum_{i=0}^{k}\nabla f(x_i)/(k+1)$. When $\Omega$ is a box, this subproblem is easy to solve.
• 23. Example: Nesterov's Constant Step Scheme for $\min_{x\in\Omega} f(x)$
Requires only one calculation to be changed from the unconstrained version.
0: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$, $q \leftarrow 1/\kappa = \mu/L$.
k: $x_{k+1} \leftarrow \arg\min_{y\in\Omega} \frac12\left\| y - \left[y_k - \frac1L\nabla f(y_k)\right] \right\|^2$;
   solve for $\alpha_{k+1} \in (0,1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + q\,\alpha_{k+1}$;
   set $\beta_k = \alpha_k(1 - \alpha_k)/(\alpha_k^2 + \alpha_{k+1})$; set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$.
Convergence theory is unchanged.
• 24. Regularized Optimization (More Later)
FISTA can be applied with minimal changes to the regularized problem
  $\min_x f(x) + \tau\psi(x)$,
where $f$ is convex and smooth, $\psi$ convex and "simple" but usually nonsmooth, and $\tau$ is a positive parameter. Simply replace the gradient step by
  $x_k = \arg\min_x \frac{L}{2}\left\| x - \left[ y_k - \frac1L\nabla f(y_k) \right] \right\|^2 + \tau\psi(x)$.
(This is the "shrinkage" step; when $\psi \equiv 0$ or $\psi = \|\cdot\|_1$, it can be solved cheaply.) More on this later.
• 25. Further Reading
1. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, 2004.
2. A. Beck and M. Teboulle, "Gradient-based methods with application to signal recovery problems," in press, 2010. (See Teboulle's web site.)
3. B. T. Polyak, Introduction to Optimization, Optimization Software Inc., 1987.
4. J. Barzilai and J. M. Borwein, "Two-point step size gradient methods," IMA Journal of Numerical Analysis, 8, pp. 141-148, 1988.
5. Y. Nesterov, "Primal-dual subgradient methods for convex programs," Mathematical Programming, Series B, 120, pp. 221-259, 2009.
• 26. II. Stochastic and Incremental Gradient Methods
Still deal with (weakly or strongly) convex $f$. But change the rules:
- Allow $f$ nonsmooth.
- Can't get function values $f(x)$.
- At any feasible $x$, have access only to an unbiased estimate of an element of the subgradient $\partial f$.
A common setting is $f(x) = E_\xi F(x, \xi)$, where $\xi$ is a random vector with distribution $P$ over a set $\Xi$. Also the special case
  $f(x) = \sum_{i=1}^{m} f_i(x)$,
where each $f_i$ is convex and nonsmooth.
• 27. Applications
This setting is useful for machine learning formulations. Given data $x_i \in R^n$ and labels $y_i = \pm 1$, $i = 1, 2, \dots, m$, find $w$ that minimizes
  $\tau\psi(w) + \sum_{i=1}^{m} \ell(w; x_i, y_i)$,
where $\psi$ is a regularizer, $\tau > 0$ is a parameter, and $\ell$ is a loss. For linear classifiers/regressors, the loss has the specific form $\ell(w^T x_i, y_i)$.
Example: SVM with hinge loss $\ell(w^T x_i, y_i) = \max(1 - y_i(w^T x_i), 0)$ and $\psi = \|\cdot\|_1$ or $\psi = \|\cdot\|_2^2$.
Example: logistic regression: $\ell(w^T x_i, y_i) = \log(1 + \exp(y_i w^T x_i))$. In the regularized version, may have $\psi(w) = \|w\|_1$.
• 28. Subgradients
For each $x$ in the domain of $f$, $g$ is a subgradient of $f$ at $x$ if
  $f(z) \ge f(x) + g^T(z - x)$, for all $z \in \mathrm{dom}\, f$.
The right-hand side is a supporting hyperplane. The set of subgradients is called the subdifferential, denoted by $\partial f(x)$. When $f$ is differentiable at $x$, have $\partial f(x) = \{\nabla f(x)\}$.
We have strong convexity with modulus $\mu > 0$ if
  $f(z) \ge f(x) + g^T(z - x) + \frac12\mu\|z - x\|^2$, for all $x, z \in \mathrm{dom}\, f$ with $g \in \partial f(x)$.
Generalizes the assumption $\nabla^2 f(x) \succeq \mu I$ made earlier for smooth functions.
• 29. [Figure: a convex function $f$ and supporting hyperplanes at a point $x$.]
• 30. "Classical" Stochastic Approximation
Denote by $G(x, \xi)$ the subgradient estimate generated at $x$. For unbiasedness need $E_\xi G(x, \xi) \in \partial f(x)$.
Basic SA Scheme: at iteration $k$, choose $\xi_k$ i.i.d. according to distribution $P$, choose some $\alpha_k > 0$, and set
  $x_{k+1} = x_k - \alpha_k G(x_k, \xi_k)$.
Note that $x_{k+1}$ depends on all random variables up to iteration $k$, i.e. $\xi_{[k]} := \{\xi_1, \xi_2, \dots, \xi_k\}$.
When $f$ is strongly convex, the analysis of convergence of $E(\|x_k - x^*\|^2)$ is fairly elementary; see Nemirovski et al. (2009).
• 31. Rate: 1/k
Define $a_k = \frac12 E(\|x_k - x^*\|^2)$. Assume there is $M > 0$ such that $E(\|G(x, \xi)\|^2) \le M^2$ for all $x$ of interest. Thus
  $\frac12\|x_{k+1} - x^*\|^2 = \frac12\|x_k - \alpha_k G(x_k, \xi_k) - x^*\|^2
   = \frac12\|x_k - x^*\|^2 - \alpha_k (x_k - x^*)^T G(x_k, \xi_k) + \frac12\alpha_k^2\|G(x_k, \xi_k)\|^2$.
Taking expectations, get
  $a_{k+1} \le a_k - \alpha_k E[(x_k - x^*)^T G(x_k, \xi_k)] + \frac12\alpha_k^2 M^2$.
For the middle term, have
  $E[(x_k - x^*)^T G(x_k, \xi_k)] = E_{\xi_{[k-1]}}\, E_{\xi_k}\left[ (x_k - x^*)^T G(x_k, \xi_k) \mid \xi_{[k-1]} \right] = E_{\xi_{[k-1]}}\, (x_k - x^*)^T g_k$, ...
• 32. ... where $g_k := E_{\xi_k}[G(x_k, \xi_k) \mid \xi_{[k-1]}] \in \partial f(x_k)$. By strong convexity, have
  $(x_k - x^*)^T g_k \ge f(x_k) - f(x^*) + \frac12\mu\|x_k - x^*\|^2 \ge \mu\|x_k - x^*\|^2$.
Hence by taking expectations, we get $E[(x_k - x^*)^T g_k] \ge 2\mu a_k$. Then, substituting above, we obtain
  $a_{k+1} \le (1 - 2\mu\alpha_k) a_k + \frac12\alpha_k^2 M^2$.
When $\alpha_k \equiv \frac{1}{k\mu}$, a neat inductive argument (exercise!) reveals the 1/k rate:
  $a_k \le \frac{Q}{2k}$, for $Q := \max\left( \|x_1 - x^*\|^2, \frac{M^2}{\mu^2} \right)$.
• 33. But...
What if we don't know $\mu$? Or if $\mu = 0$? The choice $\alpha_k = 1/(k\mu)$ requires strong convexity, with knowledge of the modulus $\mu$. An underestimate of $\mu$ can greatly degrade the performance of the method (see the example in Nemirovski et al. 2009).
Now describe a Robust Stochastic Approximation approach, which has a rate $1/\sqrt{k}$ (in function value convergence), works for weakly convex nonsmooth functions, and is not sensitive to the choice of parameters in the step length. This is the approach that generalizes to mirror descent.
• 34. Robust SA
At iteration $k$: set $x_{k+1} = x_k - \alpha_k G(x_k, \xi_k)$ as before; set
  $\bar x_k = \frac{\sum_{i=1}^{k}\alpha_i x_i}{\sum_{i=1}^{k}\alpha_i}$.
For any $\theta > 0$ (not critical), choose step lengths to be
  $\alpha_k = \frac{\theta}{M\sqrt{k}}$.
Then $f(\bar x_k)$ converges to $f(x^*)$ in expectation with rate approximately $(\log k)/k^{1/2}$. The choice of $\theta$ is not critical.
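A minimal sketch of the robust SA recursion with $\alpha_k = \theta/(M\sqrt{k})$ and step-weighted averaging. Here `G(x, rng)` is an assumed stochastic (sub)gradient oracle, and the interface is illustrative.

    import numpy as np

    def robust_sa(G, x0, M, theta, iters, rng):
        # x_{k+1} = x_k - alpha_k G(x_k, xi_k),  alpha_k = theta / (M sqrt(k)),
        # returning the alpha-weighted average xbar_k of the iterates.
        x = x0.copy()
        weighted_sum = np.zeros_like(x0)
        weight_total = 0.0
        for k in range(1, iters + 1):
            alpha = theta / (M * np.sqrt(k))
            x = x - alpha * G(x, rng)
            weighted_sum += alpha * x
            weight_total += alpha
        return weighted_sum / weight_total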
• 35. Analysis of Robust SA
The analysis is again elementary. As above (using $i$ instead of $k$), have:
  $\alpha_i E[(x_i - x^*)^T g_i] \le a_i - a_{i+1} + \frac12\alpha_i^2 M^2$.
By convexity of $f$, and $g_i \in \partial f(x_i)$: $f(x^*) \ge f(x_i) + g_i^T(x^* - x_i)$, thus
  $\alpha_i E[f(x_i) - f(x^*)] \le a_i - a_{i+1} + \frac12\alpha_i^2 M^2$,
so by summing over iterates $i = 1, 2, \dots, k$, telescoping, and using $a_{k+1} > 0$:
  $\sum_{i=1}^{k}\alpha_i E[f(x_i) - f(x^*)] \le a_1 + \frac12 M^2\sum_{i=1}^{k}\alpha_i^2$.
• 36. Thus, dividing by $\sum_{i=1}^{k}\alpha_i$:
  $E\left[ \frac{\sum_{i=1}^{k}\alpha_i f(x_i)}{\sum_{i=1}^{k}\alpha_i} - f(x^*) \right] \le \frac{a_1 + \frac12 M^2\sum_{i=1}^{k}\alpha_i^2}{\sum_{i=1}^{k}\alpha_i}$.
By convexity, we have
  $f(\bar x_k) \le \frac{\sum_{i=1}^{k}\alpha_i f(x_i)}{\sum_{i=1}^{k}\alpha_i}$,
so we obtain the fundamental bound:
  $E[f(\bar x_k) - f(x^*)] \le \frac{a_1 + \frac12 M^2\sum_{i=1}^{k}\alpha_i^2}{\sum_{i=1}^{k}\alpha_i}$.
• 37. By substituting $\alpha_i = \frac{\theta}{M\sqrt{i}}$, we obtain
  $E[f(\bar x_k) - f(x^*)] \le \frac{a_1 + \frac12\theta^2\sum_{i=1}^{k}\frac1i}{\frac{\theta}{M}\sum_{i=1}^{k}\frac{1}{\sqrt{i}}}
   \le \frac{a_1 + \theta^2\log(k+1)}{\frac{\theta}{M}\sqrt{k}}
   = M\left( \frac{a_1}{\theta} + \theta\log(k+1) \right) k^{-1/2}$.
That's it!
Other variants: constant stepsizes $\alpha_k$ for a fixed "budget" of iterations; periodic restarting; averaging just over the recent iterates. All can be analyzed with the basic bound above.
• 38. Mirror Descent
The step from $x_k$ to $x_{k+1}$ can be viewed as the solution of a subproblem:
  $x_{k+1} = \arg\min_z G(x_k, \xi_k)^T(z - x_k) + \frac{1}{2\alpha_k}\|z - x_k\|_2^2$,
a linear estimate of $f$ plus a prox-term. This provides a route to handling constrained problems, regularized problems, and alternative prox-functions.
For the constrained problem $\min_{x\in\Omega} f(x)$, simply add the restriction $z \in \Omega$ to the subproblem above. In some cases (e.g. when $\Omega$ is a box), the subproblem is still easy to solve.
We may use other prox-functions in place of $\frac12\|z - x\|_2^2$ above. Such alternatives may be particularly well suited to particular constraint sets $\Omega$. Mirror Descent is the term used for such generalizations of the SA approaches above.
• 39. Mirror Descent cont'd
Given the constraint set $\Omega$, choose a norm $\|\cdot\|$ (not necessarily Euclidean). Define the distance-generating function $\omega$ to be a strongly convex function on $\Omega$ with modulus 1 with respect to $\|\cdot\|$, that is,
  $(\omega'(x) - \omega'(z))^T(x - z) \ge \|x - z\|^2$, for all $x, z \in \Omega$,
where $\omega'(\cdot)$ denotes an element of the subdifferential. Now define the prox-function $V(x, z)$ as follows:
  $V(x, z) = \omega(z) - \omega(x) - \omega'(x)^T(z - x)$.
This is also known as the Bregman distance. We can use it in the subproblem in place of $\frac12\|\cdot\|^2$:
  $x_{k+1} = \arg\min_{z\in\Omega} G(x_k, \xi_k)^T(z - x_k) + \frac{1}{\alpha_k} V(x_k, z)$.
• 40. Bregman distance is the deviation from linearity.
[Figure: the function $\omega$, its linearization at $x$, and the gap $V(x, z)$ at a point $z$.]
• 41. Bregman Distances: Examples
For any $\Omega$, we can use $\omega(x) := \frac12\|x - \bar x\|_2^2$, leading to the prox-function $V(x, z) = \frac12\|x - z\|_2^2$.
For the simplex $\Omega = \{x \in R^n : x \ge 0, \sum_{i=1}^{n} x_i = 1\}$, we can use instead the 1-norm $\|\cdot\|_1$ and choose $\omega$ to be the entropy function
  $\omega(x) = \sum_{i=1}^{n} x_i\log x_i$,
leading to the Bregman distance
  $V(x, z) = \sum_{i=1}^{n} z_i\log(z_i/x_i)$.
These are the two most useful cases. Convergence results for SA can be generalized to mirror descent.
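A minimal sketch of mirror descent on the simplex with the entropy prox above. With that choice, the subproblem has the standard multiplicative closed form $x_{k+1} \propto x_k \exp(-\alpha_k g_k)$; `G(x, rng)` is an assumed stochastic (sub)gradient oracle and `steps` an assumed sequence of step lengths.

    import numpy as np

    def entropic_mirror_descent(G, x0, steps, iters, rng):
        # Mirror descent on the simplex with the entropy prox-function:
        # the subproblem reduces to a multiplicative update followed by
        # renormalization back onto the simplex.
        x = x0.copy()
        for k in range(iters):
            g = G(x, rng)                      # stochastic (sub)gradient estimate
            x = x * np.exp(-steps[k] * g)
            x = x / x.sum()
        return x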
• 42. Incremental Gradient
(See e.g. Bertsekas (2011) and references therein.) Finite sums:
  $f(x) = \sum_{i=1}^{m} f_i(x)$.
Step $k$ typically requires choice of one index $i_k \in \{1, 2, \dots, m\}$ and evaluation of $\nabla f_{i_k}(x_k)$. Components $i_k$ are selected sometimes randomly, sometimes cyclically. (The latter option does not exist in the setting $f(x) := E_\xi F(x; \xi)$.)
There are incremental versions of the heavy-ball method:
  $x_{k+1} = x_k - \alpha_k\nabla f_{i_k}(x_k) + \beta(x_k - x_{k-1})$.
An approach like dual averaging: assume a cyclic choice of $i_k$, and approximate $\nabla f(x_k)$ by the average of the $\nabla f_i(x)$ over the last $m$ iterates:
  $x_{k+1} = x_k - \frac{\alpha_k}{m}\sum_{l=1}^{m}\nabla f_{i_{k-l+1}}(x_{k-l+1})$.
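A minimal sketch of the basic incremental step for a finite sum, with cyclic or random index selection; `grads` is an assumed list of per-component gradient functions, and the constant step and interface are illustrative (when `cyclic=False`, a NumPy random generator must be supplied).

    import numpy as np

    def incremental_gradient(grads, x0, alpha, epochs, cyclic=True, rng=None):
        # One component gradient per step:
        #   x_{k+1} = x_k - alpha * grad_{i_k}(x_k),
        # with i_k chosen cyclically or uniformly at random.
        m = len(grads)
        x = x0.copy()
        for _ in range(epochs):
            order = range(m) if cyclic else rng.integers(0, m, size=m)
            for i in order:
                x = x - alpha * grads[i](x)
        return x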
• 43. Achievable Accuracy
Consider the basic incremental method: $x_{k+1} = x_k - \alpha_k\nabla f_{i_k}(x_k)$. How close can $f(x_k)$ come to $f(x^*)$, deterministically (not just in expectation)? Bertsekas (2011) obtains results for constant steps $\alpha_k \equiv \alpha$:
- cyclic choice of $i_k$: $\liminf_{k\to\infty} f(x_k) \le f(x^*) + \alpha\beta m^2 c^2$;
- random choice of $i_k$: $\liminf_{k\to\infty} f(x_k) \le f(x^*) + \alpha\beta m c^2$;
where $\beta$ is close to 1 and $c$ is a bound on the Lipschitz constants for the $f_i$. (Bertsekas actually proves these results in the more general context of regularized optimization; see below.)
• 44. Applications to SVM
SA techniques have an obvious application to linear SVM classification. In fact, they were proposed in this context and analyzed independently by researchers in the ML community for some years.
Codes: SGD (Bottou), PEGASOS (Shalev-Shwartz et al., 2007).
Tutorial: "Stochastic Optimization for Machine Learning," tutorial by N. Srebro and A. Tewari, ICML 2010, for many more details on the connections between stochastic optimization and machine learning.
Related work: Zinkevich (ICML, 2003) on online convex programming. Aims to minimize the average of a sequence of convex functions, presented sequentially. No i.i.d. assumption; regret-based analysis. Take steplengths of size $O(k^{-1/2})$ in the gradient $\nabla f_k(x_k)$ of the latest convex function. Average regret is $O(k^{-1/2})$.
• 45. Further Reading
1. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization, 19, pp. 1574-1609, 2009.
2. D. P. Bertsekas, "Incremental gradient, subgradient, and proximal methods for convex optimization: A survey," Chapter 4 in Optimization and Machine Learning, upcoming volume edited by S. Nowozin, S. Sra, and S. J. Wright (2011).
3. A. Juditsky and A. Nemirovski, "First-order methods for nonsmooth convex large-scale optimization. I: General-purpose methods," Chapter 5 in Optimization and Machine Learning (2011).
4. A. Juditsky and A. Nemirovski, "First-order methods for nonsmooth convex large-scale optimization. II: Utilizing problem structure," Chapter 6 in Optimization and Machine Learning (2011).
5. O. L. Mangasarian and M. Solodov, "Serial and parallel backpropagation convergence via nonmonotone perturbed minimization," Optimization Methods and Software, 4, pp. 103-116, 1994.
6. D. Blatt, A. O. Hero, and H. Gauchman, "A convergent incremental gradient method with constant step size," SIAM Journal on Optimization, 18, pp. 29-51, 2008.
• 46. III. Shrinking/Thresholding for Regularized Optimization
In many applications, we seek not an exact minimizer of the underlying objective, but rather an approximate minimizer that satisfies certain desirable properties: sparsity (few nonzeros); low rank (if a matrix); low "total variation"; generalizability. (Vapnik: "...tradeoff between the quality of the approximation of the given data and the complexity of the approximating function.")
"Desirable" properties depend on context and application. A common way to obtain structured solutions is to modify the objective $f$ by adding a regularizer $\tau\psi(x)$, for some parameter $\tau > 0$:
  $\min_x f(x) + \tau\psi(x)$.
Often want to solve for a range of $\tau$ values, not just one value in isolation.
• 47. Basics of Shrinking
The regularizer $\psi$ is often nonsmooth but "simple." The shrinking / thresholding approach (a.k.a. forward-backward splitting) is useful if the problem is easy to solve when $f$ is replaced by a quadratic with diagonal Hessian:
  $\min_z g^T(z - x) + \frac{1}{2\alpha}\|z - x\|_2^2 + \tau\psi(z)$.
Equivalently,
  $\min_z \frac{1}{2\alpha}\|z - (x - \alpha g)\|_2^2 + \tau\psi(z)$.
Define the shrinking operator as the arg min:
  $S_\tau(y, \alpha) := \arg\min_z \frac{1}{2\alpha}\|z - y\|_2^2 + \tau\psi(z)$.
Typical algorithm: $x_{k+1} = S_\tau(x_k - \alpha_k g_k, \alpha_k)$, with for example $g_k = \nabla f(x_k)$.
• 48. "Practical" Instances of ψ
Cases for which the subproblem is simple:
- $\psi(z) = \|z\|_1$. Thus $S_\tau(y, \alpha) = \mathrm{sign}(y)\max(|y| - \alpha\tau, 0)$ (componentwise). When $y$ is complex, have
  $S_\tau(y, \alpha) = \frac{\max(|y| - \tau\alpha, 0)}{\max(|y| - \tau\alpha, 0) + \tau\alpha}\, y$.
- $\psi(z) = \sum_{g\in\mathcal{G}} \|z_{[g]}\|_2$ or $\psi(z) = \sum_{g\in\mathcal{G}} \|z_{[g]}\|_\infty$, where the $z_{[g]}$, $g \in \mathcal{G}$, are non-overlapping subvectors of $z$. Here
  $S_\tau(y, \alpha)_{[g]} = \frac{\max(\|y_{[g]}\| - \tau\alpha, 0)}{\max(\|y_{[g]}\| - \tau\alpha, 0) + \tau\alpha}\, y_{[g]}$.
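A minimal sketch of the two shrink operators above for real vectors: componentwise soft-thresholding for the $\ell_1$ regularizer, and a blockwise shrink for disjoint groups with the $\ell_2$ group norm (written in the equivalent "scale the block by $\max(1 - \tau\alpha/\|y_{[g]}\|, 0)$" form). Function names and the group representation are illustrative assumptions.

    import numpy as np

    def shrink_l1(y, alpha, tau):
        # S_tau(y, alpha) for psi = ||.||_1: componentwise soft-thresholding.
        return np.sign(y) * np.maximum(np.abs(y) - alpha * tau, 0.0)

    def shrink_group_l2(y, groups, alpha, tau):
        # Blockwise shrink for psi(z) = sum_g ||z_[g]||_2 over disjoint groups.
        # Each element of `groups` is an index array for one group.
        z = np.zeros_like(y)
        for g in groups:
            norm_g = np.linalg.norm(y[g])
            if norm_g > alpha * tau:
                z[g] = (1.0 - alpha * tau / norm_g) * y[g]
        return z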
• 49. $Z$ is a matrix and $\psi(Z) = \|Z\|_*$ is the nuclear norm of $Z$: the sum of singular values. The threshold operator is
  $S_\tau(Y, \alpha) := \arg\min_Z \frac{1}{2\alpha}\|Z - Y\|_F^2 + \tau\|Z\|_*$,
with the solution obtained from the SVD $Y = U\Sigma V^T$, with $U, V$ orthonormal and $\Sigma = \mathrm{diag}(\sigma_i)_{i=1,2,\dots,m}$. Setting
  $\tilde\Sigma = \mathrm{diag}(\max(\sigma_i - \tau\alpha, 0))_{i=1,2,\dots,m}$,
the solution is $S_\tau(Y, \alpha) = U\tilde\Sigma V^T$.
(Actually not cheap to compute, but in some cases (e.g. $\sigma_i - \tau\alpha < 0$ for most $i$) approximate solutions can be found in reasonable time.)
• 50. Connections
The thresholding operator generalizes:
- Gradient methods for unconstrained minimization. Here $\psi \equiv 0$ and $S_\tau(y, \alpha) = y$.
- Projected gradient for $\min_{x\in\Omega} f(x)$ with $\Omega$ closed and convex. Here $\psi$ is the indicator function for $\Omega$ (zero on $\Omega$, $\infty$ elsewhere), and $S_\tau(y, \alpha) = P_\Omega(y)$, where $P_\Omega$ is projection onto $\Omega$.
• 51. Applications
LASSO for variable selection. Originally stated as
  $\min_x \frac12\|Ax - b\|_2^2$ such that $\|x\|_1 \le T$,
for parameter $T > 0$. Equivalent to an "$\ell_2$-$\ell_1$" formulation:
  $\min_x \frac12\|Ax - b\|_2^2 + \tau\|x\|_1$, for some $\tau > 0$.
Group LASSO for selection of variable "groups":
  $\min_x \frac12\|Ax - b\|_2^2 + \sum_{g\in\mathcal{G}} \|x_{[g]}\|_2$,
with each $[g]$ a subset of the indices $\{1, 2, \dots, n\}$. When the groups $[g]$ are disjoint, it is easy to solve the subproblem. Still true if $\|\cdot\|_2$ is replaced by $\|\cdot\|_\infty$. When groups overlap, can replicate variables, to have one copy of each variable in each group, and thus reformulate as non-overlapping.
• 52. Compressed Sensing
Sparse signal recovery from noisy measurements. Given a matrix $A$ (with more columns than rows) and an observation vector $b$, seek a sparse $x$ (i.e. few nonzeros) such that $Ax \approx b$. Solve
  $\min_x \frac12\|Ax - b\|_2^2 + \tau\|x\|_1$.
Under "restricted isometry" properties on $A$ ("tall, thin" column submatrices are nearly orthonormal), $\|x\|_1$ is a good surrogate for $\mathrm{card}(x)$.
Assume that $A$ is not stored explicitly, but matrix-vector multiplications are available. Hence can compute $f$ and $\nabla f$. Often need the solution for a range of $\tau$ values.
• 53. $\ell_1$-Regularized Logistic Regression
Feature vectors $x_i$, $i = 1, 2, \dots, m$, with labels $\pm 1$. Seek an odds function parametrized by $w \in R^n$:
  $p_+(x; w) := (1 + e^{w^T x})^{-1}$, $\quad p_-(x; w) := 1 - p_+(x; w)$.
The scaled, negative log likelihood function $L(w)$ is
  $L(w) = -\frac1m\left[ \sum_{y_i=-1}\log p_-(x_i; w) + \sum_{y_i=1}\log p_+(x_i; w) \right]
        = -\frac1m\left[ \sum_{y_i=-1} w^T x_i - \sum_{i=1}^{m}\log(1 + e^{w^T x_i}) \right]$.
To get a sparse $w$ (i.e. classify on the basis of a few features) solve:
  $\min_w L(w) + \lambda\|w\|_1$.
• 54. Matrix Completion
Seek a matrix $X \in R^{m\times n}$ with low rank that matches certain observations, possibly noisy:
  $\min_X \frac12\|\mathcal{A}(X) - b\|_2^2 + \tau\psi(X)$,
where $\mathcal{A}(X)$ is a linear mapping of the components of $X$ (e.g. element-wise observations).
Can have $\psi$ as the nuclear norm; see the discussion above for solution of subproblems via SVD.
At NIPS 2010: "Practical Large-Scale Optimization for Max-Norm Regularization" by Lee et al. discusses $\psi(X) = \|X\|_{\max}$:
  $\|X\|_{\max} := \inf\{ \|U\|_{2,\infty}\|V\|_{2,\infty} \mid X = UV^T \}$,
where $\|U\|_{2,\infty}$ is the maximum $\ell_2$ norm of a row of $U$. The shrinking operation can be solved efficiently using the "squash" operator.
• 55. Basic Algorithm
(Fukushima and Mine, 1981) for solving $\min_x f(x) + \tau\psi(x)$.
0: Choose $x_0$.
k: Choose $\alpha_k > 0$ and set
  $x_{k+1} = S_\tau(x_k - \alpha_k\nabla f(x_k); \alpha_k)
           = \arg\min_z \nabla f(x_k)^T(z - x_k) + \frac{1}{2\alpha_k}\|z - x_k\|_2^2 + \tau\psi(z)$.
Straightforward, but can be fast when the regularization is strong (i.e. the solution is "highly constrained"). Can show convergence for steps $\alpha_k \in (0, 2/L)$, where $L$ is the bound on $\nabla^2 f$. (Like a short-step gradient method.)
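A minimal sketch of this basic iteration for the $\ell_2$-$\ell_1$ problem $\min_x \frac12\|Ax-b\|_2^2 + \tau\|x\|_1$, with a constant step that is assumed to satisfy $\alpha \in (0, 2/L)$ for $L = \|A^T A\|$; the function name and interface are illustrative.

    import numpy as np

    def ista_l1(A, b, tau, alpha, iters, x0=None):
        # Basic shrinking iteration for  min_x 0.5||Ax - b||^2 + tau*||x||_1 :
        # gradient step on the smooth part, then componentwise soft-threshold.
        x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
        for _ in range(iters):
            g = A.T @ (A @ x - b)
            y = x - alpha * g
            x = np.sign(y) * np.maximum(np.abs(y) - alpha * tau, 0.0)
        return x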
• 56. Enhancements
Alternatively, since $\alpha_k$ plays the role of a steplength, can adjust it to get better performance and guaranteed convergence.
- "Backtracking": decrease $\alpha_k$ until a sufficient decrease condition holds.
- Use Barzilai-Borwein strategies to get nonmonotonic methods. By enforcing sufficient decrease every 10 iterations (say), still get global convergence.
- The approach can be accelerated using optimal gradient techniques. See the earlier discussion of FISTA, where we solve the shrinking problem with $\alpha_k = 1/L$ in place of a step along $-\nabla f$ with this steplength.
Note that these methods reduce ultimately to gradient methods on a reduced space: the optimal manifold defined by the regularizer $\psi$. Acceleration or higher-order information can help improve performance.
• 57. Continuation in τ
Performance of basic shrinking methods is quite sensitive to $\tau$. Typically higher $\tau$ ⇒ stronger regularization ⇒ the optimal manifold has lower dimension. Hence, it's easier to identify the optimal manifold, and basic shrinking methods can sometimes do so quickly. For smaller $\tau$, a simple "continuation" strategy can help:
0: Given target value $\tau_f$, choose initial $\tau_0 > \tau_f$, starting point $\bar x$, and factor $\sigma \in (0, 1)$.
k: Find an approx solution $x(\tau_k)$ of $\min_x f(x) + \tau_k\psi(x)$, starting from $\bar x$;
   if $\tau_k = \tau_f$ then STOP;
   set $\tau_{k+1} \leftarrow \max(\tau_f, \sigma\tau_k)$ and $\bar x \leftarrow x(\tau_k)$.
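A minimal sketch of this continuation loop, written around a generic warm-started inner solver; `solve_one`, `sigma`, and the termination rule are illustrative assumptions.

    def continuation(solve_one, tau_final, tau0, x0, sigma=0.5):
        # Solve a decreasing sequence of tau values, warm-starting each solve
        # from the previous approximate solution x(tau_k).
        # solve_one(tau, x_start) is any inner solver for min f + tau*psi.
        x = x0.copy()
        tau = tau0
        while True:
            x = solve_one(tau, x)
            if tau <= tau_final:
                return x
            tau = max(tau_final, sigma * tau)

For example, one could pass `solve_one = lambda tau, x: ista_l1(A, b, tau, alpha, 200, x0=x)` using the basic shrinking sketch above.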
• 58. The solution $x(\tau)$ is often desired on a range of $\tau$ values anyway, so efforts for larger $\tau$ are not wasted.
Accelerated methods such as FISTA are less sensitive to the "small $\tau$" issue.
Not much analysis of this approach has been done. Better heuristics and theoretical support are needed.
• 59. Stochastic Gradient + Regularization
Solve the regularized problem, but with only estimates of $\nabla f(x_k)$ available. We can combine dual averaging, stochastic gradient, and shrinking: see Xiao (2010).
  $\min_x \phi_\tau(x) := E_\xi f(x; \xi) + \tau\psi(x)$.
At iteration $k$ choose $\xi_k$ randomly and i.i.d. from the $\xi$ distribution, and choose $g_k \in \partial f(x_k; \xi_k)$. Use these to define the averaged subgradient $\bar g_k = \sum_{i=1}^{k} g_i/(k+1)$, and solve the subproblem
  $x_{k+1} = \arg\min_x \bar g_k^T x + \tau\psi(x) + \frac{\gamma}{\sqrt{k}}\|x - x_0\|^2$.
Same as earlier, but with the regularizer $\psi$ included explicitly. Can prove convergence results for the averaged iterates $\bar x_k$: roughly
  $E[\phi_\tau(\bar x_k)] - \phi_\tau^* \le \frac{C}{\sqrt{k}}$,
where the expectation of $\phi_\tau$ is taken over the random number stream $\xi_0, \xi_1, \dots, \xi_{k-1}$.
• 60. Further Reading
1. F. Bach, "Sparse methods for machine learning: Theory and algorithms," Tutorial at NIPS 2009. (Slides on web.)
2. M. Fukushima and H. Mine, "A generalized proximal point algorithm for certain non-convex minimization problems," International Journal of Systems Science, 12, pp. 989-1000, 1981.
3. P. L. Combettes and V. R. Wajs, "Signal recovery by proximal forward-backward splitting," Multiscale Modeling and Simulation, 4, pp. 1168-1200, 2005.
4. S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Transactions on Signal Processing, 57, pp. 2479-2493, 2009.
5. E. Candès, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, 59, pp. 1207-1223, 2006.
6. L. Xiao, "Dual averaging methods for regularized stochastic learning and online optimization," Tech. Report MSR-TR-2010-23, Microsoft Research, March 2010.
• 61. IV. Optimal Manifold Identification
When constraints $x \in \Omega$ or a nonsmooth regularizer $\psi(x)$ are present, identification of the manifold on which $x^*$ lies can improve algorithm performance, by focusing attention on a reduced space. We can thus evaluate partial gradients and Hessians, restricted to just this space.
For a nonsmooth regularizer $\psi$, the active manifold is a smooth surface passing through $x^*$ along which $\psi$ is smooth.
Example: for $\psi(x) = \|x\|_1$, have the manifold consisting of $z$ with
  $z_i \ge 0$ if $x_i^* > 0$; $\quad z_i \le 0$ if $x_i^* < 0$; $\quad z_i = 0$ if $x_i^* = 0$.
• 62. For a polyhedral $\Omega$, the active manifold is the face on which $x^*$ lies.
Example: for $\Omega = [0,1]^n$, the active manifold consists of $z$ with
  $z_i = 1$ if $x_i^* = 1$; $\quad z_i = 0$ if $x_i^* = 0$; $\quad z_i \in [0,1]$ if $x_i^* \in (0,1)$.
[Figure: the manifold $\mathcal{M}$ through $x^*$ on the boundary of $\Omega$.] Can parametrize $\mathcal{M}$ with a single variable.
• 63. Identification
Algorithms of shrinking / gradient projection type can identify the optimal manifold $\mathcal{M}$, or a good approximation to it, without knowing $x^*$. HOW?
Optimality conditions for the problem $\min_{x\in\Omega} f(x)$ are
  $x^* - \nabla f(x^*) \in x^* + N_\Omega(x^*)$,
where $N_\Omega(x)$ is the normal cone to $\Omega$ at $x$:
  $N_\Omega(x) = \{ s \mid s^T(z - x) \le 0 \text{ for all } z \in \Omega \}$.
When $\mathcal{M}$ is defined appropriately, we find that
  $R(\Omega, \mathcal{M}) := \{ x + s \mid x \in \mathcal{M} \text{ and } s \in N_\Omega(x) \}$
has an interior in $R^n$. If $x^*$ satisfies a nondegeneracy condition, that is, $-\nabla f(x^*) \in \mathrm{ri}\, N_\Omega(x^*)$ (relative interior), we have $x - \nabla f(x) \in R(\Omega, \mathcal{M})$ for all $x$ sufficiently close to $x^*$. From such points, projection recovers $\mathcal{M}$: $P_\Omega(x - \nabla f(x)) \in \mathcal{M}$.
• 64. [Figure: the normal cone $N_\Omega(x^*)$ at $x^*$, with the point $x^* - \nabla f(x^*)$ lying inside $x^* + N_\Omega(x^*)$.]
• 65. The non-averaged iterates from gradient projection methods eventually lie on the correct manifold $\mathcal{M}$. The same is true for dual-averaging methods (where the gradient term is averaged over all steps).
When we have a nonsmooth regularizer $\psi$ instead of $\Omega$, the analogous property is that the solution of the shrink subproblem, for some fixed positive $\alpha$, lies on the optimal manifold $\mathcal{M}$. Under reasonable conditions on $\alpha_k$, the "basic" shrink method eventually has all its iterates on $\mathcal{M}$. Also true for dual-averaged methods.
In practice, often use heuristics for deciding when $\mathcal{M}$ (or a small superset) has been reached. If a distance-to-solution bound is available, and if a Lipschitz constant for $\nabla f$ is known, can make this decision more rigorous.
• 66. How Might This Help?
Consider again logistic regression with regularizer $\psi(w) = \|w\|_1$:
  $L(w) = -\frac1m\left[ \sum_{y_i=-1} w^T x_i - \sum_{i=1}^{m}\log(1 + e^{w^T x_i}) \right]$.
Requires calculation of $Xw$, where $X = [x_i^T]_{i=1}^{m}$. (Can be cheap if $w$ has few nonzeros.) For the gradient, have
  $\nabla L(w) = \frac1m X^T u$, where $u_i = -(1 + e^{w^T x_i})^{-1}$ if $y_i = -1$, and $u_i = (1 + e^{-w^T x_i})^{-1}$ if $y_i = +1$.
This requires $m$ exponentials, and a matrix-vector multiply by $X$ (with a full vector $u$). If just a subset $G$ of gradient components is needed, multiply by the column submatrix $X_{\cdot G}^T$: much cheaper than the full gradient if $|G| \ll n$.
• 67. Reduced Hessian
  $\nabla^2 L(w) = \frac1m X^T \mathrm{diag}(v) X$, where $v_i = \frac{e^{w^T x_i}}{(1 + e^{w^T x_i})^2}$.
Often much cheaper to calculate the $|G| \times |G|$ reduced Hessian than the full Hessian. Can use sampling (Byrd et al., 2010) to approximate the projected Hessian: take a subset $S \subset \{1, 2, \dots, m\}$ and use $X_{SG}$ in place of $X_{\cdot G}$. Reduces the evaluation cost by a factor $|S|/m$.
• 68. Higher-Order Shrinking
Method for the problem $\min_x f(x) + \tau\|x\|_1$. Step $k$:
- Choose a subset $G_k \supset \{ i \mid x_k(i) \ne 0 \}$.
- Evaluate $\nabla_{G_k} f(x_k)$ and solve (in closed form):
    $\min_d \nabla f(x_k)^T d + \frac{1}{2\alpha_k} d^T d + \tau\|x_k + d\|_1$, s.t. $d(i) = 0$ for $i \notin G_k$.
  Repeat with a decreasing sequence of $\alpha_k$, until sufficient decrease; set $x_k^+ = x_k + d$.
- Define $C_k \subset G_k$ by $C_k := \{ i \mid x_k^+(i) \ne 0 \}$. Calculate a reduced Newton step in the $C_k$ components with (approximate) reduced Hessian matrix $H_{C_k C_k}$ and reduced gradient $\nabla_{C_k} f$, evaluated at $x_k^+$. Take this reduced step if it gives an improvement in the objective, giving $x_{k+1}$. Otherwise, settle for $x_{k+1} \leftarrow x_k^+$.
Can be generalized to other separable regularizers $\psi(x)$. The choice of subset must conform to the separation in the regularizer, and all components must be checked periodically.
• 69. Newton-Like Methods
Newton-like methods are ubiquitous in smooth optimization, motivated by the second-order Taylor series. (Basic) Newton's method steps are obtained from
  $x_{k+1} = \arg\min_z f(x_k) + \nabla f(x_k)^T(z - x_k) + \frac12(z - x_k)^T H_k (z - x_k)$,
where $H_k = \nabla^2 f(x_k)$. Near a local minimizer with second-order sufficient conditions, converges superlinearly: $\|x_{k+1} - x^*\| = o(\|x_k - x^*\|)$.
Can modify by: adding a prox-term, e.g. a multiple of $\|z - x_k\|^2$; adding a trust-region constraint $\|z - x_k\| \le \Delta_k$ (equivalent); doing a line search along $d$.
• 70. Choices of H_k
The Hessian $\nabla^2 f$ is often expensive to evaluate, so can use approximations, e.g.
- Re-use $\nabla^2 f$ from a previous iterate.
- Use a sampled approximation to $\nabla^2 f(x_k)$ (see above).
- Use a diagonal approximation, which at least gets the scaling right. See e.g. Barzilai-Borwein above.
- Quasi-Newton methods (BFGS, L-BFGS), which define $H_k$ to be a matrix that mimics the behavior of the true Hessian over previous steps. Requires only gradients $\nabla f$.
- Other approximations that exploit the structure of the problem, e.g. for nonlinear least squares $f(x) = \frac12\|r(x)\|_2^2$ with $r: R^n \to R^m$, the Hessian is
    $\nabla^2 f(x) = J(x)^T J(x) + \sum_{i=1}^{m} r_i(x)\nabla^2 r_i(x)$,
  where $J$ is the $m \times n$ Jacobian of $r$. In the Gauss-Newton method, use $H(x) = J(x)^T J(x)$.
• 71. Higher-Order Information and Constraints
Higher-order methods can be extended to the presence of constraints $x \in \Omega$ or regularizers $\psi(x)$ provided these elements can be incorporated explicitly into the subproblems. E.g. for constraints,
  $x_{k+1} = \arg\min_{z\in\Omega} f(x_k) + \nabla f(x_k)^T(z - x_k) + \frac12(z - x_k)^T H_k (z - x_k)$,
and for regularizers,
  $x_{k+1} = \arg\min_z f(x_k) + \nabla f(x_k)^T(z - x_k) + \frac12(z - x_k)^T H_k (z - x_k) + \tau\psi(z)$.
These subproblems are typically harder to solve than the "shrink" subproblem, unless $H_k$ is simple (e.g. diagonal). In practice, can do manifold identification and reduction (see above), or form simpler approximations to $\Omega$ (but then may need to incorporate curvature information about $\Omega$ into $H_k$ to ensure fast convergence).
• 72. Solving for x_{k+1}
When $H_k$ is positive definite, can solve the Newton equations explicitly:
  $H_k(z - x_k) = -\nabla f(x_k)$.
Alternatively, apply conjugate gradient to this system to get an inexact solution. Each iterate requires a multiplication by $H_k$. Can precondition CG, e.g. by using sample approximations, structured approximations, or Krylov subspace information gathered at previous evaluations of $\nabla^2 f$.
L-BFGS stores $H_k$ in implicit form, by means of $2m$ vectors in $R^n$, for a small parameter $m$ (e.g. 5). Recovers the solution of the equation above via $2m$ inner products.
• 73. "Higher-Order" Methods in ML
Several approaches tried (in addition to the sampling and reduced-space techniques discussed above).
- Bordes et al. (2009, corrected 2010), for the SVM objective
    $\tau w^T w + \sum_{i=1}^{m} \ell(w; x_i, y_i)$,
  scales the stochastic gradient step with a diagonal $H_k$, obtained from finite differences of the last estimated gradient over the last step.
- Schraudolph et al. (AISTATS 2007), "online BFGS," uses conventional quasi-Newton update formulae (e.g. L-BFGS) based on estimated gradient differences over previous steps.
Since the gradients are so inexact (based on just one data point), both in the update and in the right-hand side of the step equations, these methods are really stochastic gradient with interesting scaling, rather than quasi-Newton in the conventional sense.
• 74. Further Reading
1. L. Bottou and A. Moore, "Learning with large datasets," Tutorial at NIPS 2007 (slides on web).
2. J. Nocedal and S. J. Wright, Numerical Optimization, 2nd edition, Springer, 2006.
3. R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal, "On the use of stochastic Hessian information in unconstrained optimization," Technical Report, Northwestern University, June 2010.
4. M. Fisher, J. Nocedal, Y. Tremolet, and S. J. Wright, "Data assimilation in weather forecasting: A case study in PDE-constrained optimization," Optimization and Engineering, 10, pp. 409-426, 2009.
5. A. S. Lewis and S. J. Wright, "Identifying activity," Technical report, ORIE, Cornell University, revised April 2010.
6. S. J. Wright, "Accelerated block-coordinate relaxation for regularized optimization," Technical Report, August 2010.
7. W. Hare and A. Lewis, "Identifying active constraints via partial smoothness and prox-regularity," Journal of Convex Analysis, 11, pp. 251-266, 2004.
• 75. V. Decomposition / Coordinate Relaxation
For $\min f(x)$: at iteration $k$, choose a subset $G_k \subset \{1, 2, \dots, n\}$ and take a step $d_k$ only in these components, i.e. fix $d_k(i) = 0$ for $i \notin G_k$. Gives more manageable subproblem sizes, in practice. Can:
- take a reduced gradient step in the $G_k$ components;
- take multiple "inner iterations";
- actually solve the reduced subproblem in the space defined by $G_k$.
• 76. Constraints and Regularizers Complicate Things
For $\min_{x\in\Omega} f(x)$, need to put enough components into $G_k$ to stay feasible, as well as to make progress.
Example: $\min f(x_1, x_2)$ with $x_1 + x_2 = 1$. Relaxation with $G_k = \{1\}$ or $G_k = \{2\}$ won't work.
For a separable regularizer (e.g. Group LASSO) with
  $\psi(x) = \sum_{g\in\mathcal{G}} \psi_g(x_{[g]})$,
need to ensure that $G_k$ is a union of some of the index subsets $[g]$, i.e. the relaxation components must be consonant with the partitioning.
• 77. Decomposition and Dual SVM
Decomposition has long been popular for solving the dual (QP) formulation of SVM, since the number of variables (= number of training examples) is sometimes very large.
- SMO: each $G_k$ has two components.
- LIBSVM: SMO approach (still $|G_k| = 2$), with a different heuristic for choosing $G_k$.
- LASVM: again $|G_k| = 2$, with focus on the online setting.
- SVM-light: small $|G_k|$ (default 10).
- GPDT: larger $|G_k|$ (default 400) with a gradient projection solver as the inner loop.
• 78. Choice of G_k and Convergence Results
Some methods (e.g. Tseng and Yun, 2010) require $G_k$ to be chosen so that the improvement in the subproblem objective obtained over the subset $G_k$ is at least a fixed fraction of the improvement available over the whole space. Undesirable, since to check it, we usually need to evaluate the full gradient $\nabla f(x_k)$.
An alternative is a generalized Gauss-Seidel requirement, where each coordinate is "touched" at least once every $T$ iterations:
  $G_k \cup G_{k+1} \cup \dots \cup G_{k+T-1} = \{1, 2, \dots, n\}$.
Can show global convergence (e.g. Tseng and Yun, 2009; Wright, 2010). There are also results on global linear convergence rates, optimal manifold identification, and fast local convergence for an algorithm that takes reduced steps on the estimated optimal manifold. All deterministic analyses.
• 79. Stochastic Coordinate Descent
Analysis tools of stochastic gradient may be useful. If steps have the form $x_{k+1} = x_k - \alpha_k g_k$, where
  $g_k(i) = [\nabla f(x_k)]_i$ if $i \in G_k$, and $g_k(i) = 0$ otherwise,
then with suitable random selection of $G_k$ we can ensure that $g_k$ (appropriately scaled) is an unbiased estimate of $\nabla f(x_k)$. Hence can apply the SGD techniques discussed earlier, to choose $\alpha_k$ and obtain convergence.
Nesterov (2010) proposes another randomized approach for the unconstrained problem with known separate Lipschitz constants $L_i$:
  $\left| \frac{\partial f}{\partial x_i}(x + h e_i) - \frac{\partial f}{\partial x_i}(x) \right| \le L_i |h|$, $\quad i = 1, 2, \dots, n$.
(Works with blocks too, instead of individual components.)
• 80. At step $k$:
- Choose index $i_k \in \{1, 2, \dots, n\}$ with probability $p_i := L_i / \sum_{j=1}^{n} L_j$;
- Take a gradient step in the $i_k$ component:
  $x_{k+1} = x_k - \frac{1}{L_{i_k}}\frac{\partial f}{\partial x_{i_k}}(x_k)\, e_{i_k}$.
Basic convergence result:
  $E[f(x_k)] - f^* \le \frac{C}{k}$.
As for SA (earlier), but without any strong convexity assumption. Can also get linear convergence results (in expectation) by assuming strong convexity in $f$, according to different norms. Can also accelerate in the usual fashion (see above), to improve the expected convergence rate to $O(1/k^2)$.
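A minimal sketch of this Lipschitz-weighted randomized coordinate step for the illustrative quadratic $f(x) = \frac12 x^T A x - b^T x$ with $A$ assumed symmetric positive definite, so that the coordinate Lipschitz constants are $L_i = A_{ii}$ and a partial derivative costs one row product; the function name and interface are illustrative.

    import numpy as np

    def random_coordinate_descent(A, b, iters, rng):
        # Randomized coordinate descent in the style sketched above, for
        # f(x) = 0.5 x^T A x - b^T x:  pick i with probability L_i / sum_j L_j
        # (L_i = A_ii), then x[i] -= (df/dx_i) / L_i.
        n = A.shape[0]
        L = np.diag(A).copy()
        probs = L / L.sum()
        x = np.zeros(n)
        for _ in range(iters):
            i = rng.choice(n, p=probs)
            partial = A[i, :] @ x - b[i]      # i-th partial derivative of f
            x[i] -= partial / L[i]
        return x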
• 81. Further Reading
1. P. Tseng and S. Yun, "A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training," Computational Optimization and Applications, 47, pp. 179-206, 2010.
2. P. Tseng and S. Yun, "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, Series B, 117, pp. 387-423, 2009.
3. S. J. Wright, "Accelerated block-coordinate relaxation for regularized optimization," Technical Report, UW-Madison, August 2010.
4. Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," CORE Discussion Paper 2010/2, CORE, UCL, January 2010.
• 82. Conclusions
We've surveyed a number of topics in algorithmic fundamentals, with an eye on recent developments, and on topics of relevance (current or future) to machine learning. The talk was far from exhaustive. The literature on optimization in ML is huge and growing. There is much more to be gained from the interaction between the two areas.
FIN