Technical Tricks of Vowpal Wabbit

                http://hunch.net/~vw/

    John Langford, Columbia, Data-Driven Modeling,
                       April 16


  git clone git://github.com/JohnLangford/vowpal_wabbit.git
Goals of the VW project
   1   State of the art in scalable, fast, efficient
       Machine Learning. VW is (by far) the most
       scalable public linear learner, and plausibly the
       most scalable anywhere.
   2   Support research into new ML algorithms. ML
       researchers can deploy new algorithms on an
       efficient platform efficiently. BSD open source.
   3   Simplicity. No strange dependencies, currently
       only 9437 lines of code.
   4   It just works. A package in debian & R.
       Otherwise, users just type make, and get a
       working system. At least a half-dozen companies
       use VW.
Demonstration


  vw -c rcv1.train.vw.gz --exact_adaptive_norm --power_t 1 -l 0.5
The basic learning algorithm
  Learn w such that fw(x) = w · x predicts well.
    1    Online learning with strong defaults.
    2    Every input source but library.
    3    Every output sink but library.
    4    In-core feature manipulation for ngrams, outer
         products, etc... Custom is easy.
    5    Debugging with readable models & audit mode.
    6    Different loss functions: squared, logistic, ...
    7    ℓ1 and ℓ2 regularization.
    8    Compatible LBFGS-based batch-mode
         optimization.
    9    Cluster parallel.
    10   Daemon deployable.
The tricks

      Basic VW         Newer Algorithmics       Parallel Stuff

   Feature Caching     Adaptive Learning    Parameter Averaging
   Feature Hashing     Importance Updates    Nonuniform Average
   Online Learning      Dim. Correction       Gradient Summing
   Implicit Features        L-BFGS            Hadoop AllReduce
                        Hybrid Learning
  We'll discuss Basic VW and algorithmics, then Parallel.
Feature Caching


  Compare: time vw rcv1.train.vw.gz --exact_adaptive_norm
  --power_t 1
Feature Hashing


  [Diagram: a conventional learner keeps a string → index dictionary in RAM
  in addition to the weights; VW keeps only the weights and hashes feature
  strings directly to indices.]


  Most algorithms use a hashmap to change a word into an
  index for a weight.
  VW uses a hash function which takes almost no RAM, is ~10x
  faster, and is easily parallelized.
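
  To make the contrast concrete, here is a minimal Python sketch of the
  hashing trick: instead of growing a string → index dictionary, hash the
  feature name directly into a fixed-size weight vector. The hash function,
  table size, and feature names below are illustrative only, not VW's actual
  implementation (which uses a murmurhash-based scheme).

    import zlib

    NUM_BITS = 18                      # 2**18 weight slots, as with vw -b 18
    NUM_WEIGHTS = 1 << NUM_BITS
    weights = [0.0] * NUM_WEIGHTS

    def feature_index(name):
        # Hash the feature name straight to a weight index; no dictionary kept.
        return zlib.crc32(name.encode()) & (NUM_WEIGHTS - 1)

    def predict(features):
        # features: {feature name: value}
        return sum(weights[feature_index(f)] * v for f, v in features.items())

    print(predict({"word=free": 1.0, "word=viagra": 1.0}))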
The spam example [WALS09]

    1   3.2 × 10^6 labeled emails.

    2   433167 users.

    3   ∼ 40 × 10^6 unique features.


  How do we construct a spam filter which is personalized, yet
  uses global information?




  Answer: Use hashing to predict according to:
  ⟨w, φ(x)⟩ + ⟨w_u, φ_u(x)⟩
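
  A hypothetical sketch of that personalization trick: emit every feature
  twice, once globally and once prefixed with the user id, and hash both
  copies into the same weight vector, so the model learns a global spam
  score plus a per-user correction. Names and the prefixing scheme are
  illustrative, not the exact construction from [WALS09].

    def personalized_features(user_id, features):
        out = {}
        for name, value in features.items():
            out[name] = value                            # global term, phi(x)
            out["user=" + user_id + "^" + name] = value  # user term, phi_u(x)
        return out

    x = personalized_features("alice", {"word=free": 1.0})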
Results


  [Figure: results of the hashed, personalized spam filter;
  baseline = global-only predictor.]
Basic Online Learning

  Start with ∀i: w_i = 0. Repeatedly:

    1   Get example x ∈ (−∞, ∞)*.

    2   Make prediction ŷ = Σ_i w_i x_i, clipped to the interval [0, 1].

    3   Learn truth y ∈ [0, 1] with importance I, or go to (1).

    4   Update w_i ← w_i + η 2(y − ŷ) I x_i and go to (1).
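
  A minimal Python sketch of this loop for squared loss. It assumes weights
  is a list of size 2^b indexed by hashed feature indices, as in the hashing
  sketch above; the data format and names are made up for illustration.

    def clip(p, lo=0.0, hi=1.0):
        return max(lo, min(hi, p))

    def online_pass(examples, weights, eta=0.5):
        # examples: iterable of (features {index: value}, label y in [0,1], importance I)
        for features, y, importance in examples:
            y_hat = clip(sum(weights[i] * v for i, v in features.items()))
            # Squared-loss gradient step, scaled by the importance weight.
            for i, v in features.items():
                weights[i] += eta * 2.0 * (y - y_hat) * importance * v
        return weights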
Reasons for Online Learning
    1   Fast convergence to a good predictor.

    2   It's RAM efficient. You need to store only one example in
        RAM rather than all of them.   ⇒   Entirely new scales of
        data are possible.

    3   Online Learning algorithm = Online Optimization
        Algorithm. Online Learning Algorithms    ⇒   the ability to
        solve entirely new categories of applications.

    4   Online Learning = ability to deal with drifting
        distributions.
Implicit Outer Product
  Sometimes you care about the interaction of two sets of
  features (ad features × query features, news features × user
  features, etc...).
  Choices:

     1   Expand the set of features explicitly, consuming n² disk
         space.

     2   Expand the features dynamically in the core of your
         learning algorithm.

  Option (2) is ~10x faster, as sketched below. You need to be
  comfortable with hashes first.
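
  A hypothetical sketch of option (2): generate the cross features on the
  fly and hash each pair into the shared weight table, so the n² expansion
  is never written to disk. This is conceptually what VW's quadratic
  feature option (-q) does; the hashing details here are illustrative.

    import zlib

    NUM_WEIGHTS = 1 << 18

    def h(name):
        return zlib.crc32(name.encode()) & (NUM_WEIGHTS - 1)

    def cross_features(ad, query):
        # Outer product of two namespaces, materialized only as hashed indices.
        out = {}
        for a_name, a_val in ad.items():
            for q_name, q_val in query.items():
                out[h(a_name + "^" + q_name)] = a_val * q_val
        return out

    pairs = cross_features({"ad=shoes": 1.0}, {"q=running": 1.0, "q=cheap": 1.0})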
The tricks

      Basic VW          Newer Algorithmics      Parallel Stuff

   Feature Caching      Adaptive Learning    Parameter Averaging
   Feature Hashing      Importance Updates   Nonuniform Average
   Online Learning       Dim. Correction     Gradient Summing
   Implicit Features         L-BFGS           Hadoop AllReduce
                         Hybrid Learning
  Next: algorithmics.
Adaptive Learning [DHS10,MS10]

  For example t, let g_it = 2(y_t − ŷ_t) x_it.


  New update rule:   w_i ← w_i − η g_{i,t+1} / sqrt( Σ_{t'=1}^{t+1} g_{it'}² )


  Common features stabilize quickly. Rare features can have
  large updates.
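
  A hypothetical sketch of that per-feature adaptive rule for squared loss:
  each weight keeps a running sum of squared gradients, and its effective
  learning rate shrinks accordingly. Variable names are illustrative; the
  gradient below uses the standard sign convention for descent.

    from collections import defaultdict

    weights = defaultdict(float)
    grad_sq_sums = defaultdict(float)   # per-feature running sum of squared gradients

    def adaptive_update(features, y, eta=0.5):
        y_hat = sum(weights[i] * v for i, v in features.items())
        for i, v in features.items():
            g = 2.0 * (y_hat - y) * v          # gradient of squared loss w.r.t. w_i
            if g == 0.0:
                continue
            grad_sq_sums[i] += g * g
            # Per-feature learning rate shrinks as that feature accumulates gradient.
            weights[i] -= eta * g / (grad_sq_sums[i] ** 0.5)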
Learning with importance weights [KL11]

  [Figure sequence: a 1-D picture with the current prediction w_t·x and the
  label y. A standard gradient step moves the prediction by −η(...)x toward
  y. Naively scaling the step by an importance weight of 6 gives −6η(...)x,
  which overshoots and puts w_{t+1}·x on the far side of y (??). The
  importance-aware update instead behaves like many small steps, moving the
  prediction by s(h)·||x||² toward y without ever crossing it.]
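
  For squared loss the importance-aware update has a closed form, equivalent
  to taking infinitely many infinitesimal steps whose learning rates sum to
  h·η. The sketch below is my reading of that idea, so treat the exact
  formula as an assumption rather than a quotation from the slides or [KL11].

    import math

    def importance_aware_update(w, x, y, eta, h):
        # w, x: dense lists; h: importance weight; squared loss assumed.
        x_sq = sum(v * v for v in x)
        if x_sq == 0.0:
            return w
        p = sum(wi * xi for wi, xi in zip(w, x))
        # Fraction of the gap (y - p) that gets covered: it approaches 1 as
        # h*eta grows, so the prediction approaches y but never crosses it.
        scale = (1.0 - math.exp(-h * eta * x_sq)) / x_sq
        return [wi + scale * (y - p) * xi for wi, xi in zip(w, x)]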
Robust results for unweighted problems

  [Figure: four scatter plots, one per dataset and loss (astro - logistic
  loss, spam - quantile loss, rcv1 - squared loss, webspam - hinge loss),
  plotting test performance of the standard update (y-axis, 0.9 to 1.0)
  against the importance-aware update (x-axis, 0.9 to 1.0).]
Dimensional Correction
  Gradient of squared loss = ∂(f_w(x) − y)²/∂w_i = 2(f_w(x) − y) x_i, and we
  change weights in the negative gradient direction:


                     w_i ← w_i − η ∂(f_w(x) − y)²/∂w_i


  But the gradient has intrinsic problems: w_i naturally has units of 1/x_i,
  since doubling x_i implies halving w_i to get the same prediction.
  ⇒   The update rule has mixed units!




  A crude fix: divide the update by Σ_i x_i². It helps a lot!
  This is scary! The problem optimized is
  min_w Σ_{x,y} (f_w(x) − y)² / Σ_i x_i²
  rather than min_w Σ_{x,y} (f_w(x) − y)². But it works.
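
  A tiny sketch of that crude fix (my own illustration, not VW's exact
  normalization code): scale each squared-loss step by 1/Σ_i x_i² so the
  step no longer depends on how the features happen to be scaled.

    from collections import defaultdict

    def normalized_update(weights, features, y, eta=0.5):
        x_sq = sum(v * v for v in features.values())
        if x_sq == 0.0:
            return
        y_hat = sum(weights[i] * v for i, v in features.items())
        # Divide the update by sum_i x_i^2 to fix the units.
        step = eta * 2.0 * (y_hat - y) / x_sq
        for i, v in features.items():
            weights[i] -= step * v

    weights = defaultdict(float)
    normalized_update(weights, {0: 2.0, 7: 1.0}, y=1.0)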
LBFGS [Nocedal80]
  Batch(!) second order algorithm. Core idea = efficient
  approximate Newton step.




  H = ∂²(f_w(x) − y)² / (∂w_i ∂w_j) = Hessian.

  Newton step = w → w + H⁻¹ g.

  Newton fails: you can't even represent H.
  Instead build up an approximate inverse Hessian according to:
  ∆_w ∆_w^T / (∆_w^T ∆_g), where ∆_w is a change in weights and ∆_g is a
  change in the loss gradient g.
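
  A deliberately crude sketch of that idea using a single secant pair rather
  than L-BFGS's full two-loop recursion: approximate H⁻¹g from the most
  recent change in weights and gradients without ever forming H. Everything
  here is illustrative.

    def approx_newton_direction(w_prev, w_curr, g_prev, g_curr):
        dw = [a - b for a, b in zip(w_curr, w_prev)]        # Delta_w
        dg = [a - b for a, b in zip(g_curr, g_prev)]        # Delta_g
        curvature = sum(a * b for a, b in zip(dw, dg))      # Delta_w^T Delta_g
        if curvature <= 0.0:
            return g_curr                                   # fall back to the plain gradient
        # (Delta_w Delta_w^T / Delta_w^T Delta_g) g
        coeff = sum(a * b for a, b in zip(dw, g_curr)) / curvature
        return [coeff * d for d in dw]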
Hybrid Learning
  Online learning is GREAT for getting to a good solution fast.
  LBFGS is GREAT for getting a perfect solution.
  Use Online Learning, then LBFGS.

  [Figure: two panels of auPRC vs. iteration comparing pure Online, pure
  L-BFGS, and L-BFGS warm-started with 1 or 5 online passes; the
  warm-started runs reach high auPRC in far fewer iterations.]
The tricks

      Basic VW         Newer Algorithmics      Parallel Stuff

   Feature Caching     Adaptive Learning    Parameter Averaging
   Feature Hashing     Importance Updates   Nonuniform Average
   Online Learning      Dim. Correction     Gradient Summing
   Implicit Features        L-BFGS           Hadoop AllReduce
                        Hybrid Learning
  Next: Parallel.
Applying for a fellowship in 1997

  Interviewer: So, what do you want to do?
  John: I'd like to solve AI.
  I: How?
  J: I want to use parallel learning algorithms to create fantastic
  learning machines!
  I: You fool! The only thing parallel machines are good for is
  computational windtunnels!
  The worst part: he had a point.
Terascale Linear Learning ACDL11
  Given 2.1 Terafeatures of data, how can you learn a good
  linear predictor f_w(x) = Σ_i w_i x_i ?




  2.1T sparse features
  17B Examples
  16M parameters
  1K nodes




  70 minutes = 500M features/second: faster than the IO
  bandwidth of a single machine ⇒ we beat all possible single
  machine linear learning algorithms.
Features/s

  [Bar chart ("Speed per method"): input throughput in features/second
  (log scale, 100 to 1e+09) for single-machine and parallel learners:
  RBF-SVM (MPI?-500, RCV1), Ensemble Tree (MPI-128, Synthetic),
  RBF-SVM (TCP-48, MNIST 220K), Decision Tree (MapRed-200, Ad-Bounce #),
  Boosted DT (MPI-32, Ranking #), Linear (Threads-2, RCV1), and
  Linear (Hadoop+TCP-1000, Ads *).
  Compare: other supervised algorithms in the Parallel Learning book.]
MPI-style AllReduce
   Allreduce initial state


       5       7       6

   1       2       3         4
MPI-style AllReduce
   Allreduce final state


        28        28        28

   28        28        28        28
MPI-style AllReduce
   Create Binary Tree
               7

       5               6

   1       2       3       4
MPI-style AllReduce
   Reducing, step 1
               7

       8               13

   1       2       3        4
MPI-style AllReduce
   Reducing, step 2
               28

       8                13

   1       2        3        4
MPI-style AllReduce
   Broadcast, step 1
                28

       28                28

   1        2        3        4
MPI-style AllReduce
    Allreduce final state
                      28

          28                       28

    28          28            28        28
  AllReduce = Reduce+Broadcast
  Properties:

    1    Easily pipelined so no latency concerns.

    2    Bandwidth   ≤ 6n .
    3    No need to rewrite code!
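
  To make the reduce-then-broadcast structure concrete, here is a small
  simulated sketch in plain Python (no MPI or Hadoop): each node sums its
  children's values into its own, the root ends up with the global sum, and
  the sum is then pushed back down. The node numbering and tree layout are
  illustrative.

    def allreduce_sum(values, children, root=0):
        # Simulated AllReduce over a tree given as {parent: [child, ...]}.
        totals = list(values)

        def reduce_up(node):                 # Reduce: children feed sums to parents.
            for child in children.get(node, []):
                totals[node] += reduce_up(child)
            return totals[node]

        def broadcast_down(node, total):     # Broadcast: the root's total flows back down.
            totals[node] = total
            for child in children.get(node, []):
                broadcast_down(child, total)

        reduce_up(root)
        broadcast_down(root, totals[root])
        return totals                        # every node now holds the global sum

    # Seven nodes holding 7,5,6,1,2,3,4, arranged as the binary tree above.
    tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
    print(allreduce_sum([7, 5, 6, 1, 2, 3, 4], tree))   # -> [28, 28, 28, 28, 28, 28, 28]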
An Example Algorithm: Weight averaging
  n = AllReduce(1)
  While (pass number < max)

    1   While (examples left)

          1   Do online update.

    2   AllReduce(weights)

    3   For each weight w ← w/n  (a sketch of this loop appears below)

  Other algorithms implemented:

    1   Nonuniform averaging for online learning

    2   Conjugate Gradient

    3   LBFGS
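
  A hypothetical sketch of that averaging loop, assuming an allreduce
  primitive that sums a vector elementwise across nodes and returns the
  result to every node (the tree simulation above would need a thin wrapper
  to fit this single-argument form). The data handling and names are
  invented for illustration.

    def averaged_training(local_examples, num_weights, max_passes, allreduce, update):
        # Each node runs this on its own shard of the data.
        weights = [0.0] * num_weights
        n = allreduce([1.0])[0]                     # n = number of participating nodes
        for _ in range(max_passes):
            for features, label in local_examples:  # local online pass
                update(weights, features, label)
            weights = allreduce(weights)            # sum weights across all nodes
            weights = [w / n for w in weights]      # divide by n: the average
        return weights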
What is Hadoop AllReduce?

  [Diagram: the program is shipped to the nodes where the data lives.]

   1    Map job moves program to data.

   2    Delayed initialization: Most failures are disk failures.
        First read (and cache) all data, before initializing
        allreduce. Failures autorestart on a different node with
        identical data.

   3    Speculative execution: In a busy cluster, one node is
        often slow. Hadoop can speculatively start additional
        mappers. We use the first to finish reading all data once.
Approach Used
    1   Optimize hard so few data passes are required.

          1   Normalized, adaptive, safe, online, gradient
              descent.
          2   L-BFGS
          3   Use (1) to warmstart (2).

    2   Use map-only Hadoop for process control and error
        recovery.

    3   Use AllReduce code to sync state.

    4   Always save input examples in a cache file to speed later
        passes.

    5   Use the hashing trick to reduce input complexity.


  Open source in Vowpal Wabbit 6.1. Search for it.
Robustness & Speedup

  [Figure: "Speed per method", speedup vs. number of nodes (10 to 100),
  showing Average_10, Min_10, and Max_10 speedup curves against the ideal
  linear speedup line.]
Splice Site Recognition

  [Figure: auPRC vs. iteration (0 to 50) for Online, L-BFGS with 5 online
  passes, L-BFGS with 1 online pass, and pure L-BFGS.]
Splice Site Recognition

  [Figure: auPRC vs. effective number of passes over the data (0 to 20),
  comparing L-BFGS with one online pass against Zinkevich et al. and
  Dekel et al.]
To learn more

  The wiki has tutorials, examples, and help:
  https://github.com/JohnLangford/vowpal_wabbit/wiki


  Mailing List: vowpal_wabbit@yahoo.com


  Various discussion: the Machine Learning (Theory) blog at
  http://hunch.net
Bibliography: Original VW
Caching L. Bottou. Stochastic Gradient Descent Examples on
         Toy Problems,
         http://leon.bottou.org/projects/sgd,        2007.

Release Vowpal Wabbit open source project,
         http://github.com/JohnLangford/vowpal_wabbit/wiki,
         2007.

Hashing Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola,
         and S. V. N. Vishwanathan, Hash Kernels for Structured
         Data, AISTAT 2009.

Hashing K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and
         J. Attenberg, Feature Hashing for Large Scale Multitask
         Learning, ICML 2009.
Bibliography: Algorithmics
L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with
         Limited Storage, Mathematics of Computation
         35:773–782, 1980.

Adaptive H. B. McMahan and M. Streeter, Adaptive Bound
         Optimization for Online Convex Optimization, COLT
         2010.

Adaptive J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient
         Methods for Online Learning and Stochastic
         Optimization, COLT 2010.

    Safe N. Karampatziakis, and J. Langford, Online Importance
         Weight Aware Updates, UAI 2011.
Bibliography: Parallel
grad sum C. Teo, Q. Le, A. Smola, V. Vishwanathan, A Scalable
          Modular Convex Solver for Regularized Risk
          Minimization, KDD 2007.

   avg. 1 G. Mann et al., Efficient large-scale distributed training
          of conditional maximum entropy models, NIPS 2009.

   avg. 2 K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable
          for Distributed Optimization, LCCC 2010.

  ov. avg M. Zinkevich, M. Weimer, A. Smola, and L. Li,
          Parallelized Stochastic Gradient Descent, NIPS 2010.

P. online D. Hsu, N. Karampatziakis, J. Langford, and A. Smola,
          Parallel Online Learning, in SUML 2010.

D. Mini 1 O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao,
          Optimal Distributed Online Predictions Using Minibatch,
          http://arxiv.org/abs/1012.1367
Vowpal Wabbit Goals for Future
Development
   1   Native learning reductions. Just like more complicated
       losses. In development now.

   2   Librarification, so people can use VW in their favorite
       language.

   3   Other learning algorithms, as interest dictates.

   4   Various further optimizations. (Allreduce can be
       improved by a factor of 3...)
Reductions
  Goal: minimize loss ℓ on D.

  [Diagram: transform D into D′, run an algorithm optimizing 0/1 loss on D′,
  and get back a hypothesis h.]

  Transform h with small ℓ_{0/1}(h, D′) into R_h with small ℓ(R_h, D),
  such that if h does well on (D′, ℓ_{0/1}), R_h is guaranteed to do
  well on (D, ℓ).
The transformation


  R = transformer from complex example to simple example.

  R⁻¹ = transformer from simple predictions to complex
  prediction.
example: One Against All

  Create k binary regression problems, one per class.
  For class i, predict "Is the label i or not?"

                  (x, y) −→  { (x, 1(y = 1)),
                               (x, 1(y = 2)),
                               ...,
                               (x, 1(y = k)) }

  Multiclass prediction: evaluate all the classifiers and choose
  the largest scoring label.
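
  A hypothetical sketch of the one-against-all reduction: train k binary
  learners on relabeled copies of each example, and at prediction time pick
  the class whose learner scores highest. The binary learner interface here
  is invented for illustration; conceptually this is what the oaa reduction
  wires up inside VW.

    class OneAgainstAll:
        # Reduce k-class classification to k binary regression problems.

        def __init__(self, k, make_binary_learner):
            self.learners = [make_binary_learner() for _ in range(k)]

        def learn(self, x, y):
            # R: one multiclass example becomes k binary examples.
            for i, learner in enumerate(self.learners):
                learner.learn(x, 1.0 if y == i else 0.0)

        def predict(self, x):
            # R^-1: k binary scores become one multiclass prediction.
            scores = [learner.predict(x) for learner in self.learners]
            return max(range(len(scores)), key=lambda i: scores[i])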
The code: oaa.cc
  // Parses reduction-specific flags.
  void parse_flags(size_t s, void (*base_l)(example*), void (*base_f)())


  // Implements R and R⁻¹ using base_l.
  void learn(example* ec)


  // Cleans any temporary state and calls base_f.
  void finish()


  The important point: anything fitting this interface is easy to
  code in VW now, including all forms of feature diddling and
  creation.
  And reductions inherit all the
  input/output/optimization/parallelization of VW!
Reductions implemented
   1   One-Against-All (--oaa k). The baseline multiclass
       reduction.

   2   Cost-Sensitive One-Against-All (--csoaa k).
       Predicts the cost of each label and minimizes the cost.

   3   Weighted All-Pairs (--wap k). An alternative to
       --csoaa with better theory.

   4   Cost-Sensitive One-Against-All with Label-Dependent
       Features (--csoaa_ldf). As --csoaa, but features are not
       shared between labels.

   5   WAP with Label-Dependent Features (--wap_ldf).

   6   Sequence Prediction (--sequence k). A simple
       implementation of Searn and Dagger for sequence
       prediction. Uses a cost-sensitive predictor.
Reductions to Implement

  [Diagram: "Regret Transform Reductions", a tree of reductions annotated
  with regret multipliers and algorithm names. It includes: AUC Ranking from
  Classification via Quicksort; Quantile Regression via Quanting;
  Importance-Weighted Classification via Costing; Mean Regression via
  Probing; k-Partial Label via the Offset Tree; k-Classification via ECT and
  PECOC; k-way Regression via the Filter Tree; k-cost Classification;
  T-step RL with State Visitation via PSDP; T-step RL with a Demonstration
  Policy via Searn; and open problems (??) for Dynamic Models and
  Unsupervised learning by Self-Prediction.]
More Related Content

What's hot

What's hot (20)

Warsaw Data Science - Factorization Machines Introduction
Warsaw Data Science -  Factorization Machines IntroductionWarsaw Data Science -  Factorization Machines Introduction
Warsaw Data Science - Factorization Machines Introduction
 
【DL輪読会】A Path Towards Autonomous Machine Intelligence
【DL輪読会】A Path Towards Autonomous Machine Intelligence【DL輪読会】A Path Towards Autonomous Machine Intelligence
【DL輪読会】A Path Towards Autonomous Machine Intelligence
 
「これからの強化学習」勉強会#1
「これからの強化学習」勉強会#1「これからの強化学習」勉強会#1
「これからの強化学習」勉強会#1
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
 
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
 
머신러닝 + 주식 삽질기
머신러닝 + 주식 삽질기머신러닝 + 주식 삽질기
머신러닝 + 주식 삽질기
 
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
You Only Look One-level Featureの解説と見せかけた物体検出のよもやま話
 
Optuna on Kubeflow Pipeline 分散ハイパラチューニング
Optuna on Kubeflow Pipeline 分散ハイパラチューニングOptuna on Kubeflow Pipeline 分散ハイパラチューニング
Optuna on Kubeflow Pipeline 分散ハイパラチューニング
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기
 
全力解説!Transformer
全力解説!Transformer全力解説!Transformer
全力解説!Transformer
 
YOLOv3 , Mask R-CNNなどの一般物体検出技術をG空間分野に活用する(FOSS4G 2018 Tokyo)
YOLOv3 , Mask R-CNNなどの一般物体検出技術をG空間分野に活用する(FOSS4G 2018 Tokyo)YOLOv3 , Mask R-CNNなどの一般物体検出技術をG空間分野に活用する(FOSS4G 2018 Tokyo)
YOLOv3 , Mask R-CNNなどの一般物体検出技術をG空間分野に活用する(FOSS4G 2018 Tokyo)
 
Long-Tailed Classificationの最新動向について
Long-Tailed Classificationの最新動向についてLong-Tailed Classificationの最新動向について
Long-Tailed Classificationの最新動向について
 
できる!遺伝的アルゴリズム
できる!遺伝的アルゴリズムできる!遺伝的アルゴリズム
できる!遺伝的アルゴリズム
 
実践 Amazon Mechanical Turk ※下記の注意点をご覧ください(回答の質の悪化・報酬額の相場の変化・仕様変更)
実践 Amazon Mechanical Turk ※下記の注意点をご覧ください(回答の質の悪化・報酬額の相場の変化・仕様変更)実践 Amazon Mechanical Turk ※下記の注意点をご覧ください(回答の質の悪化・報酬額の相場の変化・仕様変更)
実践 Amazon Mechanical Turk ※下記の注意点をご覧ください(回答の質の悪化・報酬額の相場の変化・仕様変更)
 
協調フィルタリング入門
協調フィルタリング入門協調フィルタリング入門
協調フィルタリング入門
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
 
MediaPipeを使ったARアプリ開発事例 ~カメラをかざして家䛾中で売れるも䛾を探そう~
MediaPipeを使ったARアプリ開発事例 ~カメラをかざして家䛾中で売れるも䛾を探そう~MediaPipeを使ったARアプリ開発事例 ~カメラをかざして家䛾中で売れるも䛾を探そう~
MediaPipeを使ったARアプリ開発事例 ~カメラをかざして家䛾中で売れるも䛾を探そう~
 
Developing a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkDeveloping a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with Spark
 
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
 
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
Swin Transformer (ICCV'21 Best Paper) を完璧に理解する資料
 

Similar to Technical Tricks of Vowpal Wabbit

Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Flink Forward
 
Terascale Learning
Terascale LearningTerascale Learning
Terascale Learning
pauldix
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svm
taikhoan262
 

Similar to Technical Tricks of Vowpal Wabbit (20)

Dual SVM Problem.pdf
Dual SVM Problem.pdfDual SVM Problem.pdf
Dual SVM Problem.pdf
 
Notes relating to Machine Learning and SVM
Notes relating to Machine Learning and SVMNotes relating to Machine Learning and SVM
Notes relating to Machine Learning and SVM
 
MLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic trackMLHEP Lectures - day 2, basic track
MLHEP Lectures - day 2, basic track
 
super vector machines algorithms using deep
super vector machines algorithms using deepsuper vector machines algorithms using deep
super vector machines algorithms using deep
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Support Vector Machines is the the the the the the the the the
Support Vector Machines is the the the the the the the the theSupport Vector Machines is the the the the the the the the the
Support Vector Machines is the the the the the the the the the
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Support Vector Machines Simply
Support Vector Machines SimplySupport Vector Machines Simply
Support Vector Machines Simply
 
Terascale Learning
Terascale LearningTerascale Learning
Terascale Learning
 
Anomaly detection using deep one class classifier
Anomaly detection using deep one class classifierAnomaly detection using deep one class classifier
Anomaly detection using deep one class classifier
 
Guide
GuideGuide
Guide
 
Huong dan cu the svm
Huong dan cu the svmHuong dan cu the svm
Huong dan cu the svm
 
Support Vector Machines USING MACHINE LEARNING HOW IT WORKS
Support Vector Machines USING MACHINE LEARNING HOW IT WORKSSupport Vector Machines USING MACHINE LEARNING HOW IT WORKS
Support Vector Machines USING MACHINE LEARNING HOW IT WORKS
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
New Directions for Mahout
New Directions for MahoutNew Directions for Mahout
New Directions for Mahout
 
Svm map reduce_slides
Svm map reduce_slidesSvm map reduce_slides
Svm map reduce_slides
 
Svm V SVC
Svm V SVCSvm V SVC
Svm V SVC
 
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 

More from jakehofman

NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
jakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
jakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
jakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
jakehofman
 

More from jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 

Recently uploaded

Recently uploaded (20)

PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
Keeping Your Information Safe with Centralized Security Services
Keeping Your Information Safe with Centralized Security ServicesKeeping Your Information Safe with Centralized Security Services
Keeping Your Information Safe with Centralized Security Services
 
Salient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptxSalient features of Environment protection Act 1986.pptx
Salient features of Environment protection Act 1986.pptx
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
 
The Benefits and Challenges of Open Educational Resources
The Benefits and Challenges of Open Educational ResourcesThe Benefits and Challenges of Open Educational Resources
The Benefits and Challenges of Open Educational Resources
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
 
Benefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational ResourcesBenefits and Challenges of Using Open Educational Resources
Benefits and Challenges of Using Open Educational Resources
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
Operations Management - Book1.p - Dr. Abdulfatah A. Salem
Operations Management - Book1.p  - Dr. Abdulfatah A. SalemOperations Management - Book1.p  - Dr. Abdulfatah A. Salem
Operations Management - Book1.p - Dr. Abdulfatah A. Salem
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
Basic Civil Engineering Notes of Chapter-6, Topic- Ecosystem, Biodiversity G...
Basic Civil Engineering Notes of Chapter-6,  Topic- Ecosystem, Biodiversity G...Basic Civil Engineering Notes of Chapter-6,  Topic- Ecosystem, Biodiversity G...
Basic Civil Engineering Notes of Chapter-6, Topic- Ecosystem, Biodiversity G...
 
The impact of social media on mental health and well-being has been a topic o...
The impact of social media on mental health and well-being has been a topic o...The impact of social media on mental health and well-being has been a topic o...
The impact of social media on mental health and well-being has been a topic o...
 
The Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. HenryThe Last Leaf, a short story by O. Henry
The Last Leaf, a short story by O. Henry
 
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptxJose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
Jose-Rizal-and-Philippine-Nationalism-National-Symbol-2.pptx
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Morse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptxMorse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptx
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 

Technical Tricks of Vowpal Wabbit

  • 1. Technical Tricks of Vowpal Wabbit http://hunch.net/~vw/ John Langford, Columbia, Data-Driven Modeling, April 16 git clone git://github.com/JohnLangford/vowpal_wabbit.git
  • 2. Goals of the VW project 1 State of the art in scalable, fast, ecient Machine Learning. VW is (by far) the most scalable public linear learner, and plausibly the most scalable anywhere. 2 Support research into new ML algorithms. ML researchers can deploy new algorithms on an ecient platform eciently. BSD open source. 3 Simplicity. No strange dependencies, currently only 9437 lines of code. 4 It just works. A package in debian R. Otherwise, users just type make, and get a working system. At least a half-dozen companies use VW.
  • 3. Demonstration vw -c rcv1.train.vw.gz exact_adaptive_norm power_t 1 -l 0.5
  • 4. The basic learning algorithm Learn w such that fw (x ) = w .x predicts well. 1 Online learning with strong defaults. 2 Every input source but library. 3 Every output sink but library. 4 In-core feature manipulation for ngrams, outer products, etc... Custom is easy. 5 Debugging with readable models audit mode. 6 Dierent loss functions squared, logistic, ... 7 1 and 2 regularization. 8 Compatible LBFGS-based batch-mode optimization 9 Cluster parallel 10 Daemon deployable.
  • 5. The tricks Basic VW Newer Algorithmics Parallel Stu Feature Caching Adaptive Learning Parameter Averaging Feature Hashing Importance Updates Nonuniform Average Online Learning Dim. Correction Gradient Summing Implicit Features L-BFGS Hadoop AllReduce Hybrid Learning We'll discuss Basic VW and algorithmics, then Parallel.
  • 6. Feature Caching Compare: time vw rcv1.train.vw.gz exact_adaptive_norm power_t 1
  • 7. Feature Hashing String − Index dictionary RAM Weights Weights Conventional VW Most algorithms use a hashmap to change a word into an index for a weight. VW uses a hash function which takes almost no RAM, is x10 faster, and is easily parallelized.
  • 8. The spam example [WALS09] 1 3.2 ∗ 106 labeled emails. 2 433167 users. 3 ∼ 40 ∗ 106 unique features. How do we construct a spam lter which is personalized, yet uses global information?
  • 9. The spam example [WALS09] 1 3.2 ∗ 106 labeled emails. 2 433167 users. 3 ∼ 40 ∗ 106 unique features. How do we construct a spam lter which is personalized, yet uses global information? Answer: Use hashing to predict according to: w , φ(x ) + w , φ (x ) u
  • 10. Results (baseline = global only predictor)
  • 11. Basic Online Learning Start with ∀i : w i = 0, Repeatedly: 1 Get example x ∈ (∞, ∞)∗. 2 Make prediction y − ˆ w x clipped to interval [0, 1]. i i i 3 Learn truth y ∈ [0, 1] with importance I or goto (1). 4 Update w ← w + η 2(y − y )Ix and go to (1). i i ˆ i
  • 12. Reasons for Online Learning 1 Fast convergence to a good predictor 2 It's RAM ecient. You need store only one example in RAM rather than all of them. ⇒ Entirely new scales of data are possible. 3 Online Learning algorithm = Online Optimization Algorithm. Online Learning Algorithms ⇒ the ability to solve entirely new categories of applications. 4 Online Learning = ability to deal with drifting distributions.
  • 13. Implicit Outer Product Sometimes you care about the interaction of two sets of features (ad features x query features, news features x user features, etc...). Choices: 1 Expand the set of features explicitly, consuming n2 disk space. 2 Expand the features dynamically in the core of your learning algorithm.
  • 14. Implicit Outer Product Sometimes you care about the interaction of two sets of features (ad features x query features, news features x user features, etc...). Choices: 1 Expand the set of features explicitly, consuming n2 disk space. 2 Expand the features dynamically in the core of your learning algorithm. Option (2) is x10 faster. You need to be comfortable with hashes rst.
  • 15. The tricks Basic VW Newer Algorithmics Parallel Stu Feature Caching Adaptive Learning Parameter Averaging Feature Hashing Importance Updates Nonuniform Average Online Learning Dim. Correction Gradient Summing Implicit Features L-BFGS Hadoop AllReduce Hybrid Learning Next: algorithmics.
  • 16. Adaptive Learning [DHS10,MS10] For example t , let g it = 2(y − y )xit . ˆ
  • 17. Adaptive Learning [DHS10,MS10] For example t , let g it = 2(y − y )xit . ˆ New update rule: w i ← wi − η √Pit,t +1 g 2 t =1 git
  • 18. Adaptive Learning [DHS10,MS10] For example t , let g it = 2(y − y )xit . ˆ New update rule: w i ← wi − η √Pit,t +1 g 2 t =1 git Common features stabilize quickly. Rare features can have large updates.
  • 19. Learning with importance weights [KL11] y
  • 20. Learning with importance weights [KL11] wt x y
  • 21. Learning with importance weights [KL11] −η( ) x wt x y
  • 22. Learning with importance weights [KL11] −η( ) x wt x wt+1 x y
  • 23. Learning with importance weights [KL11] −6η( ) x wt x y
  • 24. Learning with importance weights [KL11] −6η( ) x wt x y wt+1 x ??
  • 25. Learning with importance weights [KL11] −η( ) x wt x y wt+1 x
  • 26. Learning with importance weights [KL11] wt x wt+1 x y
  • 27. Learning with importance weights [KL11] s(h)||x||2 wt x wt+1 x y
  • 28. Robust results for unweighted problems astro - logistic loss spam - quantile loss 0.97 0.98 0.96 0.97 0.96 0.95 0.95 standard standard 0.94 0.94 0.93 0.93 0.92 0.92 0.91 0.91 0.9 0.9 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 importance aware importance aware rcv1 - squared loss webspam - hinge loss 0.95 1 0.945 0.99 0.94 0.98 0.935 0.97 0.93 0.96 standard standard 0.925 0.95 0.92 0.94 0.915 0.93 0.91 0.92 0.905 0.91 0.9 0.9 0.9 0.905 0.91 0.915 0.92 0.925 0.93 0.935 0.94 0.945 0.95 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 importance aware importance aware
  • 29. Dimensional Correction Gradient of squared loss = ∂(fw (x )−y )2 ∂ wi f x = 2( w ( ) − y )x i and change weights in the negative gradient direction: ∂(fw (x ) − y )2 w i ← wi − η ∂ wi
  • 30. Dimensional Correction Gradient of squared loss = ∂(fw (x )−y )2 ∂ wi = 2( w ( ) − f x y )x i and change weights in the negative gradient direction: ∂(fw (x ) − y )2 w i ← wi − η ∂ wi But the gradient has intrinsic problems. w naturally has units i of 1/i since doubling x implies halving w to get the same i i prediction. ⇒ Update rule has mixed units!
  • 31. Dimensional Correction Gradient of squared loss = ∂(fw (x )−y )2 ∂ wi = 2( w ( ) − f x y )x i and change weights in the negative gradient direction: ∂(fw (x ) − y )2 wi ← wi − η ∂ wi But the gradient has intrinsic problems. w naturally has unitsi of 1/i since doubling x implies halving w to get the same i i prediction. ⇒ Update rule has mixed units! A crude x: divide update by x 2. It helps much! i i (fw (x )−y )2 This is scary! The problem optimized is minw P 2 x ,y i xi rather than minw x ,y (fw (x ) − y )2 . But it works.
  • 32. LBFGS [Nocedal80] Batch(!) second order algorithm. Core idea = ecient approximate Newton step.
  • 33. LBFGS [Nocedal80] Batch(!) second order algorithm. Core idea = ecient approximate Newton step. H = ∂ (∂w (i ∂)−j ) 2 f x y 2 = Hessian. w w Newton step = w → w + H −1g .
  • 34. LBFGS [Nocedal80] Batch(!) second order algorithm. Core idea = ecient approximate Newton step. H = ∂ (∂w (i ∂)−j ) 2 f x y 2 = Hessian. w w Newton step = w → w + H −1g . Newton fails: you can't even represent H. Instead build up approximate inverse Hessian according to: ∆w ∆Tw where ∆ is a change in weights ∆T ∆g w w w and ∆g is a change in the loss gradient g.
  • 35. Hybrid Learning Online learning is GREAT for getting to a good solution fast. LBFGS is GREAT for getting a perfect solution.
  • 36. Hybrid Learning Online learning is GREAT for getting to a good solution fast. LBFGS is GREAT for getting a perfect solution. Use Online Learning, then LBFGS 0.484 0.55 0.482 0.5 0.48 0.45 0.478 auPRC auPRC 0.476 0.4 0.474 0.35 0.472 0.3 0.47 Online Online L−BFGS w/ 5 online passes 0.468 L−BFGS w/ 5 online passes 0.25 L−BFGS w/ 1 online pass L−BFGS w/ 1 online pass L−BFGS 0.466 L−BFGS 0.2 0 10 20 30 40 50 0 5 10 15 20 Iteration Iteration
  • 37. The tricks Basic VW Newer Algorithmics Parallel Stu Feature Caching Adaptive Learning Parameter Averaging Feature Hashing Importance Updates Nonuniform Average Online Learning Dim. Correction Gradient Summing Implicit Features L-BFGS Hadoop AllReduce Hybrid Learning Next: Parallel.
  • 38. Applying for a fellowship in 1997
  • 39. Applying for a fellowship in 1997 Interviewer: So, what do you want to do?
  • 40. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I'd like to solve AI.
  • 41. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I'd like to solve AI. I: How?
  • 42. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I'd like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines!
  • 43. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I'd like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines! I: You fool! The only thing parallel machines are good for is computational windtunnels!
  • 44. Applying for a fellowship in 1997 Interviewer: So, what do you want to do? John: I'd like to solve AI. I: How? J: I want to use parallel learning algorithms to create fantastic learning machines! I: You fool! The only thing parallel machines are good for is computational windtunnels! The worst part: he had a point.
• 45. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?
• 46. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i? 2.1T sparse features, 17B examples, 16M parameters, 1K nodes.
• 47. Terascale Linear Learning ACDL11 Given 2.1 Terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i? 2.1T sparse features, 17B examples, 16M parameters, 1K nodes. 70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ we beat all possible single-machine linear learning algorithms.
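A quick arithmetic check on that throughput figure (my own back-of-envelope, not on the slide):

    \frac{2.1 \times 10^{12}\ \text{features}}{70\ \text{min} \times 60\ \text{s/min}}
    = \frac{2.1 \times 10^{12}}{4200\ \text{s}}
    = 5 \times 10^{8}\ \text{features/s}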
• 48. Compare: Other Supervised Algorithms in the Parallel Learning book. [Figure: bar chart of speed per method in features/second (log scale, 100 to 1e9): RBF-SVM MPI?-500 on RCV1, Ensemble Tree MPI-128 on synthetic data, RBF-SVM TCP-48 on MNIST 220K, Decision Tree MapRed-200 on Ad-Bounce, Boosted DT MPI-32 on Ranking, Linear Threads-2 on RCV1, and Linear Hadoop+TCP-1000 on Ads; the parallel linear learner reaches the highest throughput.]
  • 49. MPI-style AllReduce Allreduce initial state 5 7 6 1 2 3 4
  • 50. MPI-style AllReduce Allreduce final state 28 28 28 28 28 28 28
  • 51. MPI-style AllReduce Create Binary Tree 7 5 6 1 2 3 4
  • 52. MPI-style AllReduce Reducing, step 1 7 8 13 1 2 3 4
  • 53. MPI-style AllReduce Reducing, step 2 28 8 13 1 2 3 4
  • 54. MPI-style AllReduce Broadcast, step 1 28 28 28 1 2 3 4
  • 55. MPI-style AllReduce Allreduce final state 28 28 28 28 28 28 28 AllReduce = Reduce+Broadcast
  • 56. MPI-style AllReduce Allreduce final state 28 28 28 28 28 28 28 AllReduce = Reduce+Broadcast Properties: 1 Easily pipelined so no latency concerns. 2 Bandwidth ≤ 6n . 3 No need to rewrite code!
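A minimal in-memory sketch of what reduce-then-broadcast computes on a binary tree (illustrative only; the real implementation passes partial sums over sockets between machines, and the Node type here is my own):

    #include <cstddef>
    #include <initializer_list>
    #include <vector>

    struct Node {
        std::vector<double> values;   // this node's local vector
        Node* left  = nullptr;
        Node* right = nullptr;
    };

    // Reduce phase: sum of the local vectors in the subtree rooted at n.
    std::vector<double> reduce(const Node* n) {
        std::vector<double> sum = n->values;
        for (const Node* child : {n->left, n->right})
            if (child) {
                std::vector<double> partial = reduce(child);
                for (std::size_t i = 0; i < sum.size(); ++i)
                    sum[i] += partial[i];
            }
        return sum;
    }

    // Broadcast phase: every node ends up holding the global sum.
    void broadcast(Node* n, const std::vector<double>& total) {
        n->values = total;
        if (n->left)  broadcast(n->left,  total);
        if (n->right) broadcast(n->right, total);
    }

    void allreduce(Node* root) { broadcast(root, reduce(root)); }

In practice the vector is processed chunk by chunk, so the reduce and broadcast phases pipeline and latency is not a concern.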
• 57. An Example Algorithm: Weight averaging. n = AllReduce(1). While (pass number < max): 1 While (examples left): 1 Do online update. 2 AllReduce(weights). 3 For each weight, w ← w/n.
• 58. An Example Algorithm: Weight averaging. n = AllReduce(1). While (pass number < max): 1 While (examples left): 1 Do online update. 2 AllReduce(weights). 3 For each weight, w ← w/n. Other algorithms implemented: 1 Nonuniform averaging for online learning 2 Conjugate Gradient 3 LBFGS
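A sketch of that weight-averaging loop, assuming a hypothetical allreduce_sum() primitive that sums a vector across all nodes (e.g. the tree pattern sketched above); the function and type names are my own:

    #include <vector>

    // Assumed to be provided by the AllReduce layer: element-wise sum of
    // `local` across every participating node, returned to all of them.
    std::vector<double> allreduce_sum(const std::vector<double>& local);

    void averaged_training(std::vector<double>& w, int max_passes) {
        // n = AllReduce(1): every node contributes 1, so the sum is the
        // number of participating nodes.
        double n = allreduce_sum(std::vector<double>{1.0})[0];

        for (int pass = 0; pass < max_passes; ++pass) {
            // ... online updates over this node's shard of examples,
            //     modifying w in place ...
            w = allreduce_sum(w);             // sum the weight vectors
            for (double& wi : w) wi /= n;     // average them
        }
    }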
  • 59. What is Hadoop AllReduce? Program Data 1 Map job moves program to data.
• 60. What is Hadoop AllReduce? Program Data 1 Map job moves program to data. 2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data.
• 61. What is Hadoop AllReduce? Program Data 1 Map job moves program to data. 2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data. 3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all the data once.
  • 62. Approach Used 1 Optimize hard so few data passes required. 1 Normalized, adaptive, safe, online, gradient descent. 2 L-BFGS 3 Use (1) to warmstart (2).
  • 63. Approach Used 1 Optimize hard so few data passes required. 1 Normalized, adaptive, safe, online, gradient descent. 2 L-BFGS 3 Use (1) to warmstart (2). 2 Use map-only Hadoop for process control and error recovery.
  • 64. Approach Used 1 Optimize hard so few data passes required. 1 Normalized, adaptive, safe, online, gradient descent. 2 L-BFGS 3 Use (1) to warmstart (2). 2 Use map-only Hadoop for process control and error recovery. 3 Use AllReduce code to sync state.
• 65. Approach Used 1 Optimize hard so few data passes required. 1 Normalized, adaptive, safe, online, gradient descent. 2 L-BFGS 3 Use (1) to warmstart (2). 2 Use map-only Hadoop for process control and error recovery. 3 Use AllReduce code to sync state. 4 Always save input examples in a cache file to speed later passes.
• 66. Approach Used 1 Optimize hard so few data passes required. 1 Normalized, adaptive, safe, online, gradient descent. 2 L-BFGS 3 Use (1) to warmstart (2). 2 Use map-only Hadoop for process control and error recovery. 3 Use AllReduce code to sync state. 4 Always save input examples in a cache file to speed later passes. 5 Use the hashing trick to reduce input complexity.
• 67. Approach Used 1 Optimize hard so few data passes required. 1 Normalized, adaptive, safe, online, gradient descent. 2 L-BFGS 3 Use (1) to warmstart (2). 2 Use map-only Hadoop for process control and error recovery. 3 Use AllReduce code to sync state. 4 Always save input examples in a cache file to speed later passes. 5 Use the hashing trick to reduce input complexity (sketched below). Open source in Vowpal Wabbit 6.1. Search for it.
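A small illustration of item 5, the hashing trick: map a feature name straight to an index in a fixed-size weight array, so no string-to-index dictionary is needed. (VW itself uses a murmur-style hash; the FNV-1a hash and 18-bit table here are just for illustration.)

    #include <cstddef>
    #include <cstdint>
    #include <string>

    const std::size_t NUM_BITS = 18;                           // 2^18 weight slots
    const std::size_t MASK = (std::size_t(1) << NUM_BITS) - 1;

    std::size_t feature_index(const std::string& name) {
        std::uint64_t h = 1469598103934665603ull;              // FNV-1a offset basis
        for (unsigned char c : name) {
            h ^= c;
            h *= 1099511628211ull;                             // FNV-1a prime
        }
        return static_cast<std::size_t>(h) & MASK;             // fold into the table
    }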
• 68. Robustness & Speedup [Figure: speedup vs. number of nodes (10 to 100), with Average_10, Min_10, and Max_10 curves plotted against a linear-speedup reference.]
• 69. Splice Site Recognition [Figure: auPRC vs. iteration for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.]
• 70. Splice Site Recognition [Figure: auPRC vs. effective number of passes over the data, comparing L-BFGS with one online pass against Zinkevich et al. and Dekel et al.]
• 71. To learn more The wiki has tutorials, examples, and help: https://github.com/JohnLangford/vowpal_wabbit/wiki Mailing list: vowpal_wabbit@yahoo.com Various discussion on the Machine Learning (Theory) blog: http://hunch.net
• 72. Bibliography: Original VW
  Caching: L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
  Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
  Hashing: Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S.V.N. Vishwanathan, Hash Kernels for Structured Data, AISTATS 2009.
  Hashing: K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009.
• 73. Bibliography: Algorithmics
  L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773-782, 1980.
  Adaptive: H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
  Adaptive: J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.
  Safe: N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.
• 74. Bibliography: Parallel
  grad sum: C. Teo, Q. Le, A. Smola, and S.V.N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
  avg. 1: G. Mann et al., Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.
  avg. 2: K. Hall, S. Gilpin, and G. Mann, MapReduce/Bigtable for Distributed Optimization, LCCC 2010.
  ov. avg: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
  P. online: D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.
  D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, http://arxiv.org/abs/1012.1367.
• 75. Vowpal Wabbit Goals for Future Development 1 Native learning reductions. Just like more complicated losses. In development now. 2 Librarification, so people can use VW in their favorite language. 3 Other learning algorithms, as interest dictates. 4 Various further optimizations. (Allreduce can be improved by a factor of 3...)
• 76. Reductions Goal: minimize a loss ℓ on D. Transform D into D′, run an algorithm for optimizing 0/1 loss on D′ to get h, then transform h with small ℓ_{0/1}(h, D′) into R_h with small ℓ(R_h, D), such that if h does well on (D′, ℓ_{0/1}), R_h is guaranteed to do well on (D, ℓ).
• 77. The transformation R = transformer from complex examples to simple examples. R⁻¹ = transformer from simple predictions to a complex prediction.
• 78. Example: One Against All Create k binary regression problems, one per class. For class i, predict "Is the label i or not?" (x, y) → (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k)). Multiclass prediction: evaluate all the classifiers and choose the largest-scoring label.
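A compact sketch of the One-Against-All reduction itself (not VW's oaa.cc, which the next slide outlines; the BinaryLearner interface is my own stand-in for the base learner):

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Example { std::vector<double> x; std::size_t label; };  // label in 1..k

    // Assumed base binary learner: learn(features, 0/1 target) and
    // score(features) -> real-valued prediction.
    struct BinaryLearner {
        std::function<void(const std::vector<double>&, double)> learn;
        std::function<double(const std::vector<double>&)> score;
    };

    // R: turn one multiclass example into k binary "is it class i?" examples.
    void oaa_learn(std::vector<BinaryLearner>& base, const Example& ec) {
        for (std::size_t i = 0; i < base.size(); ++i)
            base[i].learn(ec.x, ec.label == i + 1 ? 1.0 : 0.0);
    }

    // R^-1: the multiclass prediction is the label of the highest-scoring classifier.
    std::size_t oaa_predict(const std::vector<BinaryLearner>& base,
                            const std::vector<double>& x) {
        std::size_t best = 1;
        double best_score = base[0].score(x);
        for (std::size_t i = 1; i < base.size(); ++i) {
            double s = base[i].score(x);
            if (s > best_score) { best_score = s; best = i + 1; }
        }
        return best;
    }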
• 79. The code: oaa.cc
  // parses reduction-specific flags.
  void parse_flags(size_t s, void (*base_l)(example*), void (*base_f)())
  // Implements R and R⁻¹ using base_l.
  void learn(example* ec)
  // Cleans any temporary state and calls base_f.
  void finish()
  The important point: anything fitting this interface is easy to code in VW now, including all forms of feature diddling and creation. And reductions inherit all the input/output/optimization/parallelization of VW!
• 80. Reductions implemented 1 One-Against-All (--oaa k). The baseline multiclass reduction. 2 Cost-Sensitive One-Against-All (--csoaa k). Predicts the cost of each label and minimizes the cost. 3 Weighted All-Pairs (--wap k). An alternative to csoaa with better theory. 4 Cost-Sensitive One-Against-All with Label-Dependent Features (--csoaa_ldf). As csoaa, but features are not shared between labels. 5 WAP with Label-Dependent Features (--wap_ldf). 6 Sequence Prediction (--sequence k). A simple implementation of Searn and DAgger for sequence prediction. Uses a cost-sensitive predictor.
• 81. Reductions to Implement [Diagram: "Regret Transform Reductions", a hierarchy of learning problems connected by reductions and their regret multipliers, including Quicksort (AUC/ranking, 1), Costing (importance-weighted classification, 1), Quanting (quantile regression, 1), Probing (mean regression), ECT and PECOC (k-class classification, 4), the Offset Tree (k-partial-label), the Filter Tree (k-cost classification, k/2), PSDP (Tk) and Searn (Tk ln T) for T-step sequential problems, RL with state visitation or a demonstration policy, dynamic models, and unsupervised learning by self-prediction.]