Introduction to SVMs
SVMs

• Geometric
  – Maximizing Margin
• Kernel Methods
  – Making nonlinear decision boundaries linear
  – Efficiently!
• Capacity
  – Structural Risk Minimization
SVM History
• SVM is a classifier derived from statistical learning
  theory by Vapnik and Chervonenkis
• SVM was first introduced by Boser, Guyon and
  Vapnik in COLT-92
• SVM became famous when, using pixel maps as
  input, it gave accuracy comparable to NNs with
  hand-designed features in a handwriting
  recognition task
• SVM is closely related to:
   – kernel machines (a generalization of SVMs), large margin classifiers, reproducing kernel Hilbert spaces, Gaussian processes, and boosting
Linear Classifiers
   x → f → yest        f(x,w,b) = sign(w . x - b)
[Figure: 2-D datapoints; one symbol denotes +1, the other denotes -1, with a candidate linear separator drawn through them]
How would you classify this data?
Linear Classifiers
   x → f → yest        f(x,w,b) = sign(w . x - b)
[Figure: the same data with several different separating lines drawn]
Any of these would be fine..
..but which is best?
Classifier Margin
   x → f → yest        f(x,w,b) = sign(w . x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
   x → f → yest        f(x,w,b) = sign(w . x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.
                    Linear SVM
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us the least chance of causing a misclassification.
3. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
4. Empirically it works very very well.
A “Good” Separator
[Figure: X and O datapoints with a separating boundary between the classes]

Noise in the Observations
[Figure: the same X and O datapoints, perturbed slightly by noise]

Ruling Out Some Separators
[Figure: separators that pass close to the datapoints]

Lots of Noise
[Figure: the datapoints under larger perturbations]

Maximizing the Margin
[Figure: the separator with the widest margin to both classes]
Specifying a line and margin

[Figure: the “Predict Class = +1” zone, the Plus-Plane (w x + b = +1), the Classifier Boundary (w x + b = 0), the Minus-Plane (w x + b = -1), and the “Predict Class = -1” zone]

• How do we represent this mathematically?
• …in m input dimensions?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Classify as..   +1                 if w . x + b >= 1
                -1                 if w . x + b <= -1
                Universe explodes  if -1 < w . x + b < 1
Computing the margin width

[Figure: the Plus-Plane, Classifier Boundary, and Minus-Plane as before; M = Margin Width]

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Claim: the vector w is perpendicular to the Plus Plane. Why?
  Let u and v be two vectors on the Plus Plane. What is w . ( u – v ) ?
  And so of course the vector w is also perpendicular to the Minus Plane.
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
  (Any location in R^m: not necessarily a datapoint.)
• Claim: x+ = x- + λ w for some value of λ. Why?
  The line from x- to x+ is perpendicular to the planes,
  so to get from x- to x+ we travel some distance in direction w.

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
It's now easy to get M in terms of w and b:

  w . (x- + λ w) + b = 1
  =>  w . x- + b + λ w.w = 1
  =>  -1 + λ w.w = 1
  =>  λ = 2 / (w.w)

  M = |x+ - x-| = |λ w| = λ |w| = λ sqrt(w.w)
    = 2 sqrt(w.w) / (w.w)
    = 2 / sqrt(w.w)
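A tiny numeric illustration of this result (my own sketch, assuming numpy is available): pick a w and b, start from a point on the minus-plane, step along w by λ = 2/(w.w), and check that we land on the plus-plane a distance 2/sqrt(w.w) away.

```python
# Sketch: verify M = 2 / sqrt(w.w) numerically for one arbitrary w, b.
import numpy as np

w = np.array([3.0, 4.0])                 # sqrt(w.w) = 5
b = -2.0
x_minus = np.array([0.0, 0.25])          # satisfies w.x + b = -1 (on the minus-plane)
lam = 2.0 / (w @ w)                      # lambda = 2 / (w.w)
x_plus = x_minus + lam * w               # the closest plus-plane point

print(w @ x_plus + b)                    # 1.0, so x_plus really is on the plus-plane
print(np.linalg.norm(x_plus - x_minus))  # 0.4 = 2 / sqrt(w.w)
```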
Learning the Maximum Margin Classifier

[Figure: the same plus-plane / boundary / minus-plane picture; M = Margin Width = 2 / sqrt(w.w)]

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Don’t worry…
                                                    it’s good for
                                                        you…
• Linear Programming

   find w
   argmax c⋅w
   subject to
      w⋅ai ≤ bi, for i = 1, …, m
      wj ≥ 0 for j = 1, …, n
There are fast algorithms for solving linear programs including the
simplex algorithm and Karmarkar’s algorithm
Learning via Quadratic Programming

• QP is a well-studied class of optimization problems: maximize a quadratic function of some real-valued variables subject to linear constraints.
Quadratic Programming

Find   arg max   c + dT u + uT R u / 2          (quadratic criterion)
          u

Subject to n additional linear inequality constraints:
   a11 u1 + a12 u2 + ... + a1m um ≤ b1
   a21 u1 + a22 u2 + ... + a2m um ≤ b2
      :
   an1 u1 + an2 u2 + ... + anm um ≤ bn

And subject to e additional linear equality constraints:
   a(n+1)1 u1 + a(n+1)2 u2 + ... + a(n+1)m um = b(n+1)
   a(n+2)1 u1 + a(n+2)2 u2 + ... + a(n+2)m um = b(n+2)
      :
   a(n+e)1 u1 + a(n+e)2 u2 + ... + a(n+e)m um = b(n+e)
Quadratic Programming

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent.
(But they're very fiddly… you probably don't want to write one yourself.)
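For instance, here is a minimal sketch of handing a toy problem of the form above to one such off-the-shelf solver. I am assuming the cvxopt package (the slides do not name a solver); its solvers.qp routine minimizes (1/2)uTPu + qTu subject to Gu ≤ h and Au = b, so maximizing c + dTu + uTRu/2 means passing P = -R and q = -d.

```python
# Sketch: solve  maximize d.u + u'Ru/2  subject to u1 + u2 <= 1  with cvxopt.
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

R = np.array([[-2.0, 0.0],               # toy quadratic term (negative definite,
              [0.0, -2.0]])              # so the maximization is well posed)
d = np.array([1.0, 1.0])                 # toy linear term
G = np.array([[1.0, 1.0]])               # one inequality constraint: u1 + u2 <= 1
h = np.array([1.0])

sol = solvers.qp(matrix(-R), matrix(-d), matrix(G), matrix(h))
u = np.array(sol['x']).ravel()
print(u)                                 # about [0.5, 0.5] for this toy problem
```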
Learning the Maximum Margin Classifier

[Figure: the plus-plane, boundary, and minus-plane as before; M = 2 / sqrt(w.w)]

Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?
  Minimize w.w

How many constraints will we have?  R
What should they be?
  w . xk + b >= 1   if yk = 1
  w . xk + b <= -1  if yk = -1
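Putting the pieces together, a sketch (mine, not the slides') of posing exactly this problem as a QP over u = (w1, …, wm, b): minimize (1/2) w.w subject to the R constraints yk (w . xk + b) ≥ 1, again assuming cvxopt as in the previous sketch.

```python
# Sketch: hard-margin linear SVM as a QP in the variables (w, b).
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy, separable
y = np.array([1.0, 1.0, -1.0, -1.0])
R, m = X.shape

P = np.eye(m + 1)
P[m, m] = 1e-8                           # tiny ridge on b keeps P strictly positive definite
q = np.zeros(m + 1)
G = -np.hstack([y[:, None] * X, y[:, None]])   # rows: -y_k [x_k, 1]
h = -np.ones(R)                                #   so  y_k (w.x_k + b) >= 1

sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
u = np.array(sol['x']).ravel()
w, b = u[:m], u[m]
print(w, b, np.sign(X @ w + b))          # every point ends up in its correct half-plane
```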
This is going to be a problem!
         Uh-oh!
                  What should we do?
denotes +1        Idea 1:
denotes -1
                    Find minimum w.w, while
                    minimizing number of
                    training set errors.
                       Problem: Two things to
                       minimize makes for an
                       ill-defined optimization
This is going to be a problem!
         Uh-oh!
                  What should we do?
denotes +1        Idea 1.1:
denotes -1
                    Minimize
                     w.w + C (#train errors)
                         Tradeoff parameter



                  There’s a serious practical
                  problem that’s about to make
                  us reject this approach. Can
                  you guess what it is?
Can't be expressed as a Quadratic Programming problem.
Solving it may be too slow.
(Also, doesn't distinguish between disastrous errors and near misses)
So… any other ideas?
This is going to be a problem!
         Uh-oh!
                  What should we do?
denotes +1        Idea 2.0:
denotes -1
                    Minimize
                    w.w + C (distance of error
                             points to their
                             correct place)
Learning Maximum Margin with Noise

[Figure: boundary w x + b = 0 with the planes w x + b = +1 and w x + b = -1; M = 2 / sqrt(w.w)]

Given a guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?
Large-margin Decision Boundary
•   The decision boundary should be as far away from the data of both classes as possible
     – We should maximize the margin, m
     – Distance between the origin and the line wTx = k is k/||w||

[Figure: Class 1 and Class 2 point clouds separated by a boundary with margin m]
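A one-line numeric check of that distance formula (my own sketch): the closest point of the hyperplane wTx = k to the origin lies along w, at distance k/||w||.

```python
# Sketch: distance from the origin to the hyperplane w'x = k is k / ||w||.
import numpy as np

w = np.array([3.0, 4.0])                     # ||w|| = 5
k = 10.0
x_closest = (k / (w @ w)) * w                # closest point on the plane lies along w
print(np.linalg.norm(x_closest))             # 2.0
print(k / np.linalg.norm(w))                 # 2.0 as well
```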
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1,-1} be the class label of xi
• The decision boundary should classify all points correctly ⇒
     yi (wT xi + b) ≥ 1,  for all i
• The decision boundary can be found by solving the following constrained optimization problem
     Minimize (1/2) ||w||2
     subject to yi (wT xi + b) ≥ 1, for all i
• This is a constrained optimization problem. Solving it requires some new tools
   – Feel free to ignore the following several slides; what is important is the constrained optimization problem above
Back to the Original Problem

• The Lagrangian is
     L = (1/2) wTw − Σi αi ( yi (wT xi + b) − 1 ),     with αi ≥ 0
   – Note that ||w||2 = wTw
• Setting the gradient of L w.r.t. w and b to zero, we have
     w = Σi αi yi xi
     Σi αi yi = 0
• The Karush-Kuhn-Tucker conditions are
     αi ≥ 0
     yi (wT xi + b) − 1 ≥ 0
     αi ( yi (wT xi + b) − 1 ) = 0      for all i
The Dual Problem
• If we substitute w = Σi αi yi xi into the Lagrangian, we have
     L = Σi αi − (1/2) Σi Σj αi αj yi yj xiT xj
• Note that Σi αi yi = 0, so the term involving b vanishes
• This is a function of αi only
The Dual Problem
•   The new objective function is in terms of αi only
•   It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
•   The original problem is known as the primal problem
•   The objective function of the dual problem needs to be maximized!
•   The dual problem is therefore:
       max over α:   Σi αi − (1/2) Σi Σj αi αj yi yj xiT xj
       subject to    αi ≥ 0            (properties of αi when we introduce the Lagrange multipliers)
                     Σi αi yi = 0      (the result when we differentiate the original Lagrangian w.r.t. b)
The Dual Problem
       max over α:   Σi αi − (1/2) Σi Σj αi αj yi yj xiT xj
       subject to    αi ≥ 0,   Σi αi yi = 0

•   This is a quadratic programming (QP) problem
     – A global maximum of αi can always be found
•   w can be recovered by
       w = Σi αi yi xi
Characteristics of the Solution
•   Many of the αi are zero
     – w is a linear combination of a small number of data points
     – This “sparse” representation can be viewed as data compression, as in the construction of a k-NN classifier
•   xi with non-zero αi are called support vectors (SV)
     – The decision boundary is determined only by the SV
     – Let tj (j=1, ..., s) be the indices of the s support vectors. We can write
          w = Σj=1..s αtj ytj xtj
•   For testing with a new data z
     – Compute wTz + b = Σj=1..s αtj ytj (xtjT z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
     – Note: w need not be formed explicitly
A Geometrical Interpretation
[Figure: Class 1 and Class 2 points with the decision boundary; the support vectors on the margin have α8 = 0.6, α1 = 0.8, α6 = 1.4, while all other points have αi = 0 (α2 = α3 = α4 = α5 = α7 = α9 = α10 = 0)]
Non-linearly Separable Problems
•   We allow “error” ξi in classification; it is based on the output of the discriminant function wTx + b
•   ξi approximates the number of misclassified samples

[Figure: Class 1 and Class 2 overlap; misclassified or margin-violating points have slack ξi]
Learning Maximum Margin with Noise

[Figure: the planes w x + b = +1 and w x + b = -1, with slack distances ε2, ε7, ε11 for points on the wrong side of their zones; M = 2 / sqrt(w.w)]

Given a guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = +/- 1.

What should our quadratic optimization criterion be?
  Minimize (1/2) w.w + C Σk=1..R εk

How many constraints will we have?
What should they be?
  w . xk + b >= 1 - εk   if yk = 1
  w . xk + b <= -1 + εk  if yk = -1
Learning Maximum Margin with Noise
m = # input dimensions,  R = # records
Our original (noiseless data) QP had m+1 variables: w1, w2, … wm, and b.
Our new (noisy data) QP has m+1+R variables: w1, w2, … wm, b, ε1, ε2, … εR.
There's a bug in this QP. Can you spot it?
Learning Maximum Margin with Noise
How many constraints will we have?  2R
What should they be?
  w . xk + b >= 1 - εk   if yk = 1
  w . xk + b <= -1 + εk  if yk = -1
  εk >= 0   for all k
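This soft-margin criterion is what standard SVM libraries solve. A minimal sketch using scikit-learn (my choice of library, not the slides'); its C parameter plays the role of the tradeoff parameter C above.

```python
# Sketch: soft-margin linear SVM on noisy, overlapping toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)),   # class -1 cloud
               rng.normal(+1.5, 1.0, (50, 2))])  # class +1 cloud (they overlap)
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)      # larger C = pay more for slack
print(clf.coef_, clf.intercept_)                 # the learned w and b
print(clf.n_support_)                            # number of support vectors per class
```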
An Equivalent Dual QP

The original (primal) QP:
  Minimize (1/2) w.w + C Σk=1..R εk
  subject to
    w . xk + b >= 1 - εk   if yk = 1
    w . xk + b <= -1 + εk  if yk = -1
    εk >= 0,   for all k

The equivalent dual QP:
  Maximize  Σk=1..R αk − (1/2) Σk=1..R Σl=1..R αk αl Qkl    where Qkl = yk yl (xk . xl)
  Subject to these constraints:  0 ≤ αk ≤ C for all k,  and  Σk=1..R αk yk = 0
An Equivalent Dual QP
  Maximize  Σk=1..R αk − (1/2) Σk=1..R Σl=1..R αk αl Qkl    where Qkl = yk yl (xk . xl)
  Subject to these constraints:  0 ≤ αk ≤ C for all k,  and  Σk=1..R αk yk = 0

Then define:
  w = Σk=1..R αk yk xk
  b = yK (1 − εK) − xK . w    where K = argmaxk αk

Then classify with:
  f(x,w,b) = sign(w . x - b)
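A small numpy sketch (mine, not the slides') of turning a dual solution α into w and b using exactly the definitions above; the α values themselves are assumed to come from any QP solver.

```python
# Sketch: recover w and b from a dual solution alpha, following the slide:
#   w = sum_k alpha_k y_k x_k,   b = y_K (1 - eps_K) - x_K . w,   K = argmax_k alpha_k
# (eps_K is taken as 0 here).
import numpy as np

def recover_w_b(alpha, X, y):
    w = (alpha * y) @ X            # sum_k alpha_k y_k x_k
    K = np.argmax(alpha)           # index of the largest alpha
    b = y[K] - X[K] @ w            # y_K (1 - 0) - x_K . w
    return w, b

# With b defined this way, the constraints read w.x_k + b >= 1 - eps_k,
# so the matching decision rule uses +b.
def classify(x, w, b):
    return np.sign(w @ x + b)
```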
Example XOR problem revisited:

Let the nonlinear mapping be:
     φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)T
And: φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)T

Therefore the feature space is in 6D with input data in 2D

      x1 = (-1,-1),   d1 = -1
      x2 = (-1, 1),   d2 =  1
      x3 = ( 1,-1),   d3 =  1
      x4 = ( 1, 1),   d4 = -1
Q(a) = Σ ai – ½ Σ Σ ai aj di dj φ(xi)Tφ(xj)
     = a1 + a2 + a3 + a4 – ½(9a1a1 − 2a1a2 − 2a1a3 + 2a1a4
         + 9a2a2 + 2a2a3 − 2a2a4 + 9a3a3 − 2a3a4 + 9a4a4)

To optimize Q, we only need to calculate
     ∂Q(a)/∂ai = 0,   i = 1, ..., 4
(due to the optimality conditions), which gives
      1 =  9a1 − a2 − a3 + a4
      1 = −a1 + 9a2 + a3 − a4
      1 = −a1 + a2 + 9a3 − a4
      1 =  a1 − a2 − a3 + 9a4
The solution of which gives the optimal values:
a0,1 = a0,2 = a0,3 = a0,4 = 1/8

w0 = Σ a0,i di φ(xi)
   = 1/8 [ −φ(x1) + φ(x2) + φ(x3) − φ(x4) ]
   = 1/8 [ −(1, 1, √2, 1, −√2, −√2)T + (1, 1, −√2, 1, −√2, √2)T
           + (1, 1, −√2, 1, √2, −√2)T − (1, 1, √2, 1, √2, √2)T ]
   = (0, 0, −1/√2, 0, 0, 0)T

Where the first element of w0 gives the bias b (here b = 0)
From earlier we have that the optimal hyperplane is defined by:
              w0T φ(x) = 0
That is:
     w0T φ(x) = (0, 0, −1/√2, 0, 0, 0) (1, x1², √2 x1x2, x2², √2 x1, √2 x2)T = −x1 x2 = 0

which is the optimal decision boundary for the XOR problem.
Furthermore we note that the solution is unique since the optimal decision boundary is unique.
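A quick numeric check of this worked example (my own sketch): build the kernel matrix with K(xi, xj) = (xiTxj + 1)², solve the four stationarity equations above for the ai (a shortcut that is valid here because all four points turn out to be support vectors), and confirm that the boundary Σ ai di K(x, xi) reduces to −x1x2.

```python
# Sketch: verify the XOR solution a_i = 1/8 and the boundary -x1*x2.
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])

K = (X @ X.T + 1) ** 2                  # kernel matrix: 9 on the diagonal, 1 elsewhere
H = np.outer(d, d) * K                  # quadratic term of Q(a)
a = np.linalg.solve(H, np.ones(4))      # dQ/da_i = 0  ->  H a = 1
print(a)                                # [0.125 0.125 0.125 0.125]

def boundary(x):
    return sum(a[i] * d[i] * (X[i] @ x + 1) ** 2 for i in range(4))

x_test = np.array([0.5, -2.0])
print(boundary(x_test), -x_test[0] * x_test[1])   # both 1.0
```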
[Figures: SVM outputs (decision regions) for a polynomial kernel and for an RBF kernel]
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer?
Let's permit them here too
[Figure: 1-D datapoints around x = 0 that no single threshold can separate]
     zk = ( xk , xk² )
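A tiny sketch of this trick on made-up 1-D data (my own example): after mapping each xk to zk = (xk, xk²), a single line in z-space separates classes that no threshold on x alone could.

```python
# Sketch: make a 1-D "inner vs outer" dataset linearly separable via z = (x, x^2).
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])        # outer points vs inner points

Z = np.column_stack([x, x ** 2])              # z_k = (x_k, x_k^2)
w, b = np.array([0.0, 1.0]), -2.0             # the line x^2 = 2 in z-space
print(np.sign(Z @ w + b) == y)                # all True: separable in z-space
```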
For a non-linearly separable problem we have to first map the data onto a feature space so that they are linearly separable

                        xi  →  φ(xi)

Given the training data sample {(xi, yi), i = 1, …, N}, find the optimum values of the weight vector w and bias b
                   w = Σ a0,i yi φ(xi)
where a0,i are the optimal Lagrange multipliers determined by maximizing the following objective function

     Q(a) = Σi=1..N ai − ½ Σi=1..N Σj=1..N ai aj di dj φ(xi)Tφ(xj)

subject to the constraints
     Σ ai yi = 0 ;    ai ≥ 0
SVM building procedure:

• Pick a nonlinear mapping φ
• Solve for the optimal weight vector

However: how do we pick the function φ?

• In practical applications, if it is not totally
  impossible to find φ, it is very hard
• In the previous example, the function φ is quite
  complex: How would we find it?

Answer: the Kernel Trick
Notice that in the dual problem the images of the input vectors are involved only through an inner product, meaning that the optimization can be performed in the (lower-dimensional) input space and that the inner product can be replaced by an inner-product kernel:

     Q(a) = Σi=1..N ai − ½ Σi,j=1..N ai aj di dj φ(xj)Tφ(xi)
          = Σi=1..N ai − ½ Σi,j=1..N ai aj di dj K(xi, xj)

How do we relate the output of the SVM to the kernel K?

Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulations.

The hyperplane is defined by
     Σj=1..m1 wj φj(x) + b = 0
or
     Σj=0..m1 wj φj(x) = 0,   where φ0(x) = 1

writing : φ(x) = [φ0(x), φ1(x), ..., φm1(x)]
we get :  wT φ(x) = 0

from optimality conditions :  w = Σi=1..N ai di φ(xi)

Thus :  Σi=1..N ai di φ(xi)T φ(x) = 0

and so the boundary is :  Σi=1..N ai di K(x, xi) = 0

and Output = wT φ(x) = Σi=1..N ai di K(x, xi)

where : K(x, xi) = Σj=0..m1 φj(x) φj(xi)
In the XOR problem, we chose to use the kernel function:
       K(x, xi) = (xT xi + 1)²
                = 1 + x1² xi1² + 2 x1x2 xi1xi2 + x2² xi2² + 2 x1xi1 + 2 x2xi2

Which implied the form of our nonlinear functions:
     φ(x) = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)T
And: φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)T

However, we did not need to calculate φ at all and could simply have used the kernel to calculate:
       Q(a) = Σ ai – ½ Σ Σ ai aj di dj K(xi, xj)

Maximized and solved for ai and derived the hyperplane via:
       Σi=1..N ai di K(x, xi) = 0
We therefore only need a suitable choice of kernel function, cf. Mercer's Theorem:

    Let K(x,y) be a continuous symmetric kernel defined on the closed interval [a,b]. The kernel K can be expanded in the form
         K(x,y) = φ(x)T φ(y)
    provided it is positive definite.

Some of the usual choices for K are:

Polynomial SVM      (xT xi + 1)^p                    p specified by user
RBF SVM             exp(−1/(2σ²) || x – xi ||²)      σ specified by user
MLP SVM             tanh(s0 xT xi + s1)
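The three kernels in this list written out as plain functions, as a sketch (p, σ, s0 and s1 are the user-chosen parameters named above; the default values are mine):

```python
# Sketch: the usual kernel choices as plain functions of two input vectors.
import numpy as np

def polynomial_kernel(x, xi, p=2):
    return (x @ xi + 1) ** p

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * sigma ** 2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):
    # tanh(s0 x'xi + s1) satisfies Mercer's condition only for some (s0, s1)
    return np.tanh(s0 * (x @ xi) + s1)
```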
An Equivalent Dual QP
  Maximize  Σk=1..R αk − (1/2) Σk=1..R Σl=1..R αk αl Qkl    where Qkl = yk yl (xk . xl)
  Subject to these constraints:  0 ≤ αk ≤ C for all k,  and  Σk=1..R αk yk = 0

Then define:
  w = Σk=1..R αk yk xk
  b = yK (1 − εK) − xK . w    where K = argmaxk αk

Datapoints with αk > 0 will be the support vectors ..so this sum only needs to be over the support vectors.

Then classify with:
  f(x,w,b) = sign(w . x - b)

       
Quadratic Basis Functions

Φ(x) = ( 1,                                          ← constant term
         √2 x1, √2 x2, …, √2 xm,                     ← linear terms
         x1², x2², …, xm²,                           ← pure quadratic terms
         √2 x1x2, √2 x1x3, …, √2 x1xm,
         √2 x2x3, …, √2 x2xm, …, √2 xm−1xm )T        ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2

You may be wondering what those √2's are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon.
QP with basis functions
  Maximize  Σk=1..R αk − (1/2) Σk=1..R Σl=1..R αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
  Subject to these constraints:  0 ≤ αk ≤ C for all k,  and  Σk=1..R αk yk = 0

Then define:
  w = Σk s.t. αk>0 αk yk Φ(xk)
  b = yK (1 − εK) − w . Φ(xK)    where K = argmaxk αk

Then classify with:
  f(x,w,b) = sign(w . φ(x) - b)
QP with basis functions
We must do R²/2 dot products to get this matrix ready.
Each dot product requires m²/2 additions and multiplications.
The whole thing costs R² m² /4. Yeeks!
                              … or does it?
    1         1          1
                                   
                 2a1        2b1 
                                          +
                  2 a2       2b2 
                                        m

                  :          :        ∑ 2a b      i i
                 2 am       2bm 
                                           i =1
                                   
                 a12
                            b1 2
                                          +
                   2          2    
                 a2        b2           m

                  :          :        ∑ ai2bi2
                   2
                               2
                                          i =1

                 am        bm      
Φ(a) • Φ(b) = 
              
                 2a1a2 
                           •
                           
                              2b1b2 
                                      
                                                      Quadratic
               2a1a3   2b1b3              +
              
              
                   :       
                           
                                :     
                                      
                                                      Dot Products
               2a1am   2b1bm 
                                      m       m

                2a2 a3   2b2b3        ∑ ∑ 2a a b b      i   j i   j
                  :          :        i =1 j =i +1
                                   
                2a1am   2b1bm 
                  :          :     
              
               2a a   2b b 
                                    
                  m −1 m    m −1 m 
Just out of casual, innocent, interest,
                                                      let’s look at another function of a and
Quadratic                                             b:
                                                          (a.b + 1) 2
Dot Products                                            = (a.b) 2 + 2a.b + 1
                                                                        2
                                                               m
                                                                         m
                                                        =  ∑ ai bi  + 2∑ ai bi + 1
                                                           i =1        i =1


Φ(a) • Φ(b) =                                            m      m                   m
    m        m           m      m                     = ∑∑ ai bi a j b j + 2∑ ai bi + 1
1 + 2∑ ai bi + ∑ a b + ∑
                   2 2
                   i i        ∑ 2a a b bi   j i   j
                                                        i =1 j =1                  i =1
    i =1    i =1         i =1 j =i +1
                                                         m                  m      m                         m
                                                      = ∑ (ai bi ) + 2∑
                                                                    2
                                                                                 ∑a b a b  i i   j   j   + 2∑ ai bi + 1
                                                         i =1               i =1 j =i +1                    i =1
Just out of casual, innocent, interest,
                                                      let’s look at another function of a and
Quadratic                                             b:
                                                          (a.b + 1) 2
Dot Products                                            = (a.b) 2 + 2a.b + 1
                                                                        2
                                                               m
                                                                         m
                                                        =  ∑ ai bi  + 2∑ ai bi + 1
                                                           i =1        i =1


Φ(a) • Φ(b) =                                            m      m                   m
    m        m           m      m                     = ∑∑ ai bi a j b j + 2∑ ai bi + 1
1 + 2∑ ai bi + ∑ a b + ∑
                   2 2
                   i i        ∑ 2a a b bi   j i   j
                                                        i =1 j =1                  i =1
    i =1    i =1         i =1 j =i +1
                                                         m                  m      m                         m
                                                      = ∑ (ai bi ) + 2∑
                                                                    2
                                                                                 ∑a b a b  i i   j   j   + 2∑ ai bi + 1
                                                         i =1               i =1 j =i +1                    i =1



                                                                    They’re the same!
                                                                    And this is only O(m) to
                                                                    compute!
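A numeric check of this identity (my own sketch): build Φ explicitly as on the quadratic-basis slide and compare Φ(a)·Φ(b) against (a.b + 1)².

```python
# Sketch: the explicit quadratic expansion and the kernel give the same number.
import numpy as np
from itertools import combinations

def phi(x):
    m = len(x)
    return np.concatenate([
        [1.0],                                             # constant term
        np.sqrt(2) * x,                                    # linear terms
        x ** 2,                                            # pure quadratic terms
        [np.sqrt(2) * x[i] * x[j]
         for i, j in combinations(range(m), 2)],           # quadratic cross-terms
    ])

a = np.array([0.3, -1.2, 2.0])
b = np.array([1.5, 0.4, -0.7])
print(phi(a) @ phi(b))          # identical (up to rounding) ...
print((a @ b + 1) ** 2)         # ... to the O(m) kernel evaluation
```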
QP with Quadratic basis functions
  Maximize  Σk=1..R αk − (1/2) Σk=1..R Σl=1..R αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
  Subject to these constraints:  0 ≤ αk ≤ C for all k,  and  Σk=1..R αk yk = 0

We must do R²/2 dot products to get this matrix ready.
Each dot product now only requires m additions and multiplications.

Then define:
  w = Σk s.t. αk>0 αk yk Φ(xk)
  b = yK (1 − εK) − w . Φ(xK)    where K = argmaxk αk

Then classify with:
  f(x,w,b) = sign(w . φ(x) - b)
Higher Order Polynomials

Polynomial   φ(x)                       Cost to build Qkl      Cost if        φ(a).φ(b)   Cost to build Qkl    Cost if
                                        matrix traditionally   100 inputs                 matrix efficiently   100 inputs
Quadratic    All m²/2 terms up to       m² R² /4               2,500 R²       (a.b+1)²    m R² / 2             50 R²
             degree 2
Cubic        All m³/6 terms up to       m³ R² /12              83,000 R²      (a.b+1)³    m R² / 2             50 R²
             degree 3
Quartic      All m⁴/24 terms up to      m⁴ R² /48              1,960,000 R²   (a.b+1)⁴    m R² / 2             50 R²
             degree 4
QP with Quintic basis functions
  Maximize  Σk=1..R αk − (1/2) Σk=1..R Σl=1..R αk αl Qkl    where Qkl = yk yl (Φ(xk) . Φ(xl))
  Subject to these constraints:  0 ≤ αk ≤ C for all k,  and  Σk=1..R αk yk = 0

Then define:
  w = Σk s.t. αk>0 αk yk Φ(xk)
  b = yK (1 − εK) − w . Φ(xK)    where K = argmaxk αk

Then classify with:
  f(x,w,b) = sign(w . φ(x) - b)
QP with Quintic basis functions
We must do R²/2 dot products to get this matrix ready.
In 100-d, each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
    The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
    Because each w . φ(x) (see below) needs 75 million operations. What can be done?
QP with Quintic basis functions (same QP, final annotation)

 What can be done about the expensive evaluation phase? The same dot-product trick works at
 prediction time, so w . Φ(x) is never formed explicitly:

   w . Φ(x) = Σ(k s.t. αk > 0) αk yk Φ(xk) . Φ(x)
            = Σ(k s.t. αk > 0) αk yk (xk . x + 1)^5

 Only Sm operations (S = #support vectors).

 Then classify with:   f(x,w,b) = sign(w . φ(x) - b)
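A sketch of this prediction step (the names and data layout are assumptions, not the slides' code): w . Φ(x) is evaluated as a kernel sum over the S support vectors, so one prediction costs roughly S·m operations.

```python
import numpy as np

def decision_value(x, sv_X, sv_alpha, sv_y, b, degree=5):
    """w . Phi(x) - b computed as  sum_k alpha_k y_k (x_k . x + 1)^degree  - b,
    where the sum runs only over the S support vectors (alpha_k > 0)."""
    k_vals = (sv_X @ x + 1.0) ** degree      # S kernel evaluations, roughly S*m operations
    return float(np.sum(sv_alpha * sv_y * k_vals)) - b

def classify(x, sv_X, sv_alpha, sv_y, b, degree=5):
    """f(x, w, b) = sign(w . Phi(x) - b), returned as +1 / -1."""
    return 1 if decision_value(x, sv_X, sv_alpha, sv_y, b, degree) >= 0.0 else -1
```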
QP with Quintic basis functions: why SVMs don't overfit as much as you'd think

 Maximize   Σ(k=1..R) αk  −  ½ Σ(k=1..R) Σ(l=1..R) αk αl Qkl     where  Qkl = yk yl (Φ(xk).Φ(xl))

 Subject to these constraints:    0 ≤ αk ≤ C  ∀k            Σ(k=1..R) αk yk = 0

   No matter what the basis function, there are really only up to R parameters:
   α1, α2, ..., αR, and usually most are set to zero by the Maximum Margin.

   Asking for small w.w is like "weight decay" in Neural Nets, like the Ridge Regression
   penalty in Linear Regression, and like the use of Priors in Bayesian Regression: all
   designed to smooth the function and reduce overfitting.

 Then define:

   w = Σ(k s.t. αk > 0) αk yk Φ(xk)

   b = yK (1 − εK) − xK . wK        where K = arg maxk αk

   w . Φ(x) = Σ(k s.t. αk > 0) αk yk Φ(xk) . Φ(x) = Σ(k s.t. αk > 0) αk yk (xk . x + 1)^5

 Only Sm operations (S = #support vectors).

 Then classify with:   f(x,w,b) = sign(w . φ(x) - b)
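The weight-decay/ridge analogy the slide draws can be made concrete with a tiny sketch (toy data and function name are illustrative assumptions): the penalty lam·||w||² in closed-form ridge regression shrinks ||w||² as lam grows, just as asking for small w.w does here.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.standard_normal(50)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w||^2 = {float(w @ w):.3f}")
```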
SVM Kernel Functions

• K(a,b) = (a . b + 1)^d is an example of an SVM
  Kernel Function
• Beyond polynomials there are other very
  high dimensional basis functions that can be
  made practical by finding the right Kernel
  Function
  – Radial-Basis-style Kernel Function:

        K(a,b) = exp( −(a − b)² / (2σ²) )

  – Neural-net-style Kernel Function:

        K(a,b) = tanh(κ a.b − δ)

  σ, κ and δ are magic parameters that must be
  chosen by a model selection method such as
  CV or VCSRM
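The three kernels above as a small illustrative sketch (the parameter names d, sigma, kappa and delta mirror the slide's symbols; the code itself is an assumption, not part of the slides):

```python
import numpy as np

def polynomial_kernel(a, b, d=2):
    """K(a, b) = (a . b + 1)^d   (polynomial; d is chosen by the user)."""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2))   (Radial-Basis-style)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """K(a, b) = tanh(kappa * a . b - delta)   (Neural-net-style; a valid Mercer
    kernel only for some choices of kappa and delta)."""
    return np.tanh(kappa * np.dot(a, b) - delta)
```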
SVM Implementations


• Sequential Minimal Optimization (SMO): an
  efficient algorithm for training SVMs, due to Platt
  – implemented in Weka
• SVMlight
  – http://svmlight.joachims.org/
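Purely as a present-day illustration (scikit-learn is not mentioned in the slides), the quintic-kernel SVM of the previous slides can also be trained with sklearn.svm.SVC, whose underlying LIBSVM solver uses an SMO-style decomposition; setting kernel="poly", degree=5, gamma=1, coef0=1 corresponds to K(a,b) = (a.b + 1)^5. The toy data below is made up for the sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.5, 1, -1)

# kernel="poly", degree=5, gamma=1, coef0=1 gives K(a, b) = (a . b + 1)^5;
# C is the soft-margin trade-off parameter from the noisy-data QP.
clf = SVC(kernel="poly", degree=5, gamma=1.0, coef0=1.0, C=1.0)
clf.fit(X, y)

# Most alpha_k come out zero: only the support vectors are kept.
print("support vectors:", clf.support_vectors_.shape[0], "of", X.shape[0])
print("training accuracy:", clf.score(X, y))
```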
References

• Tutorial on VC-dimension and Support
  Vector Machines:
    C. Burges. A tutorial on support vector machines for
     pattern recognition. Data Mining and Knowledge
     Discovery, 2(2):121-167, 1998.
     http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
    Statistical Learning Theory by Vladimir Vapnik,
     Wiley-Interscience, 1998.


Editor's Notes

  1. Let x(1) and x(-1) be two support vectors, one from each class. Then b = -1/2 ( w^T x(1) + w^T x(-1) ).
  2. So, if we move points that are not support vectors, there is no effect on the decision boundary.
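A tiny numeric check of note 1 (illustrative values; the formula uses the w.x + b = ±1 description of the plus- and minus-planes from earlier in the deck):

```python
import numpy as np

w = np.array([2.0, 0.0])         # assumed weight vector
x_plus = np.array([1.0, 0.0])    # support vector on the plus-plane  w.x + b = +1
x_minus = np.array([0.0, 0.0])   # support vector on the minus-plane w.x + b = -1

b = -0.5 * (w @ x_plus + w @ x_minus)
print(b, w @ x_plus + b, w @ x_minus + b)    # -1.0  1.0  -1.0
```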