Submitted by:
Garisha Chowdhary,
MCSE 1st year,
Jadavpur University
A set of related supervised learning methods


Non-probabilistic binary linear classifier

Linear learners like perceptrons, but unlike them they use the concepts of
maximum margin, linearization, and kernel functions

Used for classification and regression analysis
Map non-linearly separable instances to higher dimensions to overcome linearity constraints

Select between hyperplanes, using the maximum margin as a test



[Figure: three candidate hyperplanes separating Class 1 from Class 2; the one giving a good separation is preferred]
Intuitively, a good separation is achieved by the hyperplane that has the
largest distance to the nearest training data point of any class

The larger the margin, the lower the generalization error (more
confident predictions)



[Figure: maximum-margin hyperplane separating Class 1 and Class 2]
Given N samples:
• {(x1,y1), (x2,y2), … , (xn,yn)}
• where yi = +1/-1 are the labels of the data and xi belongs to Rn

Find a hyperplane wTx + b = 0 such that:
• wTxi + b > 0 for all i such that yi = +1
• wTxi + b < 0 for all i such that yi = -1
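As a quick illustration, the separation constraints above can be checked directly; this is a minimal sketch, and the toy data and the candidate (w, b) are assumptions, not from the slides:

```python
# Minimal sketch with assumed toy data: check that a candidate hyperplane
# w^T x + b = 0 satisfies the separation constraints above.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])  # toy samples (assumed)
y = np.array([+1, +1, -1, -1])                                      # labels in {+1, -1}

w = np.array([1.0, 1.0])   # candidate normal vector (assumed, for illustration)
b = -1.0                   # candidate offset (assumed)

scores = X @ w + b                 # w^T x_i + b for every sample
print(np.all(y * scores > 0))      # True iff y_i (w^T x_i + b) > 0 for all i
```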


Functional Margin
• With respect to a training example, defined by ˆγ(i) = y(i)(wT x(i) + b)
• We want the functional margin to be large, i.e. y(i)(wT x(i) + b) >> 0
• We may rescale w and b without altering the decision function, but doing so
  multiplies the functional margin by the scale factor
• This lets us impose the normalization condition ||w|| = 1, i.e. replace (w, b)
  by (w/||w||, b/||w||) and consider its functional margin
• With respect to the training set, ˆγ = min ˆγ(i) over all i
Geometric margin
 • Defined by γ(i) = y(i)((w/||w||)T x(i) + b/||w||)
 • If ||w|| = 1, the functional margin equals the geometric margin
 • Invariant to scaling of the parameters w and b; w may be scaled such
   that ||w|| = 1
 • Also, γ = min γ(i) over all i
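A minimal sketch of how the functional and geometric margins of a training set would be computed, reusing the assumed toy data and candidate (w, b) from the sketch above:

```python
# Minimal sketch, assumed toy data: functional and geometric margins.
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), -1.0

functional = y * (X @ w + b)                  # functional margin of each example
geometric = functional / np.linalg.norm(w)    # geometric margin = functional / ||w||

# margins w.r.t. the whole training set are the minima over the examples
print(functional.min(), geometric.min())
```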
Now, the objective is to

Maximize γ w.r.t. γ, w, b s.t.
• y(i)(wTx(i) + b) >= γ for all i
• ||w|| = 1

or equivalently, maximize ˆγ/||w|| w.r.t. ˆγ, w, b s.t.
• y(i)(wTx(i) + b) >= ˆγ for all i

Introducing the scaling constraint that the functional margin be 1, the objective
may be further simplified to maximizing 1/||w||, or

Minimize (1/2)||w||^2 s.t.
• y(i)(wTx(i) + b) >= 1 for all i
Using the Lagrangian to solve this inequality-constrained optimization problem, we have

L = ½||w||^2 - Σαi(yi(wTxi + b) - 1)

Setting the gradient of L w.r.t. w and b to 0, we have

w = Σαiyixi ,        Σαiyi = 0

Substituting w into L, we get the corresponding dual of the primal problem:

maximize W(α) = Σαi - ½ΣΣαiαjyiyjxiTxj , s.t. αi >= 0 , Σαiyi = 0

Solve for α and recover

w = Σαiyixi ,  b* = -(max{i: y(i)=-1} wTx(i) + min{i: y(i)=1} wTx(i))/2
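A minimal sketch of this procedure, not the author's code: the dual is a quadratic program, solved here with a generic constrained solver (scipy's SLSQP) on an assumed toy data set, after which w and b* are recovered with the formulas above.

```python
# Minimal sketch (assumed toy data, generic QP solver): solve the hard-margin dual,
# then recover w and b* as above.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T        # G_ij = y_i y_j x_i^T x_j

def neg_W(a):                                    # minimizing -W(alpha) maximizes W(alpha)
    return 0.5 * a @ G @ a - a.sum()

cons = ({'type': 'eq', 'fun': lambda a: a @ y},) # sum_i alpha_i y_i = 0
bounds = [(0, None)] * n                         # alpha_i >= 0
alpha = minimize(neg_W, np.zeros(n), method='SLSQP',
                 bounds=bounds, constraints=cons).x

w = (alpha * y) @ X                                      # w = sum_i alpha_i y_i x_i
b = -(max(X[y == -1] @ w) + min(X[y == +1] @ w)) / 2     # b* as in the formula above
print(alpha.round(3), w.round(3), round(b, 3))
```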
For the conversion of the primal problem to the dual problem, the
following Karush-Kuhn-Tucker (KKT) conditions must be satisfied:
  •   (∂/∂wi) L(w, α) = 0, i = 1, . . . , n
  •   αi gi(w, b) = 0, i = 1, . . . , k
  •   gi(w, b) <= 0, i = 1, . . . , k
  •   αi >= 0

From the KKT complementarity condition (the 2nd above):

• αi > 0 => gi(w, b) = 0 (active constraint) => (x(i), y(i)) has functional margin 1
  (support vectors)
• gi(w, b) < 0 => αi = 0 (inactive constraint; non-support vectors)
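In practice this complementarity shows up as a sparse α vector; a minimal sketch of picking out the support vectors (the α values are assumed, e.g. taken from the dual sketch above):

```python
# Minimal sketch, assumed alpha values: support vectors are the examples with
# alpha_i > 0 (up to a numerical tolerance); all other alpha_i are 0.
import numpy as np

alpha = np.array([0.31, 0.0, 0.31, 0.0])        # assumed example values
support_idx = np.flatnonzero(alpha > 1e-8)      # active constraints => support vectors
print(support_idx)
```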

[Figure: separating hyperplane between Class 1 and Class 2 with the support vectors lying on the margin]
In the case of non-linearly separable data, mapping the data to a high-dimensional
feature space via a non-linear mapping function φ increases the likelihood that
the data becomes linearly separable

A kernel function, which corresponds to the dot product of some non-linear
mapping of the data, is used to simplify computations over the high-dimensional
mapped data

Having found the αi, classifying a test point x requires only a quantity that
depends on the inner products between x and the support vectors
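A minimal sketch of such a kernelized decision function f(x) = Σ over support vectors of αi yi K(xi, x) + b; the function and variable names and the tiny usage data are assumptions:

```python
# Minimal sketch, assumed names: the decision value depends only on inner products
# (through K) between the test point and the support vectors.
import numpy as np

def decision(x, sv_X, sv_y, sv_alpha, b, kernel=lambda a, c: float(np.dot(a, c))):
    # default kernel is the plain dot product; any valid kernel K can be swapped in
    return sum(a_i * y_i * kernel(x_i, x)
               for x_i, y_i, a_i in zip(sv_X, sv_y, sv_alpha)) + b

# tiny usage with assumed support vectors; the predicted label is the sign
sv_X = [np.array([2.0, 2.0]), np.array([-1.0, -1.0])]
sv_y = [+1, -1]
sv_alpha = [0.2, 0.2]
print(np.sign(decision(np.array([1.0, 1.0]), sv_X, sv_y, sv_alpha, b=-0.5)))
```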


The kernel function is a measure of similarity between the two vectors


A kernel function is valid if it satisfies Mercer's theorem, which states
that the corresponding kernel matrix must be symmetric positive semi-definite
(zTKz >= 0 for all z)
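A minimal empirical sketch of this check (the sample points and the choice of RBF kernel are assumptions): build the kernel matrix on a sample and verify symmetry and positive semi-definiteness via its eigenvalues.

```python
# Minimal sketch, assumed sample and kernel: empirical Mercer check on a Gram matrix.
import numpy as np

X = np.random.default_rng(0).normal(size=(20, 3))                    # assumed sample
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2.0)   # RBF Gram matrix

print(np.allclose(K, K.T))                        # symmetric?
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # z^T K z >= 0 <=> eigenvalues >= 0
```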
Polynomial kernel with degree d
• K(x,y) = (xTy + 1)^d

Radial basis function kernel with width σ
• K(x,y) = exp(-||x-y||^2 / (2σ^2))
• Feature space is infinite dimensional

Sigmoid kernel with parameters κ and θ
• K(x,y) = tanh(κ xTy + θ)
• It does not satisfy the Mercer condition for all κ and θ
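Minimal sketches of the three kernels above; the parameter names d, sigma, kappa, theta follow the text, and the default values are assumptions for illustration:

```python
# Minimal sketches of the polynomial, RBF, and sigmoid kernels; default parameter
# values are assumptions, not recommendations.
import numpy as np

def poly_kernel(x, y, d=3):
    return (x @ y + 1) ** d                                         # (x^T y + 1)^d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))   # exp(-||x-y||^2 / (2 sigma^2))

def sigmoid_kernel(x, y, kappa=0.1, theta=-1.0):
    return np.tanh(kappa * (x @ y) + theta)                         # tanh(kappa x^T y + theta)
```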
High dimensionality doesn't guarantee linear separation; the hyperplane might be
susceptible to outliers

Relax the constraints by introducing 'slack variables' ξi that allow the
constraints to be violated by a small quantity

Penalize the objective function for violations

A parameter C controls the trade-off between the penalty and the margin

So the objective now becomes: minimize over w, b, ξ: (1/2)||w||^2 + C Σξi s.t.
y(i)(wTx(i) + b) >= 1 - ξi , ξi >= 0

This tries to ensure that most examples have a functional margin of at least 1

Forming the corresponding Lagrangian, the dual problem now is to:
maximize over α: Σαi - ½ΣΣαiαjyiyjxiTxj , s.t. 0 <= αi <= C , Σαiyi = 0
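A minimal sketch of how C trades off margin width against the penalty for violations; scikit-learn and the toy data are assumptions, since the slides do not name a library:

```python
# Minimal sketch, assumed toy data and scikit-learn: the effect of C on a soft-margin SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [0.5, 0.4]])
y = np.array([+1, +1, -1, -1, -1])           # last point sits close to the other class

for C in (0.1, 1.0, 100.0):                  # small C: wider margin, more tolerated violations
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.support_, clf.coef_.round(3), clf.intercept_.round(3))
```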
[Figure: soft-margin hyperplane separating Class 1 and Class 2, with some margin violations allowed]


                               Parameter Selection

• The effectiveness of an SVM depends on the selection of the kernel, the kernel
  parameters, and the parameter C
• A common choice is the Gaussian (RBF) kernel, which has a single parameter γ
• The best combination of C and γ is often selected by a grid search over
  exponentially increasing sequences of C and γ
• Each combination is checked using cross-validation and the one with the best
  accuracy is chosen
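A minimal sketch of such a grid search; scikit-learn and the stand-in data set are assumptions:

```python
# Minimal sketch, assumed library and data: exponentially spaced grid over C and gamma,
# each combination scored by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)      # stand-in data (assumed)
param_grid = {'C': [2 ** k for k in range(-5, 16, 2)],
              'gamma': [2 ** k for k in range(-15, 4, 2)]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)      # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)
```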
Drawbacks
• Cannot be directly applied to
  multiclass problems; algorithms that
  decompose the multiclass problem into
  multiple binary classification
  problems are needed
• Class membership probabilities
  are uncalibrated
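A minimal sketch of the usual workarounds, decomposing a multiclass problem into binary ones and obtaining probabilities via Platt scaling; scikit-learn and the Iris data set are assumptions:

```python
# Minimal sketch, assumed library/data: one-vs-rest decomposition for multiclass,
# and probability estimates via Platt scaling (probability=True).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                              # 3-class problem
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)         # one binary SVM per class
proba = SVC(probability=True).fit(X, y).predict_proba(X[:3])   # Platt-scaled probabilities
print(ovr.predict(X[:3]), proba.round(2))
```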