Support Vector Machine



  1. Submitted by: Garisha Chowdhary, MCSE 1st year, Jadavpur University
  2. • A set of related supervised learning methods
     • Non-probabilistic binary linear classifier
     • Linear learners like perceptrons, but unlike them they use the concepts of maximum margin, linearization and kernel functions
     • Used for classification and regression analysis
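
As a minimal sketch of this idea in practice (assuming scikit-learn and a made-up toy dataset; parameter choices are illustrative, not from the slides), a binary SVM classifier can be trained and queried like this:

```python
# Minimal sketch: a linear SVM as a binary classifier.
# The toy points and parameter values are assumptions made for illustration.
import numpy as np
from sklearn.svm import SVC

# Two tiny clusters labelled -1 and +1
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # class -1
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)   # linear kernel, default soft-margin penalty
clf.fit(X, y)

print(clf.predict([[1.2, 1.3], [5.8, 5.9]]))  # expected: [-1  1]
```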
  3. • Map non-linearly separable instances to higher dimensions to overcome linearity constraints
     • Select between hyperplanes, using maximum margin as the test of a good separation
     [Figure: three plots of Class 1 vs. Class 2, comparing candidate separating hyperplanes]
  4. • Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class
     • The larger the margin, the lower the generalization error (more confident predictions)
     [Figure: Class 1 and Class 2 separated by a maximum-margin hyperplane]
  5. • Given N training samples (xᵢ, yᵢ), i = 1, …, N, where yᵢ ∈ {+1, -1} are the labels and xᵢ ∈ Rⁿ
     • Find a hyperplane wᵀx + b = 0 such that:
       • wᵀxᵢ + b > 0 for all i such that yᵢ = +1
       • wᵀxᵢ + b < 0 for all i such that yᵢ = -1
     • Functional margin w.r.t. a training example: ˆγᵢ = yᵢ(wᵀxᵢ + b)
     • We want the functional margin to be large, i.e. yᵢ(wᵀxᵢ + b) >> 0
     • We may rescale w and b without altering the decision function, but doing so multiplies the functional margin by the same scale factor
     • This allows us to impose the normalization condition ||w|| = 1, i.e. to consider the functional margin of (w/||w||, b/||w||)
     • Functional margin w.r.t. the training set: ˆγ = minᵢ ˆγᵢ
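
These definitions are a couple of lines of numpy. In the sketch below the weight vector, bias and data points are assumptions chosen only to illustrate the formulas:

```python
# Sketch: functional and geometric margins for a fixed (w, b); values are assumed.
import numpy as np

w = np.array([1.0, -1.0])   # assumed weight vector
b = -0.5                    # assumed bias
X = np.array([[2.0, 0.5], [0.0, 1.5], [3.0, 1.0]])
y = np.array([1, -1, 1])

functional = y * (X @ w + b)                   # per-example functional margin yᵢ(wᵀxᵢ + b)
geometric = functional / np.linalg.norm(w)     # per-example geometric margin

print(functional.min(), geometric.min())       # margins w.r.t. the whole training set
```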
  6. • Geometric margin: γᵢ = yᵢ((w/||w||)ᵀxᵢ + b/||w||)
     • If ||w|| = 1, the functional margin equals the geometric margin
     • Invariant to scaling of the parameters w and b; w may always be scaled so that ||w|| = 1
     • Geometric margin w.r.t. the training set: γ = minᵢ γᵢ
     • The objective is to maximize γ over (γ, w, b) subject to yᵢ(wᵀxᵢ + b) >= γ for all i and ||w|| = 1;
       equivalently, maximize ˆγ/||w|| over (ˆγ, w, b) subject to yᵢ(wᵀxᵢ + b) >= ˆγ for all i
     • Introducing the scaling constraint that the functional margin be 1, the objective simplifies to maximizing 1/||w||, i.e.:
       minimize (1/2)||w||² subject to yᵢ(wᵀxᵢ + b) >= 1 for all i
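
This final primal form is a small quadratic program. As a sketch only (assuming scipy and a made-up separable toy set; real SVM solvers use specialized QP/SMO methods rather than a generic optimizer), it can be solved directly:

```python
# Sketch: solve the hard-margin primal  min (1/2)||w||²  s.t.  yᵢ(wᵀxᵢ + b) >= 1
# with a generic constrained optimizer. The data and starting point are assumptions.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, 1, 1])

def objective(params):                  # params = [w1, w2, b]
    w = params[:2]
    return 0.5 * w @ w

constraints = [{"type": "ineq",         # SLSQP convention: "ineq" means fun(params) >= 0
                "fun": lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margins =", y * (X @ w + b))   # all margins should be >= 1
```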
  7. • Using the Lagrangian to solve this inequality-constrained optimization problem, we have
       L = (1/2)||w||² − Σᵢ αᵢ(yᵢ(wᵀxᵢ + b) − 1)
     • Setting the gradient of L w.r.t. w and b to 0 gives w = Σᵢ αᵢyᵢxᵢ and Σᵢ αᵢyᵢ = 0
     • Substituting w back into L gives the dual of the primal problem:
       maximize W(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ, subject to αᵢ >= 0 and Σᵢ αᵢyᵢ = 0
     • Solve for α and recover w = Σᵢ αᵢyᵢxᵢ and b* = −( max{i: yᵢ=−1} wᵀxᵢ + min{i: yᵢ=+1} wᵀxᵢ ) / 2
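
Once α is known, recovering (w, b) is a direct transcription of the last bullet. In the sketch below the α values are supplied by hand (they are the hard-margin solution for the same toy set used above, with only the two support vectors getting nonzero α) rather than computed by a dual solver:

```python
# Sketch: recover (w, b) from dual variables alpha. The alpha values are given,
# not solved for here; the data matches the toy primal sketch above.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, 1, 1])
alpha = np.array([0.0, 2 / 13, 2 / 13, 0.0])   # nonzero only for the two support vectors

w = (alpha * y) @ X                                              # w = Σ αᵢ yᵢ xᵢ
b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2        # b* from the slide
print(w, b, y * (X @ w + b))                                     # support vectors get margin 1
```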
  8. • For conversion of the primal problem to the dual problem, the following Karush-Kuhn-Tucker (KKT) conditions must be satisfied:
       • (∂/∂wᵢ) L(w, α) = 0, i = 1, …, n
       • αᵢ gᵢ(w, b) = 0, i = 1, …, k
       • gᵢ(w, b) <= 0, i = 1, …, k
       • αᵢ >= 0, i = 1, …, k
     • From the KKT complementary-slackness condition (the 2nd one):
       • αᵢ > 0 ⇒ gᵢ(w, b) = 0 (active constraint) ⇒ (xᵢ, yᵢ) has functional margin 1 (a support vector)
       • gᵢ(w, b) < 0 ⇒ αᵢ = 0 (inactive constraint, non-support vectors)
     [Figure: Class 1 and Class 2 with the support vectors lying on the margin]
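
Fitted solvers expose exactly the points with αᵢ > 0. As a small sketch (assuming scikit-learn and the same toy data; a very large C approximates the hard margin):

```python
# Sketch: inspect which training points end up as support vectors (alpha_i > 0).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.5]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.support_)          # indices of the support vectors
print(clf.support_vectors_)  # the support vectors themselves
print(clf.dual_coef_)        # y_i * alpha_i for each support vector
```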
  9. • In the case of non-linearly separable data, mapping the data to a high-dimensional feature space via a non-linear mapping function φ increases the likelihood that the data is linearly separable
     • A kernel function simplifies computations over the high-dimensional mapped data: it corresponds to the dot product of some non-linear mapping of the data
     • Having found the αᵢ, classifying a test point x depends only on the inner products between x and the support vectors
     • The kernel function is a measure of similarity between the two vectors
     • A kernel function is valid if it satisfies Mercer's theorem, which states that the corresponding kernel matrix must be symmetric positive semi-definite (zᵀKz >= 0)
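
The Mercer condition can be sanity-checked numerically by building a kernel matrix and inspecting its eigenvalues. The RBF kernel, bandwidth and random points below are assumptions chosen for illustration:

```python
# Sketch: check that a kernel matrix is symmetric positive semi-definite (Mercer condition).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sigma = 1.0                                      # assumed RBF bandwidth

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma**2))           # RBF kernel matrix

symmetric = np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)                  # eigenvalues of a symmetric matrix
print(symmetric, eigvals.min() >= -1e-10)        # PSD: all eigenvalues >= 0 (up to round-off)
```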
  10. • Polynomial kernel with degree d:
        K(x, y) = (xᵀy + 1)^d
      • Radial basis function (RBF) kernel with width σ:
        K(x, y) = exp(-||x − y||² / (2σ²))
        Its feature space is infinite dimensional
      • Sigmoid kernel with parameters κ and θ:
        K(x, y) = tanh(κ xᵀy + θ)
        It does not satisfy the Mercer condition for all κ and θ
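
Transcribed directly into numpy, these three kernels are a few lines each; the parameter values in the demo call are arbitrary assumptions:

```python
# Sketch: the three kernels above, written directly from their formulas.
import numpy as np

def polynomial_kernel(x, y, d=3):
    return (x @ y + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def sigmoid_kernel(x, y, kappa=0.5, theta=-1.0):
    return np.tanh(kappa * (x @ y) + theta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```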
  11. • High dimensionality doesn't guarantee linear separation; the hyperplane might be susceptible to outliers
      • Relax the constraints by introducing slack variables ξᵢ that allow a constraint to be violated by a small quantity
      • Penalize the objective function for violations
      • The parameter C controls the trade-off between the penalty and the margin
      • The objective now becomes: min over (w, b, ξ) of (1/2)||w||² + C Σᵢ ξᵢ, subject to yᵢ(wᵀxᵢ + b) >= 1 − ξᵢ and ξᵢ >= 0
      • This tries to ensure that most examples have functional margin at least 1
      • Forming the corresponding Lagrangian, the dual problem now is:
        max over α of Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ, subject to 0 <= αᵢ <= C and Σᵢ αᵢyᵢ = 0
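
The effect of C can be seen by counting training points whose functional margin y·f(x) falls below 1 (i.e. points with nonzero slack). The noisy toy data and the particular C values below are assumptions for illustration:

```python
# Sketch: effect of the soft-margin parameter C on margin violations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(2.5, 1.0, size=(30, 2))])   # two overlapping clusters
y = np.array([-1] * 30 + [1] * 30)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins = y * clf.decision_function(X)            # functional margin of each training point
    print("C =", C,
          " margin violations:", int((margins < 1).sum()),
          " support vectors:", len(clf.support_))
```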
  12. [Figure: decision boundary between Class 1 and Class 2]
      Parameter selection
      • The effectiveness of an SVM depends on the selection of the kernel, the kernel parameters and the parameter C
      • A common choice is the Gaussian kernel, which has a single parameter γ
      • The best combination of C and γ is often selected by a grid search over exponentially increasing sequences of C and γ
      • Each combination is checked using cross-validation, and the one with the best accuracy is chosen
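
A sketch of such a grid search with scikit-learn; the exponentially spaced grids, the digits dataset and the 5-fold cross-validation are illustrative assumptions:

```python
# Sketch: grid search over exponentially spaced C and gamma with cross-validation.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_grid = {"C": 10.0 ** np.arange(-2, 4),       # 0.01 ... 1000
              "gamma": 10.0 ** np.arange(-5, 0)}   # 1e-5 ... 0.1
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)     # best (C, gamma) and its CV accuracy
```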
  13. Drawbacks
      • Cannot be applied directly to multiclass problems; it needs algorithms that convert the multiclass problem into multiple binary-class problems
      • Produces uncalibrated class membership probabilities
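
Both drawbacks are usually handled by wrappers around the binary SVM. The sketch below (the iris dataset is an illustrative assumption) shows an explicit one-vs-rest decomposition and scikit-learn's probability option, which fits a calibration step on top of the raw margins; note that scikit-learn's SVC also handles multiclass data internally via one-vs-one:

```python
# Sketch: multiclass decomposition and probability estimates for an SVM.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One-vs-rest: one binary SVM per class, combined into a multiclass classifier.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)
print(ovr.predict(X[:3]))

# probability=True adds calibrated class probabilities via internal cross-validation.
prob_svm = SVC(kernel="rbf", probability=True).fit(X, y)
print(prob_svm.predict_proba(X[:3]).round(3))
```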