Support Vector Machine
Shao-Chuan Wang
Support Vector Machine
1D Classification Problem: how will you separate these data? (H1, H2, H3?)
[Figure: one-dimensional data on the x axis with three candidate decision boundaries H1, H2, H3.]
Support Vector Machine
2D Classification Problem: which H is better?
Max-Margin Classifier
Functional margin and geometric margin. We feel more confident when the functional margin is larger. Note that rescaling w and b won’t change the plane.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
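The margin formulas on this slide were lost in extraction; as defined in the cited CS229 notes, for a training example (x^{(i)}, y^{(i)}):

  functional margin:  \hat{\gamma}^{(i)} = y^{(i)} \left( w^T x^{(i)} + b \right)
  geometric margin:   \gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\|w\|} \right)^T x^{(i)} + \frac{b}{\|w\|} \right)

The geometric margin is invariant to rescaling (w, b), which is why the scaling can be fixed freely on the next slide.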
Maximize Margins
Optimization problem: maximize the minimal geometric margin under the constraints. Introduce a scaling factor such that the functional margin equals 1.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
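Restoring the optimization problem from the cited notes: the raw problem and its scaled, convex form are

  \max_{\gamma, w, b} \; \gamma   s.t.  y^{(i)} (w^T x^{(i)} + b) \ge \gamma,  \|w\| = 1

and, after imposing \hat{\gamma} = 1,

  \min_{w, b} \; \frac{1}{2} \|w\|^2   s.t.  y^{(i)} (w^T x^{(i)} + b) \ge 1,  i = 1, \dots, m.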
Optimization Problem Subject to Constraints
Maximize f(x, y), subject to the constraint g(x, y) = c -> the Lagrange multiplier method.
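For concreteness, the standard first-order condition of the method: at a constrained optimum the gradients of objective and constraint are parallel,

  \nabla f(x, y) = \lambda \nabla g(x, y),

so one solves \nabla_{x, y, \lambda} \Lambda = 0 for the Lagrangian \Lambda(x, y, \lambda) = f(x, y) - \lambda \, (g(x, y) - c).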
Lagrange Duality
The primal optimization problem; the generalized Lagrangian; the primal problem in equivalent min-max form; the dual optimization problem.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
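Restored from the cited notes, the four objects the slide names:

  primal:      \min_w f(w)   s.t.  g_i(w) \le 0,  h_i(w) = 0
  Lagrangian:  \mathcal{L}(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w)
  primal (equivalent form):  p^* = \min_w \max_{\alpha, \beta : \, \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)
  dual:        d^* = \max_{\alpha, \beta : \, \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta)

In general d^* \le p^* (weak duality).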
Dual Problem
Conditions under which equality d* = p* holds: f and the g_i are convex, and the h_i are affine (with the g_i constraints strictly feasible). At the optimum, the KKT conditions are then satisfied.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
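The KKT conditions themselves, from the cited notes: w^*, \alpha^*, \beta^* are primal/dual optimal iff

  \frac{\partial}{\partial w_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0
  \frac{\partial}{\partial \beta_i} \mathcal{L}(w^*, \alpha^*, \beta^*) = 0
  \alpha_i^* \, g_i(w^*) = 0   (complementary slackness)
  g_i(w^*) \le 0,  \alpha_i^* \ge 0

Complementary slackness is what makes only the support vectors carry nonzero \alpha_i.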
Optimal Margin Classifiers
The primal problem, its Lagrangian, and its dual problem.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
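Restored from the cited notes:

  primal:      \min_{w, b} \frac{1}{2} \|w\|^2   s.t.  y^{(i)} (w^T x^{(i)} + b) \ge 1
  Lagrangian:  \mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \alpha_i \left[ y^{(i)} (w^T x^{(i)} + b) - 1 \right]
  dual:        \max_\alpha W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i, j} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle
               s.t.  \alpha_i \ge 0,  \sum_{i=1}^m \alpha_i y^{(i)} = 0

The dual depends on the data only through inner products \langle x^{(i)}, x^{(j)} \rangle, which is what makes the kernel trick on the following slides possible.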
Support Vector Machine (cont’d)
If the data are not linearly separable, we can find a nonlinear solution. Technically, it is a linear solution in a higher-dimensional feature space: the kernel trick.
Kernel and Feature Mapping
Kernel: positive semi-definite and symmetric; K(x, z) = ⟨φ(x), φ(z)⟩ for a feature mapping φ. Loose intuition: a kernel measures the “similarity” between features.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
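A concrete instance of the slide’s lost “for example”: the kernel K(x, z) = (x^T z)^2 corresponds to an explicit feature map, which can be checked numerically. A minimal sketch in NumPy (input values are illustrative):

import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x^T z)^2 with 2-D inputs:
    # phi(x) = (x1*x1, x1*x2, x2*x1, x2*x2)
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
lhs = (x @ z) ** 2           # kernel computed directly, O(n) work
rhs = phi(x) @ phi(z)        # inner product in feature space, O(n^2) work
assert np.isclose(lhs, rhs)  # both equal 16.0

The point of the trick: the left-hand side never materializes phi(x), which matters when the feature space is huge or infinite-dimensional (e.g. the Gaussian kernel).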
Soft Margin (L1 Regularization)
C = ∞ leads to the hard-margin SVM; Rychetsky (2001).
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
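The soft-margin objective from the cited notes, restored: slack variables \xi_i relax the margin constraints, and C trades margin width against violations,

  \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i
  s.t.  y^{(i)} (w^T x^{(i)} + b) \ge 1 - \xi_i,  \xi_i \ge 0.

In the dual, the only change is the box constraint 0 \le \alpha_i \le C, so C = ∞ indeed recovers the hard margin.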
Why doesn’t my model fit well on test data?
Bias/Variance Tradeoff
Underfitting (high bias) vs. overfitting (high variance). Training error = in-sample error; generalization error = out-of-sample error.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
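The two error definitions the slide contrasts, restored from the CS229 learning-theory notes: for a hypothesis h and m training samples drawn from distribution \mathcal{D},

  training error:        \hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^m 1\{ h(x^{(i)}) \ne y^{(i)} \}
  generalization error:  \varepsilon(h) = P_{(x, y) \sim \mathcal{D}} \left( h(x) \ne y \right)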
Bias/Variance Tradeoff
[Figure from the cited text: training and test error as a function of model complexity.]
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2001.
Is training error a good estimator of generalization error?
Chernoff Bound (|H| = finite)
Lemma: assume Z_1, Z_2, …, Z_m are drawn iid from Bernoulli(φ), let φ̂ be their mean, and let γ > 0 be fixed. Then the deviation probability is bounded as below; based on this lemma, one can show a generalization bound that holds with probability 1 − δ (k = number of hypotheses).
Andrew Ng. Part VI: Learning Theory. CS229 Lecture Notes (2008).
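The two results, restored from the cited notes:

  \hat{\varphi} = \frac{1}{m} \sum_{i=1}^m Z_i, \qquad P\left( |\varphi - \hat{\varphi}| > \gamma \right) \le 2 \exp(-2 \gamma^2 m)

and, applying this uniformly over a finite hypothesis class of size k, with probability at least 1 − δ,

  \varepsilon(\hat{h}) \le \left( \min_{h \in H} \varepsilon(h) \right) + 2 \sqrt{ \frac{1}{2m} \log \frac{2k}{\delta} }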
Chernoff Bound (|H| = infinite)
VC dimension d: the size of the largest set that H can shatter. E.g., for H = linear classifiers in 2-D, VC(H) = 3. With probability at least 1 − δ, the bound below holds.
Andrew Ng. Part VI: Learning Theory. CS229 Lecture Notes (2008).
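Restored from the cited notes, with d = VC(H) and m training samples:

  \varepsilon(\hat{h}) \le \varepsilon(h^*) + O\left( \sqrt{ \frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta} } \right)

So for a meaningful guarantee, the number of training samples should grow roughly linearly with the VC dimension.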
Model Selection
Cross validation: an estimator of generalization error.
K-fold: train on k − 1 pieces, test on the remaining piece (this yields one test-error estimate). Average the k test-error estimates, say to 2%; then 2% is the estimate of generalization error for this learner.
Leave-one-out cross validation: m-fold, with m = training sample size.
[Figure: the training set split into folds, one fold held out for validation in turn.]
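A minimal sketch of the k-fold estimate described above, using scikit-learn rather than the LIBSVM tooling the slides use (dataset and parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# 5-fold CV: train on 4 folds, validate on the held-out fold, 5 times.
scores = cross_val_score(SVC(kernel="rbf", C=2.0), X, y, cv=5)
# The mean validation error is the estimate of generalization error.
print("estimated generalization error: %.1f%%" % (100 * (1 - scores.mean())))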
Model Selection
Loop over possible parameters: pick one parameter setting, e.g. C = 2.0; do cross validation to get an error estimate; pick C_best (the value with the minimal error estimate) as the parameter.
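The same selection loop, sketched in Python (candidate values of C are illustrative; X, y as in the previous sketch):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
best_C, best_err = None, 1.0
for C in [0.5, 1.0, 2.0, 4.0, 8.0]:  # loop over candidate parameters
    err = 1 - cross_val_score(SVC(C=C), X, y, cv=5).mean()
    if err < best_err:               # keep the C with minimal CV error
        best_C, best_err = C, err
print("C_best =", best_C)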
Multiclass SVM
One against one: there are k(k − 1)/2 binary SVMs (1v2, 1v3, …). To predict, each SVM votes between its 2 classes; the class with the most votes wins.
One against all: there are k binary SVMs (1 v rest, 2 v rest, …). To predict, evaluate each decision value w_i^T x + b_i and pick the largest.
Multiclass SVM by solving ONE optimization problem: Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.
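A sketch contrasting the two reductions with scikit-learn’s wrappers (an assumption: the slide describes the general schemes, not this library; the dataset is illustrative):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                  # k = 3 classes
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # trains k(k-1)/2 = 3 binary SVMs
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # trains k = 3 binary SVMs
print(len(ovo.estimators_), len(ovr.estimators_))  # -> 3 3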
Multiclass SVM (2/2)
DAGSVM (Directed Acyclic Graph SVM): trains the same k(k − 1)/2 pairwise classifiers as one-against-one, but predicts by walking a rooted directed acyclic graph, so only k − 1 of them are evaluated per test point.
An Example: Image Classification
Process (K = 6 classes): represent each image as a labeled feature vector in sparse “label index:value …” form (e.g. “1 0:49 1:25 …”), train the SVM on the training data, then report accuracy on the test data.
[Figure: the processing pipeline from images to feature files to test accuracy.]
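A minimal sketch of writing features in the sparse “label index:value” form shown on the slide (feature values are hypothetical):

# Write one "label index:value ..." line per image, as on the slide.
samples = [(1, [49, 25]), (2, [12, 7])]  # (class label, feature vector)
with open("train.txt", "w") as f:
    for label, feats in samples:
        pairs = " ".join("%d:%d" % (i, v) for i, v in enumerate(feats))
        f.write("%d %s\n" % (label, pairs))
# -> lines like "1 0:49 1:25" and "2 0:12 1:7"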