# Support Vector Machine


1. **Support Vector Machine**: Shao-Chuan Wang.
2. **Support Vector Machine**: a 1-D classification problem. How would you separate these data (H1, H2, or H3)? [Figure: candidate decision thresholds H1, H2, H3 along the x-axis.]
3. **Support Vector Machine**: a 2-D classification problem. Which hyperplane H is better?
4. **Max-Margin Classifier**: functional margin $\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b)$; geometric margin $\gamma^{(i)} = y^{(i)}\left(\frac{w^T x^{(i)} + b}{\|w\|}\right)$. We feel more confident when the functional margin is larger. Note that scaling w and b does not change the plane. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
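The two margins on slide 4 can be checked numerically. A minimal sketch (the point, labels, and weights below are made up for illustration): scaling (w, b) by a constant scales the functional margin but leaves the geometric margin, and the plane itself, unchanged.

```python
import numpy as np

def functional_margin(w, b, x, y):
    # Functional margin: y * (w^T x + b)
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    # Geometric margin: functional margin normalized by ||w||
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([2.0, 1.0]), -1.0
x, y = np.array([1.0, 1.0]), 1        # a positively labeled point

# Rescale (w, b) by c: the plane w^T x + b = 0 is unchanged.
c = 10.0
fm, fm_scaled = functional_margin(w, b, x, y), functional_margin(c * w, c * b, x, y)
gm, gm_scaled = geometric_margin(w, b, x, y), geometric_margin(c * w, c * b, x, y)
```

This is why the geometric margin, not the functional margin, is the quantity worth maximizing.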
5. **Maximize Margins**: optimization problem: maximize the minimal geometric margin subject to the constraints. Introduce a scaling factor such that $\hat{\gamma} = 1$, which turns the problem into $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1$. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
6. **Optimization subject to constraints**: maximize f(x, y) subject to the constraint g(x, y) = c. This calls for the Lagrange multiplier method.
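A tiny worked instance of slide 6 (the particular f and g are my own example, not from the deck): maximize f(x, y) = x + y subject to x² + y² = 1. The Lagrange conditions ∇f = λ∇g give 1 = 2λx and 1 = 2λy, so x = y, and the constraint forces x = y = 1/√2 with maximum √2. A numeric check by parametrizing the constraint circle:

```python
import numpy as np

# Parametrize the constraint set x^2 + y^2 = 1 as (cos t, sin t)
# and maximize f(x, y) = x + y over it by a dense grid search.
theta = np.linspace(0.0, 2.0 * np.pi, 100001)
x, y = np.cos(theta), np.sin(theta)
f = x + y
best = np.argmax(f)
x_star, y_star, f_star = x[best], y[best], f[best]
# The Lagrange solution predicts x_star = y_star = 1/sqrt(2), f_star = sqrt(2).
```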
7. **Lagrange Duality**: primal optimization problem; generalized Lagrangian $\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w)$; equivalent primal form $\min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} \mathcal{L}$; dual problem $\max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w \mathcal{L}$. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
8. **Dual Problem**: the primal and dual optima coincide when f and the g_i are convex and the h_i are affine; at such an optimum the KKT conditions hold. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
9. **Optimal Margin Classifiers**: its Lagrangian, and its dual problem $\max_\alpha W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$ subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i y^{(i)} = 0$. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
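The dual of slide 9 can be solved by hand on a two-point toy set (my own example, not from the deck): with x = +1 labeled +1 and x = −1 labeled −1, the constraint Σαᵢyᵢ = 0 forces α₁ = α₂ = a, and W reduces to W(a) = 2a − 2a², maximized at a = 1/2, giving w = Σαᵢyᵢxᵢ = 1. A projected gradient ascent sketch on that reduced dual:

```python
import numpy as np

X = np.array([1.0, -1.0])   # toy 1-D inputs
Y = np.array([1.0, -1.0])   # their labels

a = 0.0
lr = 0.05
for _ in range(1000):
    grad = 2.0 - 4.0 * a           # dW/da for W(a) = 2a - 2a^2
    a = max(0.0, a + lr * grad)    # ascent step, projected onto a >= 0

w = np.sum(a * Y * X)              # recover w = sum_i alpha_i y_i x_i
```

Both points end up as support vectors with geometric margin 1/‖w‖ = 1, which matches the picture of the maximum-margin separator at x = 0.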
10. **Support Vector Machine (cont'd)**: if the data are not linearly separable, we can find a nonlinear solution; technically, it is a linear solution in a higher-dimensional feature space. This is the kernel trick.
11. **Kernel and Feature Mapping**: a kernel $K(x, z) = \phi(x)^T \phi(z)$ is positive semi-definite and symmetric; for example, $K(x, z) = (x^T z)^2$. Loose intuition: a kernel measures the "similarity" between features. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
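The example kernel on slide 11 can be verified directly: in 2-D, K(x, z) = (xᵀz)² corresponds to the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel computes the same inner product without ever forming φ. The test vectors below are arbitrary:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x^T z)^2 in 2-D
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    # The kernel: same inner product, computed implicitly
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
k_implicit = K(x, z)
k_explicit = np.dot(phi(x), phi(z))
```

The payoff is that K costs O(n) per evaluation while φ lives in an O(n²)-dimensional space, which is the whole point of the kernel trick.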
12. **Soft Margin (L1 regularization)**: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. C = ∞ leads to the hard-margin SVM. (Rychetsky, 2001; Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
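The soft-margin objective is equivalent to minimizing ½‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)), the hinge-loss form, which can be attacked by subgradient descent. A minimal sketch on a made-up 1-D dataset (the data, learning rate, and iteration count are illustrative choices, not the deck's):

```python
import numpy as np

# Hinge-loss form of the soft-margin SVM, 1-D toy data
X = np.array([-2.0, -1.0, 1.0, 2.0])
Y = np.array([-1.0, -1.0, 1.0, 1.0])
C, lr = 1.0, 0.01

w, b = 0.0, 0.0
for _ in range(2000):
    margins = Y * (w * X + b)
    viol = margins < 1.0                      # points violating the margin
    gw = w - C * np.sum(Y[viol] * X[viol])    # subgradient wrt w
    gb = -C * np.sum(Y[viol])                 # subgradient wrt b
    w -= lr * gw
    b -= lr * gb

preds = np.sign(w * X + b)
```

With a finite C, margin violations are traded off against ‖w‖ instead of being forbidden outright, which is what makes the method usable on non-separable data.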
13. **Why doesn't my model fit well on test data?**
14. **Bias/Variance Tradeoff**: underfitting (high bias) vs. overfitting (high variance). Training error is the in-sample error; generalization error is the out-of-sample error. (Andrew Ng, Part V: Support Vector Machines, CS229 Lecture Notes, 2008.)
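One way to see why training error alone misleads (a sketch with made-up data, not the deck's figure): for nested polynomial models, training error can only shrink as the degree grows, even once the model is clearly overfitting.

```python
import numpy as np

# Noisy samples from a quadratic; degrees 0/1 underfit, degree 9 overfits.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.1, x.size)

def train_mse(degree):
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Training error is monotone nonincreasing in model capacity.
errs = [train_mse(d) for d in (0, 1, 2, 9)]
```

The generalization error, by contrast, follows the U-shaped curve of the next slide: it falls while bias dominates, then rises again as variance takes over.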
15. **Bias/Variance Tradeoff**: (T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer, New York, 2001.)
16. **Is training error a good estimator of generalization error?**
17. **Chernoff Bound (|H| finite)**: Lemma: assume $Z_1, \dots, Z_m$ are drawn i.i.d. from Bernoulli($\phi$), let $\hat{\phi} = \frac{1}{m}\sum_i Z_i$, and let $\gamma > 0$ be fixed. Then $P(|\phi - \hat{\phi}| > \gamma) \le 2\exp(-2\gamma^2 m)$. Based on this lemma, one can show that with probability $1 - \delta$, $\varepsilon(\hat{h}) \le \min_h \varepsilon(h) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$ (k = number of hypotheses). (Andrew Ng, Part VI: Learning Theory, CS229 Lecture Notes, 2008.)
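The lemma on slide 17 is easy to sanity-check by simulation (the parameter values below are arbitrary choices for illustration): draw many Bernoulli samples, measure how often the empirical mean strays more than γ from φ, and compare against the bound 2·exp(−2γ²m).

```python
import numpy as np

rng = np.random.default_rng(1)
phi, m, gamma, trials = 0.3, 100, 0.1, 5000

samples = rng.random((trials, m)) < phi        # Bernoulli(phi) draws
phi_hat = samples.mean(axis=1)                 # one estimate per trial
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2.0 * np.exp(-2.0 * gamma ** 2 * m)    # Hoeffding/Chernoff bound
```

The bound is loose (here roughly 0.27 against a much smaller observed frequency), but it holds uniformly, which is what the union-bound argument over k hypotheses needs.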
18. **Chernoff Bound (|H| infinite)**: VC dimension d is the size of the largest set that H can shatter; e.g., for H = linear classifiers in 2-D, VC(H) = 3. With probability at least $1 - \delta$, $\varepsilon(\hat{h}) \le \varepsilon(h^*) + O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right)$. (Andrew Ng, Part VI: Learning Theory, CS229 Lecture Notes, 2008.)
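The VC(H) = 3 claim for 2-D linear classifiers can be checked by brute force (the three points and the coarse weight grid are my own choices): every one of the 2³ = 8 labelings of three non-collinear points must be realizable by some sign(w₁x₁ + w₂x₂ + b).

```python
import numpy as np
from itertools import product

# Three non-collinear points in 2-D
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
grid = np.arange(-2.0, 2.5, 0.5)   # coarse search over (w1, w2, b)

def separable(labels):
    # Is there a linear classifier realizing this labeling?
    for w1, w2, b in product(grid, grid, grid):
        scores = points @ np.array([w1, w2]) + b
        if np.all(labels * scores > 0):
            return True
    return False

shattered = all(separable(np.array(lab)) for lab in product([-1.0, 1.0], repeat=3))
```

Four points in general position cannot all be shattered (the XOR labeling fails), which is why VC(H) is exactly 3 and the bound above applies with d = 3.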
19. **Model Selection**: cross-validation is an estimator of the generalization error.
20. **K-fold**: train on k − 1 pieces, validate on the remaining piece (this yields one test-error estimate). Average the k test-error estimates, say 2%; then 2% is the estimated generalization error for this learner. Leave-one-out cross-validation is the m-fold case, with m = training-sample size. [Figure: one fold held out for validation, the rest used for training.]
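The K-fold scheme of slide 20 can be sketched in a few lines (the splitter below is a minimal illustration; names like `k_fold_splits` are my own): partition the m indices into k pieces, and let each piece serve once as the validation set.

```python
# Minimal K-fold splitter: each fold is the validation piece exactly once.
def k_fold_splits(n, k):
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]   # k roughly equal pieces
    splits = []
    for i in range(k):
        test = folds[i]
        held = set(test)
        train = [j for j in indices if j not in held]
        splits.append((train, test))
    return splits

splits = k_fold_splits(10, 5)   # 5-fold split of 10 samples
```

Averaging the k validation errors over these splits gives the single cross-validation estimate the slide describes; setting k = n recovers leave-one-out.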
21. **Model Selection**: loop over candidate parameters: pick one setting, e.g. C = 2.0; run cross-validation to get an error estimate; pick C_best (the value with the minimal error estimate) as the parameter.
22. **Multiclass SVM**: One-against-one: there are k(k − 1)/2 binary SVMs (1 v 2, 1 v 3, ...); to predict, each SVM votes between its two classes and the class with the most votes in the poll wins (illustrated for k = 3). One-against-all: there are k binary SVMs (1 v rest, 2 v rest, ...); to predict, evaluate each decision value and pick the largest. There is also a multiclass SVM formulated as one optimization problem. (Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.)
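One-against-one voting can be sketched with a stand-in pairwise rule (an assumption for illustration: each "binary SVM" below just picks the class whose hypothetical center is nearer to x; a real system would use trained SVM decision functions):

```python
from itertools import combinations

centers = {0: 0.0, 1: 5.0, 2: 10.0}   # hypothetical 1-D class centers

def pairwise_winner(a, b, x):
    # Stand-in for a trained binary SVM deciding between classes a and b
    return a if abs(x - centers[a]) < abs(x - centers[b]) else b

def ovo_predict(x):
    votes = {c: 0 for c in centers}
    for a, b in combinations(centers, 2):   # k*(k-1)/2 pairwise votes
        votes[pairwise_winner(a, b, x)] += 1
    return max(votes, key=votes.get)

pred = ovo_predict(4.8)
```

For k = 3 this casts three votes, matching the poll on the slide; the class collecting the most votes is the prediction.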
23. **Multiclass SVM (2/2)**: DAGSVM (Directed Acyclic Graph SVM).
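DAGSVM evaluation can be sketched as candidate-list elimination: starting from all k classes, each of the k − 1 DAG nodes applies one pairwise classifier and removes the losing class. The pairwise rule here is the same nearest-center stand-in as above (an assumption, not a trained SVM):

```python
centers = {0: 0.0, 1: 5.0, 2: 10.0}   # hypothetical 1-D class centers

def pairwise_winner(a, b, x):
    # Stand-in for a trained binary SVM deciding between classes a and b
    return a if abs(x - centers[a]) < abs(x - centers[b]) else b

def dag_predict(x):
    candidates = sorted(centers)
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]      # one DAG node
        loser = b if pairwise_winner(a, b, x) == a else a
        candidates.remove(loser)                  # k - 1 evaluations total
    return candidates[0]

pred = dag_predict(4.8)
```

Compared with one-against-one voting, the DAG reaches a decision with only k − 1 classifier evaluations instead of k(k − 1)/2, which is the scheme's main appeal at test time.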
24. **An Example: Image Classification**: process: encode each image as a labeled feature vector (lines such as "1 0:49 1:25 …"), split the data into training and test sets (shown as 3/4 and 1/4), train a k = 6 multiclass SVM, and measure accuracy on the test data.
25. **An Example: Image Classification**: results: the multiclass SVM was run 100 times for both kernels (linear and Gaussian); accuracy histogram.