Outline (2/2)
Some basics about learning theory
- Bias/variance tradeoff (underfitting vs. overfitting)
- Chernoff bound and VC dimension
- Model selection: cross validation
- Dimensionality reduction
Multiclass SVM
- One against one
- One against all
Image classification by SVM
- Process
- Results
Intuition: Margins
Functional margin: $\hat{\gamma}_i = y_i (w^T x_i + b)$
Geometric margin: $\gamma_i = y_i (w^T x_i + b) / \|w\|$
We feel more confident when the functional margin is larger. Note that rescaling $(w, b)$ does not change the separating plane.
Andrew Ng. Part V: Support Vector Machines. CS229 Lecture Notes (2008).
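The scaling behavior is easy to check numerically. A minimal sketch (the point, label, and plane below are made-up values, not from the slides):

```python
import numpy as np

# Made-up toy values: one training point x with label y, and a plane w.x + b = 0.
w = np.array([2.0, 1.0])
b = -1.0
x = np.array([2.0, 1.0])
y = 1

functional_margin = y * (w @ x + b)                       # gamma_hat = y (w^T x + b)
geometric_margin = functional_margin / np.linalg.norm(w)  # gamma = gamma_hat / ||w||

# Rescaling (w, b) by c > 0 rescales the functional margin ...
c = 10.0
scaled_functional = y * ((c * w) @ x + c * b)
# ... but leaves the geometric margin (and the plane itself) unchanged.
scaled_geometric = scaled_functional / np.linalg.norm(c * w)
```

Here `functional_margin` is 4.0 while `scaled_functional` is 40.0, yet the two geometric margins coincide.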
Maximizing the margin
Optimization problem: maximize the minimal geometric margin under the constraints
  $\max_{\gamma, w, b}\ \gamma$  s.t. $y_i(w^T x_i + b) \ge \gamma$, $\|w\| = 1$.
Introduce a scaling factor such that $\hat{\gamma} = 1$; the problem becomes
  $\min_{w, b}\ \tfrac{1}{2}\|w\|^2$  s.t. $y_i(w^T x_i + b) \ge 1$.
Lagrange duality
Primal optimization problem: $\min_w f(w)$ s.t. $g_i(w) \le 0$, $h_i(w) = 0$.
Generalized Lagrangian: $L(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_i \beta_i h_i(w)$.
Primal optimization problem (equivalent form): $p^* = \min_w \max_{\alpha, \beta:\ \alpha_i \ge 0} L(w, \alpha, \beta)$.
Dual optimization problem: $d^* = \max_{\alpha, \beta:\ \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$.
Dual problem
In general $d^* \le p^*$. Conditions under which equality holds: $f$ and the $g_i$ are convex, and the $h_i$ are affine (with the constraints strictly feasible). The optimum then satisfies the KKT conditions:
  $\partial L/\partial w = 0$, $\partial L/\partial \beta_i = 0$,
  $\alpha_i g_i(w) = 0$ (complementary slackness), $g_i(w) \le 0$, $\alpha_i \ge 0$.
Optimal margin classifiers
Primal: $\min_{w, b}\ \tfrac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \ge 1$.
Its Lagrangian: $L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\,[\,y_i(w^T x_i + b) - 1\,]$.
Its dual problem: $\max_\alpha\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$.
Kernel and feature mapping
Kernel: $K(x, z) = \phi(x)^T \phi(z)$. A valid (Mercer) kernel is symmetric with a positive semi-definite kernel matrix. For example, the Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$.
Loose intuition: a kernel measures the "similarity" between features.
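Both Mercer conditions can be checked numerically on a kernel matrix. A sketch with the Gaussian kernel on made-up random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random feature vectors (made-up data)

def gaussian_kernel(X, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for every pair of rows of X
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

K = gaussian_kernel(X)

symmetric = np.allclose(K, K.T)
psd = np.linalg.eigvalsh(K).min() > -1e-10  # eigenvalues >= 0 up to round-off
```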
Soft margin (L1 regularization)
$\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i$  s.t. $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$.
$C = \infty$ leads to the hard-margin SVM, Rychetsky (2001).
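To see the role of C concretely, here is a minimal soft-margin SVM trained by subgradient descent on the primal objective above (a sketch on made-up separable data; real solvers work on the dual, e.g. with SMO):

```python
import numpy as np

def fit_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # points inside the margin (slack xi_i > 0)
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up, linearly separable toy data; a large C approximates the hard margin.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = fit_soft_margin_svm(X, y, C=10.0)
train_acc = (np.sign(X @ w + b) == y).mean()
```

With a small C the same code tolerates margin violations rather than bending the boundary around them.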
Why doesn’t my model fit well on test data?
Some basics about learning theory
Bias/variance tradeoff: underfitting (high bias) vs. overfitting (high variance).
Training error: $\hat{\varepsilon}(h) = \frac{1}{m} \sum_{i=1}^m 1\{h(x_i) \ne y_i\}$
Generalization error: $\varepsilon(h) = P_{(x, y) \sim D}(h(x) \ne y)$
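A quick illustration of the two failure modes on made-up data: fitting polynomials of different degree to noisy quadratic samples (a sketch; the degrees and noise level are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 20)
x_test = np.linspace(-1, 1, 101)
truth = lambda x: x ** 2                      # the "true" target (made up)
y_train = truth(x_train) + rng.normal(0, 0.1, x_train.size)

def train_test_mse(degree):
    coef = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - truth(x_test)) ** 2)
    return train_mse, test_mse

tr_under, te_under = train_test_mse(0)   # high bias: both errors stay high
tr_over, te_over = train_test_mse(15)    # high variance: training error near 0
```

The degree-0 model cannot even fit the training data; the degree-15 model fits the noise, so its tiny training error says little about its generalization error.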
Bias/variance tradeoff
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2001.
Is training error a good estimator of generalization error?
Chernoff bound (|H| finite)
Lemma: Assume $Z_1, Z_2, \ldots, Z_m$ are drawn iid from Bernoulli($\phi$), let $\hat{\phi} = \frac{1}{m}\sum_{i=1}^m Z_i$, and let $\gamma > 0$ be fixed. Then
  $P(|\phi - \hat{\phi}| > \gamma) \le 2 \exp(-2\gamma^2 m)$.
Based on this lemma, one can show that with probability $1 - \delta$ (k = number of hypotheses), for all $h \in H$:
  $|\varepsilon(h) - \hat{\varepsilon}(h)| \le \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$.
Andrew Ng. Part VI: Learning Theory. CS229 Lecture Notes (2008).
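The lemma is easy to sanity-check by simulation (the values of φ, m, and γ below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.3, 200, 0.1, 10_000

# trials independent experiments, each drawing Z_1..Z_m ~ Bernoulli(phi)
Z = rng.random((trials, m)) < phi
phi_hat = Z.mean(axis=1)                      # one estimate per experiment

empirical = np.mean(np.abs(phi - phi_hat) > gamma)
bound = 2 * np.exp(-2 * gamma ** 2 * m)       # = 2 e^{-4}, about 0.037
```

The observed deviation frequency indeed stays below the bound; the bound is loose but, crucially, it shrinks exponentially in m.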
Chernoff bound (|H| infinite)
VC dimension d: the size of the largest set that H can shatter. E.g., for H = linear classifiers in 2-D, VC(H) = 3.
With probability at least $1 - \delta$:
  $|\varepsilon(h) - \hat{\varepsilon}(h)| \le O\!\left(\sqrt{\frac{d}{m} \log \frac{m}{d} + \frac{1}{m} \log \frac{1}{\delta}}\right)$.
Model selection
Cross validation: an estimator of generalization error.
K-fold: split the training set into k pieces; train on k−1 pieces and test on the remaining one (this gives one test-error estimate). Average the k test-error estimates, say 2%; then 2% is the estimated generalization error for this learner.
Leave-one-out cross validation: m-fold, where m = training sample size.
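The k-fold procedure itself is only a few lines. A sketch on made-up data with a stand-in learner (nearest centroid), since any classifier can be plugged into the same split-and-average loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_error(X_tr, y_tr, X_va, y_va):
    # Stand-in learner: classify each point by the nearer class centroid.
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_va - c1, axis=1)
            < np.linalg.norm(X_va - c0, axis=1)).astype(int)
    return (pred != y_va).mean()

def k_fold_cv_error(X, y, k=5):
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        va = folds[i]                                   # validate on fold i
        tr = np.concatenate(folds[:i] + folds[i + 1:])  # train on the rest
        errors.append(nearest_centroid_error(X[tr], y[tr], X[va], y[va]))
    return np.mean(errors)  # average of the k test-error estimates

# Made-up two-class data.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
cv_error = k_fold_cv_error(X, y, k=5)
```

Setting k = len(X) turns the same loop into leave-one-out cross validation.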
Model selection
Loop over the possible parameters:
- Pick one parameter setting, e.g. C = 2.0.
- Do cross validation to get an error estimate.
Pick the C_best (with the minimal error estimate) as the parameter.
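The selection loop in code (a sketch; ridge regression and its λ grid stand in for the SVM and its C grid, since the loop itself is identical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = X w_true + noise.
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(0, 0.5, 100)

def cv_error(lam, k=5):
    # k-fold CV estimate of test MSE for ridge regression with parameter lam.
    folds = np.array_split(np.arange(len(X)), k)
    errors = []
    for i in range(k):
        va = folds[i]
        tr = np.concatenate(folds[:i] + folds[i + 1:])
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        errors.append(np.mean((X[va] @ w - y[va]) ** 2))
    return np.mean(errors)

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=cv_error)  # parameter with minimal error estimate
```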
Dimensionality reduction
Which features are more "important"?
Wrapper model feature selection
- Forward/backward search: add/remove one feature at a time, then evaluate the model with the new feature set.
Filter feature selection
- Compute a score S(i) that measures how informative x_i is about the class label y; S(i) can be the correlation Corr(x_i, y), the mutual information MI(x_i, y), etc.
Principal Component Analysis (PCA)
Vector Quantization (VQ)
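A filter-selection sketch using the correlation score on made-up features (one informative column, one pure noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                # binary class labels

# Made-up features: column 0 is pure noise, column 1 tracks the label.
noise = rng.normal(size=n)
informative = y + rng.normal(0, 0.5, n)
X = np.column_stack([noise, informative])

# S(i) = |Corr(x_i, y)|: higher means x_i is more informative about y.
scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1])
                   for i in range(X.shape[1])])
ranking = np.argsort(scores)[::-1]       # feature indices, most informative first
```

The informative column scores far higher, so a filter that keeps the top-ranked features would discard the noise column.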
Multiclass SVM
One against one
- There are k(k−1)/2 binary SVMs (1 vs 2, 1 vs 3, …). To predict, each SVM votes between its 2 classes; the class with the most votes wins.
One against all
- There are k binary SVMs (1 vs rest, 2 vs rest, …). To predict, evaluate each classifier's decision value and pick the largest.
Multiclass SVM by solving ONE optimization problem:
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.
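A one-against-all sketch on made-up three-class data, with least-squares linear scorers standing in for the k binary SVMs (the evaluate-all-and-argmax logic is the point here, not the base learner):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # made-up class centers
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in means])
y = np.repeat(np.arange(k), 30)

Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

# One scorer per class: +1 targets for "class c", -1 for "rest".
W = np.stack([np.linalg.lstsq(Xb, np.where(y == c, 1.0, -1.0), rcond=None)[0]
              for c in range(k)])

pred = (Xb @ W.T).argmax(axis=1)  # evaluate all k scorers, pick the largest
train_accuracy = (pred == y).mean()
```

One-against-one would instead train k(k−1)/2 pairwise classifiers on the class-pair subsets and replace the argmax with a vote count.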
Image Classification by SVM: Process
K = 6 classes. Split the data into 3/4 training and 1/4 test, encode each image as a labeled sparse feature vector (e.g. "1 0:49 1:25 …"), train the multiclass SVM, and measure accuracy on the test data.
Image Classification by SVM: Results
Run the multiclass SVM 100 times for both kernels (linear and Gaussian). Accuracy histogram:
Image Classification by SVM
What if we feed the machine object data it has never seen before?
~ Thank You ~
Shao-Chuan Wang
CITI, Academia Sinica