- 1. Digg Data: Support Vector Machine without tears. Ankit Sharma, www.diggdata.in
- 2. Content: SVM and its applications; Basic SVM (hyperplane, understanding of the basics, optimization); Soft-margin SVM; Non-linear decision boundary; SVMs in “loss + penalty” form; Kernel method (Gaussian kernel); SVM usage beyond classification. Thursday, August 7, 2014 | WITHOUT TEARS SERIES | www.diggdata.in
- 3. SVM: Support Vector Machine. In machine learning, support vector machines are supervised learning models, with associated learning algorithms, that analyze data and recognize patterns; they are used for classification and regression analysis. Properties of SVMs: duality, kernels, margin, convexity, sparseness.
- 4. Applications of SVM: time series analysis, classification, anomaly detection, regression, machine vision, text categorization.
- 5. Basic concept of SVM: find a linear decision surface (“hyperplane”) that separates the classes and has the largest distance (i.e., the largest “gap” or “margin”) between the border-line data points of the two classes (i.e., the “support vectors”).
- 6. Hyperplane as a decision boundary. A hyperplane is a linear decision surface that splits the space into two parts. A hyperplane therefore acts as a binary classifier: each point is assigned to one of two classes according to the side it falls on.
- 7. Equation of a hyperplane. A hyperplane is defined by a point $P_0$ on it and a vector $w$ perpendicular to the plane at that point. A point $x$ lies on the hyperplane exactly when $w \cdot (x - P_0) = 0$, i.e. $w \cdot x + b = 0$ with $b = -w \cdot P_0$.
- 8. Understanding the basics. $g(x) = w^T x + b$ is a linear function. The set $w^T x + b = 0$ is a hyperplane in the feature space (axes $x_1$, $x_2$), with $w^T x + b > 0$ on one side and $w^T x + b < 0$ on the other. The unit-length normal vector of the hyperplane is $n = w / \|w\|$.
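The linear function and its unit normal can be sketched in a few lines of plain Python. The weights `w`, the offset `b`, and the test points are made-up values for illustration only, and the helper names are hypothetical.

```python
import math

def g(w, b, x):
    """Linear discriminant g(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def unit_normal(w):
    """Unit-length normal vector n = w / ||w|| of the hyperplane g(x) = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

def classify(w, b, x):
    """Return +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if g(w, b, x) > 0 else -1

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 (values chosen for illustration).
w, b = [3.0, 4.0], -5.0
print(classify(w, b, [2.0, 2.0]))  # g = 6 + 8 - 5 = 9, positive side
print(classify(w, b, [0.0, 0.0]))  # g = -5, negative side
print(unit_normal(w))              # w divided by its length 5
```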
- 9. Understanding the basics. How should these points (labelled +1 and −1, plotted against $x_1$ and $x_2$) be classified with a linear discriminant function so as to minimize the error rate? There is an infinite number of answers. Which one is the best?
- 10. Understanding the basics. The linear discriminant function (classifier) with the maximum margin is the best. The margin is the width by which the boundary could be widened before hitting a data point; it acts as a “safe zone”. Why is it the best? It is robust to outliers and therefore has strong generalization ability.
- 11. Understanding the basics. Given a set of data points $\{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, where $w^T x_i + b > 0$ for $y_i = +1$ and $w^T x_i + b < 0$ for $y_i = -1$. With a scale transformation on both $w$ and $b$, the above is equivalent to $w^T x_i + b \ge +1$ for $y_i = +1$ and $w^T x_i + b \le -1$ for $y_i = -1$.
- 12. Understanding the basics. The support vectors $x^+$ and $x^-$ lie on the margin boundaries, so we know that $w^T x^+ + b = +1$ and $w^T x^- + b = -1$. The margin width is $M = (x^+ - x^-) \cdot n = (x^+ - x^-) \cdot \frac{w}{\|w\|} = \frac{2}{\|w\|}$.
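As a quick numerical check of the margin formula, the sketch below starts from a point on the plane $w^T x + b = -1$, moves a distance $M = 2/\|w\|$ along the unit normal, and confirms that it lands on the plane $w^T x + b = +1$. The hyperplane and the starting point are assumed values, not taken from the slides.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Width M = 2 / ||w|| between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

w, b = [3.0, 4.0], -5.0           # hypothetical hyperplane
x_minus = [0.0, 1.0]              # lies on w.x + b = -1
M = margin_width(w)               # 2 / 5
norm = math.sqrt(dot(w, w))
n = [wi / norm for wi in w]       # unit normal vector
x_plus = [xm + M * ni for xm, ni in zip(x_minus, n)]

print(M)
print(dot(w, x_minus) + b)        # value on the negative margin plane
print(dot(w, x_plus) + b)         # value after stepping M along n
```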
- 13. Formulation: maximize $\frac{2}{\|w\|}$ such that $w^T x_i + b \ge +1$ for $y_i = +1$ and $w^T x_i + b \le -1$ for $y_i = -1$.
- 14. Formulation (equivalent): minimize $\frac{1}{2}\|w\|^2$ such that $w^T x_i + b \ge +1$ for $y_i = +1$ and $w^T x_i + b \le -1$ for $y_i = -1$.
- 15. Formulation (compact): minimize $\frac{1}{2}\|w\|^2$ such that $y_i (w^T x_i + b) \ge 1$ for all $i$.
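The compact constraint is easy to check numerically. The sketch below, on a made-up four-point dataset with hypothetical helper names, tests whether a candidate $(w, b)$ satisfies $y_i (w^T x_i + b) \ge 1$ for every point and lists the points where the constraint holds with equality (the support vectors for that candidate).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def functional_margins(w, b, X, y):
    """Return y_i * (w . x_i + b) for every sample."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

def is_feasible(w, b, X, y, tol=1e-9):
    """True if every sample satisfies y_i * (w . x_i + b) >= 1."""
    return all(m >= 1.0 - tol for m in functional_margins(w, b, X, y))

def support_vectors(w, b, X, y, tol=1e-9):
    """Samples whose constraint is active, i.e. y_i * (w . x_i + b) == 1."""
    margins = functional_margins(w, b, X, y)
    return [xi for xi, m in zip(X, margins) if abs(m - 1.0) <= tol]

# Toy data (assumed for illustration): two points per class.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

w, b = [0.5, 0.5], -1.0
print(is_feasible(w, b, X, y))
print(support_vectors(w, b, X, y))
print(is_feasible([0.25, 0.25], -0.5, X, y))  # same direction, scaled too small
```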
- 16. Basics of optimization: convex functions. A function is called convex if, for any two points in the interval, its graph lies on or below the straight line segment connecting them. Property: any local minimum is a global minimum.
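The definition can be tested on sample points: for every pair $a$, $c$ and every $t \in [0, 1]$, check that $f(ta + (1-t)c) \le t f(a) + (1-t) f(c)$. The sketch below is only a sampled check (it can disprove convexity, not prove it), and the function name and sample grid are assumptions.

```python
def is_convex_on_samples(f, points, steps=11, tol=1e-9):
    """Check the chord inequality f(t*a + (1-t)*c) <= t*f(a) + (1-t)*f(c)
    for all sample pairs (a, c) and evenly spaced t in [0, 1]."""
    for a in points:
        for c in points:
            for k in range(steps):
                t = k / (steps - 1)
                x = t * a + (1 - t) * c
                if f(x) > t * f(a) + (1 - t) * f(c) + tol:
                    return False
    return True

samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(is_convex_on_samples(lambda x: x * x, samples))     # parabola: passes
print(is_convex_on_samples(lambda x: -(x * x), samples))  # inverted parabola: fails
```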
- 17. Basics of optimization: quadratic programming. Quadratic programming (QP) is a special optimization problem: the function to optimize (the “objective”) is quadratic, subject to linear constraints. Convex QP problems have convex objective functions. These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).
- 18. SVM optimization problem: primal formulation. Minimizing $\frac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$ for every sample is called the “primal formulation of linear SVMs”. It is a convex quadratic programming (QP) optimization problem with $n$ variables ($w_j$, $j = 1, \ldots, n$), where $n$ is the number of features in the dataset.
- 19. SVM optimization problem: dual formulation. The previous problem can be recast in the so-called “dual form”, giving rise to the “dual formulation of linear SVMs”. Apply the method of Lagrange multipliers, with one multiplier $\alpha_i$ per constraint. We need to minimize this Lagrangian with respect to $w$ and $b$ and simultaneously require that the derivatives with respect to the $\alpha_i$ vanish, all subject to the constraints $\alpha_i \ge 0$.
- 20. SVM optimization problem: dual formulation (contd.). It is also a convex quadratic programming problem, but with $N$ variables ($\alpha_i$, $i = 1, \ldots, N$), where $N$ is the number of samples.
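For reference, the standard textbook form of this derivation is sketched below; the notation is an assumption and may differ from the original slides. The Lagrangian of the hard-margin primal problem is

$$
L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \qquad \alpha_i \ge 0 .
$$

Setting the derivatives with respect to $w$ and $b$ to zero gives

$$
w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0 ,
$$

and substituting these back yields the dual problem in the $N$ variables $\alpha_i$:

$$
\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{N} \alpha_i y_i = 0 .
$$

The data enter only through the dot products $x_i^T x_j$.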
- 21. SVM optimization problem: benefits of using the dual formulation. 1) There is no need to access the original data, only the dot products between data points. 2) The number of free parameters is bounded by the number of support vectors and not by the number of variables (beneficial for high-dimensional problems).
- 22. Non-linearly separable data: “soft-margin” linear SVM. Assign a “slack variable” $\xi_i \ge 0$ to each instance; it is 0 when the instance satisfies the margin constraint and otherwise measures how far the instance falls on the wrong side of its margin boundary ($\xi_i > 1$ means the instance is misclassified). Primal formulation: minimize $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. Dual formulation: maximize $\sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$. When C is very large, the soft-margin SVM is equivalent to the hard-margin SVM. When C is very small, we admit misclassifications in the training data in exchange for a w-vector with a small norm. C has to be selected for the data distribution at hand.
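To make the role of the slack variables and of C concrete, the sketch below computes $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$ and the soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ for a fixed candidate $(w, b)$. The dataset (four clean points plus one deliberately mislabelled point) and the helper names are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Slack xi_i = max(0, 1 - y_i * (w . x_i + b)) for every sample."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """Primal soft-margin objective 0.5 * ||w||^2 + C * sum(xi_i)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))

# Toy data: the last point carries label -1 but sits among the +1 points.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (1.5, 1.5)]
y = [1, 1, -1, -1, -1]
w, b = [0.5, 0.5], -1.0

print(slacks(w, b, X, y))                        # only the last point needs slack
print(soft_margin_objective(w, b, X, y, C=1.0))
print(soft_margin_objective(w, b, X, y, C=100.0))
```

With a large C the single violation dominates the objective, which is why a large C pushes the solution toward hard-margin behaviour.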
- 23. SVMs in “loss + penalty” form. Many statistical learning algorithms (including SVMs) search for a decision function by solving the following optimization problem: minimize (Loss + λ · Penalty), where Loss measures the error of fitting the data, Penalty penalizes the complexity of the learned function, and λ is the regularization parameter that balances Loss and Penalty. Overfitting leads to poor generalization. The soft-margin SVM can also be stated in this form, with the hinge loss $\sum_i \max(0, 1 - y_i f(x_i))$ as Loss and $\|w\|^2$ as Penalty, where $f(x) = w^T x + b$.
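One way to see the “loss + penalty” view in action is to minimize the mean hinge loss plus $\lambda \|w\|^2$ directly with subgradient descent. The sketch below is a minimal illustration, not the QP solver discussed earlier; the averaging of the loss, the step size, the number of epochs, the toy data, and all function names are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def objective(w, b, X, y, lam):
    """Mean hinge loss plus lam * ||w||^2 (Loss + lambda * Penalty)."""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return hinge / len(X) + lam * dot(w, w)

def train(X, y, lam=0.01, lr=0.01, epochs=2000):
    """Full-batch subgradient descent on the objective above."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [2.0 * lam * wj for wj in w]   # gradient of the penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:      # hinge is active for this sample
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Sign of the decision function w . x + b."""
    return 1 if dot(w, x) + b > 0 else -1

X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = train(X, y)
print(w, b)
print([predict(w, b, xi) for xi in X])
```

A larger `lam` favours a smaller $\|w\|$ (a wider margin) at the cost of more hinge loss, which mirrors the role of a small C in the soft-margin formulation.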
- 24. Nonlinear decision boundary: handled with the kernel method.
- 25. Kernel method. Kernel methods involve (1) a nonlinear transformation of the data to a higher-dimensional feature space induced by a Mercer kernel, and (2) detection of optimal linear solutions in the kernel feature space. Transformation to a higher-dimensional space is expected to help convert nonlinear relations into linear relations (Cover’s theorem): nonlinearly separable patterns become linearly separable patterns, nonlinear regression becomes linear regression, and nonlinear separation of clusters becomes linear separation of clusters. Pattern analysis methods are implemented in such a way that the kernel feature space representation is not explicitly required; they involve computation of pair-wise inner products only. The pair-wise inner products are computed efficiently, directly from the original representation of the data, using a kernel function (the kernel trick).
- 26. Kernel trick. Not every function $\mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ can be a valid kernel; it has to satisfy the so-called Mercer conditions. Otherwise, the underlying quadratic program may not be solvable.
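A small worked example of the kernel trick, using the degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on 2-D inputs as an assumed example (it is not taken from the slides). Its explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, and the sketch checks that the kernel computed in the original space equals the inner product in the feature space.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2, computed in input space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for the kernel above (2-D inputs only)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, 4.0]
print(poly_kernel(x, z))     # (1*3 + 2*4)^2
print(dot(phi(x), phi(z)))   # same value, up to rounding, via the feature space
```

The kernel never builds $\phi(x)$, which is what makes high-dimensional (or infinite-dimensional) feature spaces usable in practice.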
- 27. Popular kernels
- 28. Gaussian kernel. Consider the Gaussian kernel. Geometrically, this is a “bump” or “cavity” centered at the training data point $x_j$. The resulting mapping function is a combination of bumps and cavities.
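The “bumps and cavities” picture can be sketched as follows, assuming the common parametrization $K(x, x_j) = \exp(-\|x - x_j\|^2 / (2\sigma^2))$ (the slide's own formula may use a different scaling). The decision value is written as $f(x) = \sum_j c_j K(x, x_j) + b$, where a positive coefficient $c_j$ gives a bump and a negative one gives a cavity; the centers and coefficients below are made up.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision_value(x, centers, coeffs, b=0.0, sigma=1.0):
    """Sum of kernel bumps (coeff > 0) and cavities (coeff < 0) plus offset b."""
    return sum(c * gaussian_kernel(x, xj, sigma) for c, xj in zip(coeffs, centers)) + b

centers = [(0.0, 0.0), (3.0, 0.0)]   # hypothetical training points
coeffs = [1.0, -1.0]                 # bump at the first, cavity at the second

print(gaussian_kernel((0.0, 0.0), (0.0, 0.0)))        # kernel of a point with itself
print(decision_value((0.0, 0.0), centers, coeffs))    # near the bump: positive
print(decision_value((3.0, 0.0), centers, coeffs))    # near the cavity: negative
```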
- 29. SVM usage beyond classification: regression analysis (ε-support vector regression), anomaly detection (one-class SVM), clustering analysis (Support Vector Domain Description).
- 30. Thank you