1. Advanced Computing Seminar: Data Mining and Its Industrial Applications — Chapter 8 — Support Vector Machines
  - Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr
  - Knowledge and Software Engineering Lab
  - Advanced Computing Research Centre
  - School of Computer and Information Science
  - University of South Australia
2. Outline
  - Introduction
  - Support Vector Machine
  - Non-linear Classification
  - SVM and PAC
  - Applications
  - Summary
3. History
  - SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
  - SVMs were introduced by Boser, Guyon and Vapnik at COLT-92.
  - Initially popularized in the NIPS community, SVMs are now an important and active field of machine learning research.
  - Special issues of the Machine Learning journal and the Journal of Machine Learning Research have been devoted to them.
4. What is SVM?
  - SVMs are learning systems that:
    - use a hypothesis space of linear functions
    - in a high-dimensional feature space (kernel functions),
    - trained with a learning algorithm from optimization theory (Lagrangian duality),
    - implementing a learning bias derived from statistical learning theory (generalisation).
5.–9. Linear Classifiers
  - The linear classifier f(x, w, b) = sign(w · x − b).
  - [Figures: a two-dimensional data set with points labelled +1 and −1, shown with several different candidate separating lines.] How would you classify this data?
  - Copyright © 2001, 2003, Andrew W. Moore
10. Maximum Margin
  - f(x, w, b) = sign(w · x − b)
  - The maximum margin linear classifier is the linear classifier with the maximum margin.
  - This is the simplest kind of SVM, called a linear SVM (LSVM).
  - Copyright © 2001, 2003, Andrew W. Moore
11. Model of Linear Classification
  - Binary classification is frequently performed by using a real-valued hypothesis function f(x) = ⟨w · x⟩ + b.
  - The input x is assigned to the positive class if f(x) ≥ 0,
  - otherwise to the negative class.
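A minimal sketch of this decision rule in Python; the weight vector w and the bias b below are made-up illustrative values, not taken from the slides.

```python
import numpy as np

# Made-up parameters of a hyperplane (w, b); any values would do for the sketch.
w = np.array([0.4, -1.2])
b = 0.5

def classify(x, w, b):
    """Assign +1 if <w, x> + b >= 0, otherwise -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([2.0, 1.0]), w, b))   # lands on the positive side
print(classify(np.array([-1.0, 2.0]), w, b))  # lands on the negative side
```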
12. The Concept of a Hyperplane
  - For a binary, linearly separable training set, we can find at least one hyperplane (w, b) which divides the space into two half-spaces.
  - Definition of the hyperplane: {x : ⟨w · x⟩ + b = 0}.
13. Tuning the Hyperplane (w, b)
  - The perceptron algorithm, proposed by Frank Rosenblatt in 1956.
  - Preliminary definition: the functional margin of an example (x_i, y_i) is y_i(⟨w · x_i⟩ + b).
  - A positive functional margin implies correct classification of (x_i, y_i).
14. The Perceptron Algorithm
  - Novikoff's theorem: for a linearly separable training set with margin γ and R = max_i ||x_i||, the number of mistakes is at most (2R/γ)² (see Cristianini and Shawe-Taylor, Chapter 2).
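The algorithm itself is only named on these slides; the sketch below is one common form of it (the classic Rosenblatt update on a made-up toy data set — the data, learning rate, and stopping rule are illustrative assumptions, not from the slides).

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Cycle through the data, updating (w, b) whenever the functional
    margin y_i(<w, x_i> + b) is not positive, i.e. on every mistake."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake: move the hyperplane
                w += eta * yi * xi
                b += eta * yi
                errors += 1
                mistakes += 1
        if errors == 0:                        # converged on separable data
            break
    return w, b, mistakes

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```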
15. The Geometric Margin
  - The Euclidean distance of an example (x_i, y_i) from the decision boundary is its geometric margin γ_i = y_i(⟨w · x_i⟩ + b) / ||w||.
16. The Geometric Margin
  - The margin of a training set S is γ = min_i γ_i, the smallest geometric margin over its examples.
  - Maximal margin hyperplane: a hyperplane realising the maximum geometric margin.
  - A linear classifier is optimal if it forms the maximal margin hyperplane.
17. How to Find the Optimal Solution?
  - The drawback of the perceptron algorithm: it may give a different solution depending on the order in which the examples are processed.
  - The advantage of SVMs: this kind of learning machine tunes the solution using optimization theory.
18. The Maximal Margin Classifier
  - The simplest model of SVM: it finds the maximal margin hyperplane in a chosen kernel-induced feature space.
  - A convex optimization problem: minimizing a quadratic function under linear inequality constraints.
19. Support Vector Classifiers
  - Support vector machines (Cortes and Vapnik, 1995): well suited for high-dimensional data; binary classification.
  - Training set: D = {(x_i, y_i), i = 1, …, n}, with x_i ∈ R^m and y_i ∈ {−1, 1}.
  - Linear discriminant classifier: separating hyperplane {x : g(x) = wᵀx + w₀ = 0}, with model parameters w ∈ R^m and w₀ ∈ R.
20. Formalizing the Geometric Margin
  - Assume the functional margin is normalised so that y_i(⟨w · x_i⟩ + b) ≥ 1, with equality for the examples closest to the hyperplane.
  - The geometric margin is then γ = 1 / ||w||.
  - In order to find the maximum margin, we must find the minimum of ||w||.
21. Minimizing the Norm
  - Because the geometric margin is 1 / ||w||, maximizing it is equivalent to minimizing the squared norm ⟨w · w⟩.
  - We can therefore re-formalize the optimization problem as: minimize (1/2)⟨w · w⟩ subject to y_i(⟨w · x_i⟩ + b) ≥ 1 for all i.
22. Minimizing the Norm
  - Use the Lagrangian function L(w, b, α) = (1/2)⟨w · w⟩ − Σ_i α_i [y_i(⟨w · x_i⟩ + b) − 1], with α_i ≥ 0.
  - Setting the derivatives with respect to w and b to zero, we obtain w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.
  - Resubstituting into the primal, we obtain the dual objective W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i · x_j⟩.
23. Minimizing the Norm
  - Finding the minimum of the primal is equivalent to finding the maximum of the dual W(α) subject to α_i ≥ 0 and Σ_i α_i y_i = 0.
  - Strategies for this optimization: decomposition; Sequential Minimal Optimization (SMO).
24. The Support Vectors
  - The Karush–Kuhn–Tucker condition of the optimization problem states that α_i [y_i(⟨w · x_i⟩ + b) − 1] = 0 for all i.
  - This implies that α_i is non-zero only for inputs x_i whose functional margin is exactly one,
  - which in turn implies that they lie closest to the hyperplane.
  - The corresponding x_i (those with α_i > 0) are called support vectors.
25. The Optimal Hypothesis (w, b)
  - The two parameters can be obtained from the dual solution: w = Σ_i α_i y_i x_i, and b = y_sv − ⟨w · x_sv⟩ for any support vector (x_sv, y_sv).
  - The hypothesis is f(x) = sign(Σ_i α_i y_i ⟨x_i · x⟩ + b).
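As a hedged illustration of slides 23–25, the sketch below uses scikit-learn's SVC (whose libsvm backend is an SMO-style solver) as a stand-in optimizer: it fits a near hard-margin linear SVM on toy data, reads off the support vectors, w and b, and checks that the dual-form hypothesis matches the predictions. The toy data and the large C value are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data; a very large C approximates the hard margin.
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]   # offset b
print("support vectors:", clf.support_vectors_)
print("w =", w, "b =", b)

# The hypothesis f(x) = sign(sum_i alpha_i y_i <x_i, x> + b) agrees with predict().
alpha_y = clf.dual_coef_[0]                  # alpha_i * y_i for the support vectors
x_new = np.array([1.0, 0.5])
f = np.sign(alpha_y @ (clf.support_vectors_ @ x_new) + b)
print(f, clf.predict([x_new])[0])
```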
26. Soft Margin Optimization
  - The main problem with the maximal margin classifier is that it always produces a perfectly consistent hypothesis (a hypothesis with no training error), which is fragile in the presence of noise.
  - Relax the boundary: allow some examples to violate the margin by introducing slack variables ξ_i ≥ 0, requiring y_i(⟨w · x_i⟩ + b) ≥ 1 − ξ_i and penalising Σ_i ξ_i in the objective.
27. Non-linear Classification
  - The problem: the maximal margin classifier is an important concept, but it cannot be used in many real-world problems, because in general there is no linear separation in the input space.
  - The solution: map the data into another space in which it can be separated linearly.
28. A Learning Machine
  - A learning machine f takes an input x and transforms it, somehow using a vector of adjustable parameters (weights) α, into a predicted output y_est = ±1.
29.–30. Some Definitions
  - Given some machine f, and under the assumptions that all training points (x_k, y_k) were drawn i.i.d. from some distribution and that future test points will be drawn from the same distribution, define (official terminology):
  - TESTERR(α): the expected fraction of future test points misclassified by f(x, α);
  - TRAINERR(α): the fraction of the training set misclassified by f(x, α), where R = number of training-set data points.
31. Vapnik-Chervonenkis Dimension
  - Given some machine f, let h be its VC dimension.
  - h is a measure of f's power (h does not depend on the choice of training set).
  - Vapnik showed that, with probability 1 − η, TESTERR(α) ≤ TRAINERR(α) + sqrt( (h (ln(2R/h) + 1) + ln(4/η)) / R ).
  - This gives us a way to estimate the error on future data based only on the training error and the VC dimension of f.
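A small sketch of evaluating this bound numerically; the training error, VC dimension h, sample size R and confidence level η used below are illustrative numbers only.

```python
import math

def vc_bound(train_err, h, R, eta=0.05):
    """Vapnik's bound from the slide above:
    TESTERR <= TRAINERR + sqrt((h * (ln(2R/h) + 1) + ln(4/eta)) / R)."""
    vc_confidence = math.sqrt((h * (math.log(2 * R / h) + 1) + math.log(4 / eta)) / R)
    return train_err + vc_confidence

# Illustrative numbers: 5% training error, h = 10, 1000 training points.
print(vc_bound(0.05, h=10, R=1000))
```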
32. Structural Risk Minimization
  - Consider a sequence of machines f_1, f_2, …, f_n whose sets of representable functions are nested, so that each contains the previous one; then h(f_1) ≤ h(f_2) ≤ … ≤ h(f_n).
  - We are trying to decide which machine to use.
  - We train each machine and make a table with, for each f_i: its TRAINERR, its VC-confidence term, and the probable upper bound on TESTERR; the choice is the machine with the lowest bound.
  - [Table of machines f_1 … f_6 with columns: i, f_i, TRAINERR, VC-Conf, probable upper bound on TESTERR, Choice.]
33. Kernel-Induced Feature Space
  - Map the data from the input space X into a feature space F via a mapping φ : X → F.
34. Implicit Mapping into Feature Space
  - For a data set that is not linearly separable, we can modify the hypothesis so that the data is mapped implicitly into another feature space: f(x) = Σ_i α_i y_i ⟨φ(x_i) · φ(x)⟩ + b.
35. Kernel Function
  - A kernel is a function K such that, for all x, z ∈ X, K(x, z) = ⟨φ(x) · φ(z)⟩ for some mapping φ from X into a feature space.
  - The benefit: it solves the computational problem of working with many dimensions, because the feature vectors never have to be computed explicitly.
36. Kernel Function
37. The Polynomial Kernel
  - This kind of kernel represents the inner product of two vectors (points) in a feature space whose dimension grows rapidly with the polynomial degree.
  - For example, the degree-2 kernel K(x, z) = ⟨x · z⟩² on R² corresponds to the explicit feature map φ(x) = (x₁², x₂², √2·x₁x₂).
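A quick numerical check of this example (degree-2 homogeneous polynomial kernel on R²; the particular vectors are made up): the kernel value in input space matches the inner product in the explicit feature space.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z, degree=2):
    return np.dot(x, z) ** degree

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(poly_kernel(x, z))          # (1*3 + 2*0.5)^2 = 16
print(np.dot(phi(x), phi(z)))     # same value, computed in the feature space
```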
40. Text Categorization
  - Inductive learning. Input: training documents represented as feature vectors, together with their class labels. Output: a classifier f(x) = confidence(class).
  - In the case of text classification, the attributes are words in the document, and the classes are the categories.
41. Properties of Text-Classification Tasks
  - High-dimensional feature space.
  - Sparse document vectors.
  - High level of redundancy.
42. Text Representation and Feature Selection
  - Binary features (word present or absent).
  - Term frequency: the number of times a word occurs in the document.
  - Inverse document frequency: IDF(w) = log(n / DF(w)), where n is the total number of documents and DF(w) is the number of documents the word occurs in.
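A hand-rolled sketch of these three representations on a tiny made-up corpus; the document texts are placeholders, not the Reuters data discussed later.

```python
import math
from collections import Counter

docs = [
    "rates rise as the bank lifts the prime rate",
    "the group sees world trade growing this year",
    "discount rate cut lifts bank shares",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

df = Counter(w for doc in tokenized for w in set(doc))   # document frequency DF(w)
idf = {w: math.log(n / df[w]) for w in df}               # inverse document frequency

for doc in tokenized:
    tf = Counter(doc)                                    # term frequency
    binary = {w: 1 for w in tf}                          # binary (presence) feature
    tfidf = {w: tf[w] * idf[w] for w in tf}              # TF x IDF weight
    print(sorted(tfidf.items(), key=lambda kv: -kv[1])[:3])
```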
44. Learning SVMs
  - To learn the vector of feature weights, the following models are used:
  - linear SVMs;
  - polynomial classifiers;
  - radial basis functions.
45. Processing
  - Text files are processed to produce a vector of words.
  - Select the 300 words with the highest mutual information with each category (after removing stopwords).
  - A separate classifier is learned for each category.
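A sketch of a comparable pipeline in scikit-learn, offered only as an illustration: the toy texts, labels, and the reduced k are placeholders standing in for the Reuters data and the 300-feature selection described above, and mutual_info_classif is used as a stand-in for the slide's mutual-information scoring. Nesting the feature selection inside the one-vs-rest wrapper mirrors the per-category selection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus and category labels.
texts = ["bank lifts prime rate", "world trade grows this year",
         "discount rate cut expected", "new trade agreement signed"]
labels = ["interest", "trade", "interest", "trade"]

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),               # word vectors, stopwords removed
    OneVsRestClassifier(                                  # one linear SVM per category,
        make_pipeline(                                    # with per-category feature selection
            SelectKBest(mutual_info_classif, k=5),        # k=300 in the slides; 5 for this toy corpus
            LinearSVC(),
        )
    ),
)
pipeline.fit(texts, labels)
print(pipeline.predict(["rate rise likely"]))
```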
46. An Example — Reuters (Trends & Controversies)
  - Category: interest.
  - Weight vector:
    - large positive weights: prime (0.70), rate (0.67), interest (0.63), rates (0.60), and discount (0.46);
    - large negative weights: group (−0.24), year (−0.25), sees (−0.33), world (−0.35), and dlrs (−0.71).
48. Text Categorization Results (Dumais et al., 1998)
49. Applying the Kernel to the Linear Classifier
  - Substitute the kernel into the hypothesis: f(x) = sign(Σ_i α_i y_i K(x_i, x) + b).
  - Substitute the kernel into the margin optimization: maximize W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j).
50. SVMs and PAC Learning
  - Theorems connect PAC theory to the size of the margin.
  - Basically, the larger the margin, the better the expected accuracy.
  - See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000.
51. PAC and the Number of Support Vectors
  - The fewer the support vectors, the better the generalization will be.
  - Recall that non-support vectors are correctly classified and do not change the learned model if left out of the training set.
  - So the expected generalization error is bounded by the expected number of support vectors divided by the number of training examples (a leave-one-out argument).
52. VC-Dimension of an SVM
  - Very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension of roughly (Diameter / Margin)², where
    - Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set, and
    - Margin is the smallest margin we will let the SVM use.
  - This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, the RBF kernel width σ, etc.
  - But most people just use cross-validation.
  - Copyright © 2001, 2003, Andrew W. Moore
53. Finding Non-Linear Separating Surfaces
  - Map inputs into a new space (a sketch of this mapping follows below).
    - Example: features x1, x2 with values 5, 4
    - become features x1, x2, x1², x2², x1·x2 with values 5, 4, 25, 16, 20.
  - Solve the SVM program in this new space.
    - Computationally complex if there are many features.
    - But a clever trick exists: the kernel computes these inner products implicitly.
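A minimal sketch of the explicit mapping in this example; the helper name quad_features is mine, not from the slides.

```python
import numpy as np

def quad_features(x):
    """Map (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2), as in the slide's example."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

print(quad_features((5, 4)))   # [ 5  4 25 16 20 ], matching the slide
# A linear separator learned in this 5-D space corresponds to a quadratic,
# i.e. non-linear, separating surface in the original 2-D input space.
```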
54. Summary
  - Maximize the margin between positive and negative examples (connects to PAC theory).
  - Non-linear classification.
  - Only the support vectors contribute to the solution.
  - Kernels implicitly map examples into a new, usually higher-dimensional, feature space via a non-linear mapping.
55. References
  - Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
  - Andrew W. Moore. cmsc726: SVMs. http://www.cs.cmu.edu/~awm/tutorials
  - C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
  - Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
  - Thorsten Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines. SIGIR 2001.
56. www.intsci.ac.cn/shizz/
  - Questions?!