Advanced Computing Seminar Data Mining and Its Industrial Applications — Chapter 8 — Support Vector Machines
Zhongzhi Shi, Markus Stumptner, Yalei Hao, Gerald Quirchmayr
Knowledge and Software Engineering Lab
Advanced Computing Research Centre
School of Computer and Information Science
University of South Australia
Outline
Introduction
Support Vector Machine
Non-linear Classification
SVM and PAC
Applications
Summary
History
SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis
SVMs introduced by Boser, Guyon, Vapnik in COLT-92
Initially popularized in the NIPS community, now an important and active area of machine learning research.
Special issues of Machine Learning Journal, and Journal of Machine Learning Research.
What is SVM?
SVMs are learning systems that
use a hypothesis space of linear functions
in a high-dimensional feature space (kernel functions),
are trained with a learning algorithm from optimization theory (Lagrangian methods),
and implement a learning bias derived from statistical learning theory (generalisation).
Binary classification is frequently performed by using a real-valued hypothesis function $f: X \subseteq \mathbb{R}^n \to \mathbb{R}$:
The input $x$ is assigned to the positive class if $f(x) \ge 0$,
and otherwise to the negative class.
The concept of Hyperplane
For a linearly separable binary training set, we can find at least one hyperplane $(w, b)$ that divides the space into two half-spaces.
The definition of a hyperplane: $\langle w, x \rangle + b = 0$
Tuning the Hyperplane (w,b)
The Perceptron Algorithm
Proposed by Frank Rosenblatt in 1957
Preliminary definition
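The functional margin of an example $(x_i, y_i)$ with respect to a hyperplane $(w, b)$: $\gamma_i = y_i(\langle w, x_i \rangle + b)$
$\gamma_i > 0$ implies correct classification of $(x_i, y_i)$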
The Perceptron Algorithm
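The number of mistakes is at most $(2R/\gamma)^2$, where $R = \max_i \|x_i\|$ and $\gamma$ is the geometric margin of a separating hyperplane (Novikoff's theorem)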
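The perceptron's mistake-driven update can be sketched as below. This is a minimal illustration, not the slide's own listing: the bias update scaled by $R^2$ follows the primal form given in Cristianini and Shawe-Taylor, and the toy data, the learning rate `eta`, and the `max_epochs` cap are assumptions made for the example.

```python
import math

def perceptron(samples, labels, eta=1.0, max_epochs=100):
    """Primal-form perceptron. Returns (w, b, number_of_mistakes)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    # R: radius of the smallest origin-centred ball containing the data.
    R = max(math.sqrt(sum(v * v for v in x)) for x in samples)
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in zip(samples, labels):
            # Functional margin of (x, y) under the current hyperplane.
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:  # misclassified (or on the boundary): update
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y * R * R
                mistakes += 1
                clean_pass = False
        if clean_pass:  # a full pass without mistakes: converged
            break
    return w, b, mistakes

# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w, b, mistakes = perceptron(X, y)
print(w, b, mistakes)
```

Reordering `X` and `y` can change the returned hyperplane, which is the order dependence that motivates the maximal margin approach.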
The Geometric margin
The Euclidean distance of an example $(x_i, y_i)$ from the decision boundary: $\gamma_i = y_i\left(\left\langle \frac{w}{\|w\|}, x_i \right\rangle + \frac{b}{\|w\|}\right)$
The Geometric margin
The margin of a training set S
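The two margin notions can be compared numerically. The sketch below assumes the usual definitions (functional margin $y_i(\langle w, x_i \rangle + b)$, geometric margin equal to the functional margin divided by $\|w\|$) and reports the smallest geometric margin of one fixed hyperplane over a training set. The hyperplane and the data are hypothetical.

```python
import math

def functional_margin(w, b, x, y):
    """y * (<w, x> + b); positive when (x, y) is classified correctly."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    """Signed Euclidean distance of x from the hyperplane (w, b)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm

def margin_of_hyperplane(w, b, samples, labels):
    """Smallest geometric margin of (w, b) over the training set."""
    return min(geometric_margin(w, b, x, y) for x, y in zip(samples, labels))

# Hypothetical hyperplane and data.
w, b = [3.0, 4.0], 0.0
X = [(2.0, 2.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
margin = margin_of_hyperplane(w, b, X, y)
print(margin)
```

Rescaling `w` and `b` by the same positive constant changes the functional margins but leaves `margin` unchanged, which is why the geometric margin is the quantity worth maximizing.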
Maximal Margin Hyperplane
A hyperplane realising the maximum geometric margin
The optimal linear classifier
A linear classifier is optimal if it forms the maximal margin hyperplane.
How to find the optimal solution?
The drawback of the perceptron algorithm
The algorithm may give a different solution depending on the order in which the examples are processed.
The superiority of SVM
This kind of learning machine tunes its solution using optimization theory.
The Maximal Margin Classifier
The simplest model of SVM
Finds the maximal margin hyperplane in a chosen kernel-induced feature space.
A convex optimization problem
Minimizing a quadratic function under linear inequality constraints
Support Vector Classifiers
Support vector machines
Cortes and Vapnik (1995)
well suited for high-dimensional data
binary classification
Training set
$D = \{(x_i, y_i),\ i = 1, \dots, n\}$, $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$
Linear discriminant classifier
Separating hyperplane
$\{x : g(x) = w^T x + w_0 = 0\}$
model parameters: $w \in \mathbb{R}^m$ and $w_0 \in \mathbb{R}$
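Formalizing the geometric margin
Assume that the functional margin is fixed to 1, so that $y_i(\langle w, x_i \rangle + b) \ge 1$ for every training example
The geometric margin is then $\gamma = 1 / \|w\|$
To find the maximum of $\gamma$, we must find the minimum of $\|w\|$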
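Minimizing the norm
Because $\|w\|^2 = \langle w, w \rangle$,
we can reformulate the optimization problem: minimize $\tfrac{1}{2}\langle w, w \rangle$ over $(w, b)$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1$, $i = 1, \dots, n$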
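Minimizing the norm
Use the Lagrangian function $L(w, b, \alpha) = \tfrac{1}{2}\langle w, w \rangle - \sum_{i=1}^{n} \alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right]$, with $\alpha_i \ge 0$
Setting the derivatives with respect to $w$ and $b$ to zero gives $w = \sum_i y_i \alpha_i x_i$ and $\sum_i y_i \alpha_i = 0$
Resubstituting into the primal gives the dual objective $W(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$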
Minimizing the norm
Finding the minimum of the primal problem is equivalent to finding the maximum of the dual problem
Strategies for minimizing a differentiable function
Decomposition
Sequential Minimal Optimization (SMO)
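The resubstitution step in the derivation can be written out term by term. The following is a sketch under the standard hard-margin formulation, assuming the Lagrangian $L(w, b, \alpha) = \tfrac{1}{2}\langle w, w \rangle - \sum_i \alpha_i [y_i(\langle w, x_i \rangle + b) - 1]$ with multipliers $\alpha_i \ge 0$. The notation follows Cristianini and Shawe-Taylor rather than the original slides, whose equations are not available.

```latex
\begin{align*}
\frac{\partial L}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_{i=1}^{n} y_i \alpha_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0
  \;\Rightarrow\; \sum_{i=1}^{n} y_i \alpha_i = 0 \\[4pt]
L(w, b, \alpha)
  &= \tfrac{1}{2}\langle w, w \rangle
     - \sum_{i} \alpha_i y_i \langle w, x_i \rangle
     - b \sum_{i} \alpha_i y_i
     + \sum_{i} \alpha_i \\
  &= \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle
     - \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle
     - 0
     + \sum_{i} \alpha_i \\
  &= \sum_{i} \alpha_i
     - \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle
   \;=\; W(\alpha)
\end{align*}
```

The dual problem is then to maximize $W(\alpha)$ subject to $\alpha_i \ge 0$ and $\sum_i y_i \alpha_i = 0$. The training data enter only through the inner products $\langle x_i, x_j \rangle$.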
The Support Vector
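The Karush-Kuhn-Tucker complementarity condition of the optimization problem states that $\alpha_i \left[ y_i(\langle w, x_i \rangle + b) - 1 \right] = 0$ for all $i$
This implies that $\alpha_i$ can be non-zero only for inputs $x_i$ whose functional margin is one,
that is, for the inputs that lie closest to the hyperplane
The corresponding inputs are called support vectors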
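The optimal hypothesis (w,b)
The two parameters can be obtained from the optimal multipliers $\alpha_i^*$: $w^* = \sum_i y_i \alpha_i^* x_i$ and $b^* = -\tfrac{1}{2}\left( \max_{y_i = -1} \langle w^*, x_i \rangle + \min_{y_i = 1} \langle w^*, x_i \rangle \right)$
The hypothesis is $f(x) = \sum_{i \in SV} y_i \alpha_i^* \langle x_i, x \rangle + b^*$, where SV is the set of support vectors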
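As a small check of how the hyperplane is recovered from the dual solution, the sketch below assumes the standard expressions $w = \sum_i y_i \alpha_i x_i$ and $b = -\tfrac{1}{2}(\max_{y_i=-1}\langle w, x_i \rangle + \min_{y_i=1}\langle w, x_i \rangle)$. The three-point data set and its multipliers `alphas` are hypothetical and were derived by hand for this toy case, not produced by a solver.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def recover_hyperplane(samples, labels, alphas):
    """Recover (w, b) from the dual variables of a hard-margin SVM."""
    dim = len(samples[0])
    w = [sum(a * y * x[k] for a, y, x in zip(alphas, labels, samples))
         for k in range(dim)]
    max_neg = max(dot(w, x) for x, y in zip(samples, labels) if y == -1)
    min_pos = min(dot(w, x) for x, y in zip(samples, labels) if y == 1)
    return w, -(max_neg + min_pos) / 2.0

def hypothesis(x, samples, labels, alphas, b):
    """Dual-form decision value; only support vectors (alpha > 0) contribute."""
    return sum(a * y * dot(xi, x)
               for a, y, xi in zip(alphas, labels, samples) if a > 0) + b

# Hypothetical toy problem: the third point is not a support vector.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # hand-derived for this data set

w, b = recover_hyperplane(X, y, alphas)
value = hypothesis((3.0, 3.0), X, y, alphas, b)
print(w, b, value)
```

The two points with non-zero multipliers have functional margin exactly one, while the third point has a larger margin and could be removed without changing `w` or `b`.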
Soft Margin Optimization
The main problem with the maximal margin classifier is that it always produces a perfectly consistent hypothesis,
that is, a hypothesis with no training error
Relax the boundary
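One common way to relax the boundary is the 1-norm soft margin, which introduces a slack $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$ for each example and minimizes $\tfrac{1}{2}\langle w, w \rangle + C \sum_i \xi_i$. The slides do not say which relaxation they use, so this formulation, the trade-off parameter `C`, and the data below are assumptions made for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slacks(w, b, samples, labels):
    """Slack of each example: how far it falls short of functional margin 1."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(samples, labels)]

def soft_margin_objective(w, b, samples, labels, C):
    """0.5 * ||w||^2 + C * (sum of slacks); C trades margin size against violations."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, samples, labels))

# Hypothetical data: the last two points violate the margin of (w, b).
X = [(1.0, 1.0), (-1.0, -1.0), (0.2, 0.2), (0.4, 0.4)]
y = [1, -1, 1, -1]
w, b = [0.5, 0.5], 0.0

xi = slacks(w, b, X, y)
objective = soft_margin_objective(w, b, X, y, C=1.0)
print(xi, objective)
```

A slack between 0 and 1 marks a point inside the margin but still on the correct side; a slack above 1 marks a training error.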
Non-linear Classification
The problem
The maximal margin classifier is an important concept, but it cannot be used in many real-world problems
There will in general be no linear separation in the feature space.
The solution
Map the data into another space in which it can be separated linearly.
A learning machine
A learning machine f takes an input x and transforms it, somehow using weights $\alpha$, into a predicted output $y^{est} = \pm 1$
$x \to f(x, \alpha) \to y^{est}$, where $\alpha$ is some vector of adjustable parameters
Some definitions
Given some machine f
And under the assumption that all training points (x k ,y k ) were drawn i.i.d from some distribution.
And under the assumption that future test points will be drawn from the same distribution
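Define
TESTERR$(\alpha) = E\left[ \tfrac{1}{2} |y - f(x, \alpha)| \right]$, the probability of misclassification (official terminology: the risk $R(\alpha)$)
TRAINERR$(\alpha) = \frac{1}{R} \sum_{k=1}^{R} \tfrac{1}{2} |y_k - f(x_k, \alpha)|$, the fraction of the training set misclassified (official terminology: the empirical risk $R^{emp}(\alpha)$)
R = number of training set data points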
Vapnik-Chervonenkis Dimension
Given some machine f, let h be its VC dimension.
h is a measure of f's power (h does not depend on the choice of training set)
Vapnik showed that with probability $1 - \eta$: TESTERR$(\alpha) \le$ TRAINERR$(\alpha) + \sqrt{\frac{h(\log(2R/h) + 1) - \log(\eta/4)}{R}}$
This gives us a way to estimate the error on future data based only on the training error and the VC dimension of f
Structural Risk Minimization
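Let $\Phi(f)$ = the set of functions representable by f.
Suppose $\Phi(f_1) \subseteq \Phi(f_2) \subseteq \dots \subseteq \Phi(f_n)$
Then $h(f_1) \le h(f_2) \le \dots \le h(f_n)$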
We’re trying to decide which machine to use.
We train each machine and make a table…
i | f_i | TRAINERR | VC-Conf | Probable upper bound on TESTERR | Choice
1 | f_1 |          |         |                                 |
2 | f_2 |          |         |                                 |
3 | f_3 |          |         |                                 |
4 | f_4 |          |         |                                 |
5 | f_5 |          |         |                                 |
6 | f_6 |          |         |                                 |
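Filling in such a table and picking the machine with the smallest probable upper bound can be sketched as follows. All numbers are hypothetical (the training errors and VC dimensions are invented to show the trade-off), and the VC confidence term uses the form from Burges's tutorial with natural logarithms.

```python
import math

def vc_confidence(h, R, eta=0.05):
    """VC confidence term for VC dimension h and R training points."""
    return math.sqrt((h * (math.log(2.0 * R / h) + 1.0) - math.log(eta / 4.0)) / R)

R = 1000  # hypothetical number of training points

# Hypothetical nested machines: (name, TRAINERR, VC dimension h).
machines = [
    ("f1", 0.30, 2),
    ("f2", 0.20, 5),
    ("f3", 0.12, 10),
    ("f4", 0.10, 40),
    ("f5", 0.08, 120),
    ("f6", 0.07, 400),
]

# One row per machine: (name, TRAINERR, VC-Conf, probable upper bound on TESTERR).
table = [(name, err, vc_confidence(h, R), err + vc_confidence(h, R))
         for name, err, h in machines]

choice = min(table, key=lambda row: row[3])
for name, err, conf, bound in table:
    marker = "<- choice" if name == choice[0] else ""
    print(f"{name}  {err:.2f}  {conf:.3f}  {bound:.3f}  {marker}")
```

Training error falls as the machines get more powerful while the VC confidence rises, so the minimum of their sum lies somewhere in the middle of the sequence.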
Kernel-Induced Feature Space
Mapping the data of space X into space F
Implicit Mapping into Feature Space
For a data set that is not linearly separable, we can modify the hypothesis so that the data is mapped implicitly into another feature space
Kernel Function
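A kernel is a function K such that for all $x, z \in X$, $K(x, z) = \langle \phi(x), \phi(z) \rangle$, where $\phi$ is a mapping from X to the feature space F
The benefits
Solves the computational problem of working with many dimensions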
Kernel function
The polynomial kernel $K(x, z) = \langle x, z \rangle^d$
This kind of kernel represents the inner product of two vectors (points) in a feature space whose dimension grows rapidly with the degree d.
For example
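A degree-2 polynomial kernel on two-dimensional inputs makes the implicit mapping concrete. The sketch assumes the homogeneous form $K(x, z) = \langle x, z \rangle^2$ and the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. Both are standard textbook choices, not necessarily the example shown on the original slide.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, z, d=2):
    """Homogeneous polynomial kernel <x, z>^d, computed in the input space."""
    return dot(x, z) ** d

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs (3-D feature space)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

# Hypothetical points.
x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, z)      # one inner product in 2-D, then a square
explicit = dot(phi(x), phi(z))    # inner product in the 3-D feature space
print(implicit, explicit)
```

Both values agree, but the kernel never builds `phi(x)`. For higher degrees and input dimensions the feature space becomes very large while the cost of evaluating the kernel stays that of one inner product.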
Text Categorization
Inductive learning
Input:
Output: f(x) = confidence(class)
In the case of text classification, the attributes are words in the document, and the classes are the categories.
Properties of Text-Classification Tasks
High-Dimensional Feature Space.
Sparse Document Vectors.
High Level of Redundancy.
Text representation and feature selection
Binary feature
Term frequency
Inverse document frequency: $IDF(w) = \log(n / DF(w))$
n is the total number of documents
DF(w) is the number of documents the word w occurs in
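The three representations can be computed on a tiny corpus as follows. The sketch assumes the common definition $IDF(w) = \log(n / DF(w))$ with natural logarithms and a raw-count term frequency. The three documents are invented for the example.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["interest", "rate", "rise"],
    ["bank", "rate", "cut", "rate"],
    ["group", "sees", "profit"],
]
n = len(docs)

# DF(w): number of documents in which the word occurs at least once.
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    """Inverse document frequency log(n / DF(w))."""
    return math.log(n / df[word])

def binary_features(doc):
    """Binary representation: 1 for every word present in the document."""
    return {word: 1 for word in set(doc)}

def tfidf(doc):
    """Term frequency (raw count) weighted by inverse document frequency."""
    tf = Counter(doc)
    return {word: count * idf(word) for word, count in tf.items()}

vector = tfidf(docs[1])
print(binary_features(docs[1]))
print(vector)
```

Words that occur in many documents, such as "rate" here, get a lower IDF than words confined to a single document, so the resulting vectors stay sparse and emphasize discriminative terms.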
Learning SVMs
To learn the vector of feature weights
Linear SVMs
Polynomial classifiers
Radial basis functions
Processing
Text files are processed to produce a vector of words
Select the 300 words with the highest mutual information with each category (after removing stopwords)
A separate classifier is learned for each category.
An example: Reuters (Trends & Controversies)
Category : interest
Weight vector
large positive weights: prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46)
large negative weights: group (–.24), year (–.25), sees (–.33), world (–.35), and dlrs (–.71)
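With a linear SVM, scoring a document against the "interest" category reduces to summing the weights of the words it contains. The sketch below uses only the ten weights listed above. The binary word features, the zero bias, and the two sample documents are assumptions, since the full 300-word weight vector and the threshold are not given.

```python
# Weights for the "interest" category, as listed above (a partial weight vector).
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60, "discount": 0.46,
    "group": -0.24, "year": -0.25, "sees": -0.33, "world": -0.35, "dlrs": -0.71,
}
bias = 0.0  # hypothetical: the learned threshold is not given in the source

def score(words):
    """Linear decision value <w, x> + b for a document with binary word features."""
    return sum(weights.get(word, 0.0) for word in set(words)) + bias

# Hypothetical documents.
doc_a = ["bank", "cuts", "prime", "rate", "and", "discount", "rate"]
doc_b = ["group", "sees", "profit", "of", "dlrs", "this", "year"]

score_a = score(doc_a)
score_b = score(doc_b)
print(score_a, score_b)
```

A positive score assigns the document to the category and a negative score rejects it. Words outside the weight vector contribute nothing.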
Text Categorization Results Dumais et al. (1998)
Applying the kernel to the linear classifier
Substitute the kernel for the inner product in the hypothesis
Substitute the kernel for the inner product in the margin optimization
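Substituting a kernel into the dual-form hypothesis gives $f(x) = \sum_i y_i \alpha_i K(x_i, x) + b$. The sketch below applies this to the XOR problem, which no hyperplane in the input space can separate. The homogeneous degree-2 kernel, the multipliers $\alpha_i = 1/8$, and $b = 0$ are assumptions worked out by hand for this toy case.

```python
def poly_kernel(x, z, d=2):
    """Homogeneous polynomial kernel <x, z>^d."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** d

def decision(x, samples, labels, alphas, b, kernel):
    """Kernelized hypothesis: sum_i y_i * alpha_i * K(x_i, x) + b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, labels, samples)) + b

# XOR-style data: the label is the negated product of the two coordinates.
X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alphas = [0.125, 0.125, 0.125, 0.125]  # hand-derived for this toy problem
b = 0.0

outputs = [decision(x, X, y, alphas, b, poly_kernel) for x in X]
new_point = decision((0.5, -2.0), X, y, alphas, b, poly_kernel)
print(outputs, new_point)
```

Every training point is reproduced with functional margin one, so all four are support vectors, and the resulting decision function equals $-x_1 x_2$, a non-linear boundary in the input space obtained with purely linear machinery in the feature space.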
SVMs and PAC Learning
Theorems connect PAC theory to the size of the margin
Basically, the larger the margin, the better the expected accuracy
See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000
PAC and the Number of Support Vectors
The fewer the support vectors, the better the generalization will be
Recall, non-support vectors are
Correctly classified
Don’t change the learned model if left out of the training set
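So the leave-one-out error rate is at most (number of support vectors) / (number of training examples)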
VC-dimension of an SVM
Very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension in terms of the ratio of Diameter to Margin,
where
Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set.
Margin is the smallest margin we’ll let the SVM use
This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, the RBF width $\sigma$, etc.
Summary
Maximize the margin between positive and negative examples (connects to PAC theory)
Non-linear Classification
Only the support vectors contribute to the solution
Kernels map examples into a new, usually non-linear space
References
Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Andrew W. Moore. cmsc726: SVMs. http://www.cs.cmu.edu/~awm/tutorials
C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Thorsten Joachims. A Statistical Learning Model of Text Classification for Support Vector Machines. Proceedings of SIGIR, 2001.