Introduction to Support Vector Machines
Colin Campbell,
Bristol University
1
Outline of talk.
Part 1. An Introduction to SVMs
1.1. SVMs for binary classification.
1.2. Soft margins and multi-class classification.
1.3. SVMs for regression.
2
Part 2. General kernel methods
2.1 Kernel methods based on linear programming and
other approaches.
2.2 Training SVMs and other kernel machines.
2.3 Model Selection.
2.4 Different types of kernels.
2.5. SVMs in practice
3
Advantages of SVMs
• A principled approach to classification, regression or novelty detection tasks.
• SVMs exhibit good generalisation.
• The hypothesis has an explicit dependence on the data (via the support vectors), so the model can readily be interpreted.
4
• Learning involves optimisation of a convex function (no false minima, unlike a neural network).
• Few parameters are required for tuning the learning machine (unlike a neural network, where the architecture and various parameters must be found).
• Confidence measures, etc. can be implemented.
5
1.1 SVMs for Binary Classification.
Preliminaries: Consider a binary classification problem:
input vectors are xi and yi = ±1 are the targets or labels.
The index i labels the pattern pairs (i = 1, . . . , m).
The xi define a space of labelled points called input space.
6
From the perspective of statistical learning theory the
motivation for considering binary classifier SVMs comes
from theoretical bounds on the generalization error.
These generalization bounds have two important features:
7
1. the upper bound on the generalization error does not
depend on the dimensionality of the space.
2. the bound is minimized by maximizing the margin, γ,
i.e. the minimal distance between the hyperplane
separating the two classes and the closest datapoints of
each class.
8
Figure: the separating hyperplane and the margin between the two classes.
9
In an arbitrary-dimensional space a separating
hyperplane can be written:
w · x + b = 0
where b is the bias and w the weight vector.
Thus we will consider a decision function of the form:
D(x) = sign (w · x + b)
10
We note that D(x) is invariant under a rescaling:
w → λw, b → λb (for λ > 0).
We will implicitly fix a scale with:
11
w · x + b = 1
w · x + b = −1
for the support vectors (canonical hyperplanes).
12
Thus:
w · (x1 − x2) = 2
for two support vectors x1 and x2 on either side of the
separating hyperplane.
13
The margin will be given by the projection of the vector
(x1 − x2) onto the unit normal to the hyperplane, w/||w||,
which equals 2/||w||. This is the distance between the two
canonical hyperplanes, so the margin is γ = 1/||w||.
14
Figure: the separating hyperplane and margin, with the two canonical hyperplanes (labelled 1 and 2) passing through the support vectors.
15
Maximisation of the margin is thus equivalent to
minimisation of the functional:
Φ(w) = (1/2) (w · w)
subject to the constraints:
yi [(w · xi) + b] ≥ 1
16
Thus the task is to find an optimum of the primal
objective function:
L(w, b, α) = (1/2) (w · w) − Σ_{i=1}^{m} αi [yi ((w · xi) + b) − 1]
Solving the saddle point equation ∂L/∂b = 0 gives:
Σ_{i=1}^{m} αi yi = 0
17
and ∂L/∂w = 0 gives:
w = Σ_{i=1}^{m} αi yi xi
which, when substituted back into L(w, b, α), tells us that
we should maximise the functional (the Wolfe dual):
18
W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj (xi · xj)
subject to the constraints:
αi ≥ 0
(the αi are Lagrange multipliers)
19
and:
Σ_{i=1}^{m} αi yi = 0
The decision function is then:
D(z) = sign ( Σ_{j=1}^{m} αj yj (xj · z) + b )
20
So far we don’t know how to handle non-separable
datasets.
To answer this problem we will exploit the first of the two
points mentioned above: the theoretical generalisation bounds
do not depend on the dimensionality of the space.
21
For the dual objective function we notice that the
datapoints, xi, only appear inside an inner product.
To get a better representation of the data we can
therefore map the datapoints into an alternative higher
dimensional space through a replacement:
xi · xj → φ(xi) · φ(xj)
22
i.e. we have used a mapping xi → φ(xi).
This higher dimensional space is called Feature Space and
must be a Hilbert Space (the concept of an inner product
applies).
23
The function φ(xi) · φ(xj) will be called a kernel, so:
φ(xi) · φ(xj) = K(xi, xj)
The kernel is therefore the inner product between
mapped pairs of points in Feature Space.
24
What is the mapping relation?
In fact we do not need to know the form of this mapping
since it will be implicitly defined by the functional form
of the mapped inner product in Feature Space.
25
Different choices for the kernel K(x, x′) define different
Hilbert spaces to use. A large number of Hilbert spaces
are possible, with RBF kernels:
K(x, x′) = exp(−||x − x′||² / 2σ²)
and polynomial kernels:
K(x, x′) = (⟨x, x′⟩ + 1)^d
being popular choices.
26
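A minimal NumPy sketch of these two kernels; the bandwidth σ and the degree d are illustrative choices:

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, d=3):
    # K(x, x') = (<x, x'> + 1)^d
    return (np.dot(x, z) + 1.0) ** d

x = np.array([1.0, 2.0])
z = np.array([1.5, 1.0])
print(rbf_kernel(x, z, sigma=1.0), polynomial_kernel(x, z, d=2))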
Legitimate kernels are not restricted to closed-form
functions: they may, for example, be defined by an algorithm.
(1) String kernels
Consider the following text strings car, cat, cart, chart.
They have certain similarities despite being of unequal
length. dug is of equal length to car and cat but differs
in all letters.
27
A measure of the degree of alignment of two text strings
or sequences can be found using algorithmic techniques,
e.g. edit distances computed by dynamic programming.
Frequent applications in:
• bioinformatics, e.g. the Smith-Waterman and
Waterman-Eggert algorithms (strings over the 4-letter
alphabet {A, C, G, T}, e.g. ...ACCGTATGTAAA..., from genomes),
• text processing (e.g. the WWW), etc.
28
(2) Kernels for graphs/networks
Diffusion kernel (Kondor and Lafferty)
29
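A minimal sketch of a diffusion kernel on a toy graph, assuming the usual construction K = exp(−βL) with L the graph Laplacian; the 4-node path graph and the value of β are illustrative choices:

import numpy as np
from scipy.linalg import expm

# Adjacency matrix of a small undirected path graph on 4 nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
beta = 0.5                          # diffusion parameter (illustrative)
K = expm(-beta * L)                 # diffusion kernel as a matrix exponential
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > 0)  # symmetric, positive definite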
Mathematically, valid kernel functions should satisfy
Mercer’s condition:
30
For any g(x) for which:
∫ g(x)² dx < ∞
it must be the case that:
∫∫ K(x, x′) g(x) g(x′) dx dx′ ≥ 0
A simple criterion is that the kernel should be positive
semi-definite.
31
Theorem: If a kernel is positive semi-definite, i.e.:
Σ_{i,j} K(xi, xj) ci cj ≥ 0
where {c1, . . . , cn} are real numbers, then there exists a
function φ(x) defining an inner product of possibly higher
dimension, i.e.:
K(x, y) = φ(x) · φ(y)
32
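In practice, a candidate kernel can be checked empirically by testing whether a sample Gram matrix is (numerically) positive semi-definite; a minimal sketch:

import numpy as np

def is_valid_gram(K, tol=1e-8):
    # A Gram matrix is valid if it is symmetric and its eigenvalues
    # are (numerically) non-negative.
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.randn(20, 3)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq_dists / 2.0)     # RBF Gram matrix with sigma = 1
print(is_valid_gram(K_rbf))         # True: the RBF kernel is positive semi-definite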
Thus the following steps are used to train an SVM:
1. Choose a kernel function K(xi, xj).
2. Maximise:
W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj K(xi, xj)
subject to αi ≥ 0 and Σ_{i=1}^{m} αi yi = 0.
33
3. The bias b is found as follows:
b = −(1/2) [ min_{i | yi=+1} Σ_{j=1}^{m} αj yj K(xi, xj) + max_{i | yi=−1} Σ_{j=1}^{m} αj yj K(xi, xj) ]
34
4. The optimal αi go into the decision function:
D(z) = sign ( Σ_{i=1}^{m} αi yi K(xi, z) + b )
35
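For small problems these steps can be followed literally with a generic constrained optimiser. The sketch below uses SciPy's SLSQP on toy data, with the bias taken from a single support vector via the KKT conditions rather than the min/max formula above; it is an illustration only, since practical SVM software uses specialised solvers such as SMO:

import numpy as np
from scipy.optimize import minimize

def train_hard_margin_svm(X, y, kernel):
    m = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    Q = (y[:, None] * y[None, :]) * K              # Q_ij = y_i y_j K(x_i, x_j)

    def neg_dual(a):                               # minimise -W(alpha)
        return 0.5 * a @ Q @ a - a.sum()

    res = minimize(neg_dual, np.zeros(m), method='SLSQP',
                   bounds=[(0.0, None)] * m,                              # alpha_i >= 0
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])  # sum_i alpha_i y_i = 0
    alpha = res.x

    # Bias from a support vector (alpha_sv > 0): y_sv (sum_j alpha_j y_j K_j,sv + b) = 1.
    sv = np.argmax(alpha)
    b = y[sv] - np.sum(alpha * y * K[:, sv])

    def decision(z):
        return np.sign(sum(alpha[j] * y[j] * kernel(X[j], z) for j in range(m)) + b)
    return decision

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
predict = train_hard_margin_svm(X, y, lambda u, v: float(u @ v))   # linear kernel
print([predict(x) for x in X])    # should recover the training labels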
1.2. Soft Margins and Multi-class problems
Multi-class classification: Many problems involve
multiclass classification and a number of schemes have
been outlined.
One of the simplest schemes is to use a directed acyclic
graph (DAG) with the learning task reduced to binary
classification at each node:
36
Figure: a DAG for a three-class problem; each node is a binary classifier (1/3 at the root, then 1/2 and 2/3), and the leaves give the predicted class 1, 2 or 3.
37
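A minimal sketch of the DAG evaluation idea, using scikit-learn binary SVMs as the node classifiers; the toy data, labels and kernel are illustrative choices:

import numpy as np
from sklearn.svm import SVC

# Toy 3-class data: three Gaussian blobs labelled 1, 2 and 3.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + c for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([1, 2, 3], 20)

# One binary SVM per pair of classes.
pairwise = {}
for a, b in [(1, 2), (1, 3), (2, 3)]:
    mask = np.isin(y, [a, b])
    pairwise[(a, b)] = SVC(kernel='rbf').fit(X[mask], y[mask])

def dag_predict(x, classes=(1, 2, 3)):
    # Walk the DAG: each node runs one binary classifier and eliminates
    # one candidate class until a single label remains.
    remaining = list(classes)
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = pairwise[(min(a, b), max(a, b))].predict(x.reshape(1, -1))[0]
        remaining.remove(b if winner == a else a)
    return remaining[0]

print(dag_predict(np.array([4.0, 0.5])))   # expect class 2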
Soft margins: Most real-life datasets contain noise and
an SVM can fit this noise, leading to poor generalization.
The effect of outliers and noise can be reduced by
introducing a soft margin. Two schemes are commonly
used:
38
L1 error norm:
0 ≤ αi ≤ C
L2 error norm:
K(xi, xi) ← K(xi, xi) + λ
39
The justification for these soft margin techniques comes
from statistical learning theory.
They can be readily viewed as a relaxation of the hard
margin constraint.
42
For the L1 error norm (prior to introducing kernels) we
introduce a positive slack variable ξi:
yi (w · xi + b) ≥ 1 − ξi
and minimize the sum of errors Σ_{i=1}^{m} ξi in addition to ||w||²:
min (1/2) w · w + C Σ_{i=1}^{m} ξi
43
This is readily formulated as a primal objective function:
L(w, b, α, ξ) = (1/2) w · w + C Σ_{i=1}^{m} ξi − Σ_{i=1}^{m} αi [yi (w · xi + b) − 1 + ξi] − Σ_{i=1}^{m} ri ξi
with Lagrange multipliers αi ≥ 0 and ri ≥ 0.
44
The derivatives with respect to w, b and ξ give:
∂L/∂w = w − Σ_{i=1}^{m} αi yi xi = 0    (1)
∂L/∂b = Σ_{i=1}^{m} αi yi = 0    (2)
∂L/∂ξi = C − αi − ri = 0    (3)
45
Resubstituting these back into the primal objective function,
we obtain the same dual objective function as before.
However, ri ≥ 0 and C − αi − ri = 0, hence αi ≤ C and
the constraint 0 ≤ αi is replaced by 0 ≤ αi ≤ C.
46
Patterns with values 0 < αi < C will be referred to as
non-bound and those with αi = 0 or αi = C will be said
to be at bound.
The optimal value of C must be found by
experimentation using a validation set and it cannot be
readily related to the characteristics of the dataset or
model.
47
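A typical way to carry out this experimentation is a cross-validated grid search over C; a minimal scikit-learn sketch with an illustrative candidate grid and toy data:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},   # candidate values of C
                    cv=5)                                        # 5-fold cross-validation
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)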
In an alternative approach, ν-SVM, it can be shown that
solutions for an L1-error norm are the same as those
obtained from maximizing:
W(α) = −(1/2) Σ_{i,j=1}^{m} αi αj yi yj K(xi, xj)
48
subject to:
Σ_{i=1}^{m} yi αi = 0,    Σ_{i=1}^{m} αi ≥ ν,    0 ≤ αi ≤ 1/m    (4)
where ν lies in the range 0 to 1.
49
The fraction of training errors is upper bounded by ν and
ν also provides a lower bound on the fraction of points
which are support vectors.
Hence in this formulation the conceptual meaning of the
soft margin parameter is more transparent.
50
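A minimal sketch using scikit-learn's ν-SVM implementation; nu = 0.1 and the toy data are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = NuSVC(nu=0.1, kernel='rbf').fit(X, y)
# nu lower-bounds the fraction of support vectors and upper-bounds
# the fraction of margin errors.
print(len(clf.support_) / len(y))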
For the L2 error norm the primal objective function is:
L(w, b, α, ξ) = (1/2) w · w + C Σ_{i=1}^{m} ξi² − Σ_{i=1}^{m} αi [yi (w · xi + b) − 1 + ξi] − Σ_{i=1}^{m} ri ξi    (5)
with αi ≥ 0 and ri ≥ 0.
51
After obtaining the derivatives with respect to w, b and
ξ, substituting for w and ξ in the primal objective
function, and noting that the dual objective function is
maximal when ri = 0, we obtain the following dual
objective function after kernel substitution:
W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} yi yj αi αj K(xi, xj) − (1/4C) Σ_{i=1}^{m} αi²
52
With λ = 1/2C this gives the same dual objective
function as for hard margin learning except for the
substitution:
K(xi, xi) ← K(xi, xi) + λ
53
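This substitution is easy to apply to a precomputed Gram matrix; a minimal sketch:

import numpy as np

def l2_soft_margin_kernel(K, C):
    # K(x_i, x_i) <- K(x_i, x_i) + lambda, with lambda = 1/(2C).
    K_soft = K.copy()
    K_soft[np.diag_indices_from(K_soft)] += 1.0 / (2.0 * C)
    return K_soft

K = np.ones((3, 3))                     # toy Gram matrix
print(l2_soft_margin_kernel(K, C=1.0))  # diagonal shifted by 0.5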
For many real-life datasets there is an imbalance between
the amount of data in different classes, or the significance
of the data in the two classes can be quite different. The
relative balance between the detection rate for different
classes can be easily shifted by introducing asymmetric
soft margin parameters.
54
Thus for binary classification with an L1 error norm:
0 ≤ αi ≤ C+ for yi = +1, and
0 ≤ αi ≤ C− for yi = −1.
55
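A minimal sketch of asymmetric soft margins in scikit-learn, where class_weight rescales C per class (so C+ = w+ C and C− = w− C); the weights and toy data are illustrative:

import numpy as np
from sklearn.svm import SVC

X = np.random.randn(100, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
# class_weight rescales C per class: here C+ = 5.0 and C- = 1.0,
# making errors on the positive class five times more costly.
clf = SVC(kernel='rbf', C=1.0, class_weight={1: 5.0, -1: 1.0}).fit(X, y)
print(clf.score(X, y))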
1.3. Solving Regression Tasks with SVMs
ε-SV regression (Vapnik): not the only possible approach.
Statistical learning theory: we use the constraints
yi − w · xi − b ≤ ε and w · xi + b − yi ≤ ε to allow for
some deviation ε between the eventual targets yi and the
function f(x) = w · x + b modelling the data.
56
As before we minimize ||w||² to penalise overcomplexity.
To account for training errors we also introduce slack
variables ξi, ξi* for the two types of training error.
These slack variables are zero for points inside the tube
and progressively increase for points outside the tube
according to the loss function used.
57
Figure 1: A linear ε-insensitive loss function L, equal to zero on the interval [−ε, ε] and increasing linearly outside it.
58
For a linear ε-insensitive loss function the task is
therefore to minimize:
min ||w||² + C Σ_{i=1}^{m} (ξi + ξi*)
59
subject to
yi − w · xi − b ≤ ε + ξi
(w · xi + b) − yi ≤ ε + ξi*
where the slack variables are both positive.
60
Deriving the dual objective function we get:
W(α, α*) = Σ_{i=1}^{m} yi (αi − αi*) − ε Σ_{i=1}^{m} (αi + αi*) − (1/2) Σ_{i,j=1}^{m} (αi − αi*)(αj − αj*) K(xi, xj)
61
which is maximized subject to:
Σ_{i=1}^{m} αi = Σ_{i=1}^{m} αi*
and:
0 ≤ αi ≤ C,    0 ≤ αi* ≤ C
62
The function modelling the data is then:
f(z) = Σ_{i=1}^{m} (αi − αi*) K(xi, z) + b
63
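A minimal sketch of ε-SV regression on toy 1-D data using scikit-learn; C, ε and the RBF bandwidth (gamma) are illustrative choices:

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)        # noisy sine curve
model = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print(model.predict(np.array([[np.pi / 2]])))       # should be close to sin(pi/2) = 1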
Similarly we can define many other types of loss
functions.
We can define a linear programming approach to
regression.
We can use many other approaches to regression (e.g.
kernelising least squares).
64
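As one example of kernelising least squares, kernel ridge regression solves (K + λI)α = y and predicts with f(z) = Σ_i αi K(xi, z); a minimal sketch with an illustrative linear kernel and regularisation λ:

import numpy as np

def kernel_least_squares(K_train, y, K_test_train, lam=1e-2):
    # alpha = (K + lam I)^{-1} y,  f(z) = sum_i alpha_i K(x_i, z)
    alpha = np.linalg.solve(K_train + lam * np.eye(len(y)), y)
    return K_test_train @ alpha

X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 3.0 * X.ravel() + 0.05 * np.random.randn(20)
K = X @ X.T                                     # linear kernel Gram matrix
K_test = np.array([[0.5]]) @ X.T                # kernel values between a test point and the data
print(kernel_least_squares(K, y, K_test, lam=1e-3))   # approximately 1.5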