MIS 720 Electronic Business and Big Data
Infrastructures
Lecture 07: Support Vector Machines
Dr. Xialu Liu
Management Information Systems Department
San Diego State University
Classification Problems
Classification is the problem of identifying which of a set of categories
an observation belongs to.
Given a feature matrix X and a qualitative response Y taking values
in the set S, the classification task is to build a function C(X) that
takes as input the feature vector X and predicts a value for Y; i.e.
C(X) ∈ S.
Often we are more interested in estimating the probabilities that Y
belongs to each category in S.
For example, it is more valuable to have an estimate of the probability
that a credit cardholder will default than a bare prediction of whether or
not the cardholder defaults.
Classification Problems
Here the response variable Y is qualitative - e.g. email is one of S = {spam, ham}
(ham = good email), digit class is one of S = {0, 1, ..., 9}. Our goals are to:
Build a classifier C(X) that assigns a class label from S to a future
unlabeled observation.
Assess the uncertainty in each classification
Understand the roles of the different predictors among
X = (X1, X2, . . . , Xp).
Example: Credit Card Default
We are interested in predicting whether an individual will default on his or her credit
card payment, on the basis of annual income and monthly credit card balance. The Default
data set contains annual income and monthly credit card balance for a subset of
10,000 individuals.
Blue circles represent ”not default” and orange pluses represent ”default”.
Overview
We discuss support vector machine (SVM), an approach for classification that
was developed in the computer science community in the 1990s and that has
grown in popularity since then.
Maximal margin classifier: it is simple and elegant but it requires that the
classes be separable by a linear boundary.
Support vector classifier: it can be applied in a broader range of cases.
Support vector machine: it accommodates non-linear class boundaries.
Note: people often loosely refer to the maximal margin classifier, the support vector
classifier, and the support vector machine as ”support vector machines”.
To avoid confusion, we will carefully distinguish between these three notions in
this lecture.
Support Vector Machines
Here we approach the two-class classification problem in a direct way:
We try and find a hyperplane that separates the classes in feature space.
If we cannot, we get creative in two ways:
We soften what we mean by ”separates”, and
We enrich and enlarge the feature space so that separation is possible.
Hyperplane
A hyperplane in p dimensions is a flat affine subspace of dimension p − 1.
For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace;
in other words, a line. A hyperplane is
β0 + β1X1 + β2X2 = 0
If a point X = (X1, X2) satisfies β0 + β1X1 + β2X2 = 0, it lies on the line; if it satisfies
β0 + β1X1 + β2X2 > 0, it lies above the line; if it satisfies β0 + β1X1 + β2X2 < 0, it
lies below the line.
Hyperplane in 2 Dimensions
The following plot shows a hyperplane −6 + β1X1 + β2X2 = 0 in a 2-dimensional space
(the blue line).
At one point above the line, −6 + β1X1 + β2X2 = 1.6; at another point below the
line, −6 + β1X1 + β2X2 = −4.
To determine whether a point is above or below the line, we calculate the inner
product of X = (X1, X2) and β = (β1, β2):
⟨X, β⟩ = β1X1 + β2X2
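As a quick illustration (not part of the original slides), here is a minimal NumPy sketch that checks on which side of the line a point falls by looking at the sign of β0 + ⟨X, β⟩; the coefficient values β1 = 2 and β2 = 3 are assumed purely for the example, since the slide leaves them symbolic.

```python
import numpy as np

# Hypothetical coefficients: beta0 = -6, beta = (beta1, beta2) = (2, 3).
# These numbers are assumptions for illustration only.
beta0 = -6.0
beta = np.array([2.0, 3.0])

points = np.array([
    [2.0, 1.2],   # a point to test
    [0.5, 0.5],   # another point to test
])

# f(X) = beta0 + <X, beta>; the sign tells us on which side of the line X lies.
f = beta0 + points @ beta
for x, value in zip(points, f):
    side = "on the line" if value == 0 else ("above (f > 0)" if value > 0 else "below (f < 0)")
    print(x, f"f(X) = {value:.2f}", side)
```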
Hyperplane in 3 Dimensions
For three dimensions, a hyperplane is a flat two-dimensional plane. A hyperplane is
β0 + β1X1 + β2X2 + β3X3 = 0
If a point X = (X1, X2, X3) satisfies β0 + β1X1 + β2X2 + β3X3 = 0, it lies on
the plane; if it satisfies β0 + β1X1 + β2X2 + β3X3 > 0, it lies above the plane; if
it satisfies β0 + β1X1 + β2X2 + β3X3 < 0, it lies below the plane.
To determine whether a point is above or below the plane, we calculate
the inner product of X = (X1, X2, X3) and β = (β1, β2, β3):
⟨X, β⟩ = β1X1 + β2X2 + β3X3
Hyperplane
In general the equation for a hyperplane has the form
β0 + β1X1 + β2X2 + . . . + βpXp = 0
It is hard to visualize the hyperplane, but the notion of a (p − 1)-dimensional flat
subspace still applies.
It divides the p-dimensional space into two halves.
If f(X) = β0 + β1X1 + . . . + βpXp, then f(X) > 0 for points on one side of the
hyperplane, and f(X) < 0 for points on the other.
The vector β = (β1, β2, . . . , βp) is called the normal vector - it points in a
direction orthogonal to the surface of a hyperplane.
Separating Hyperplanes
If we code the colored points as Yi = +1 for blue, say, and Yi = −1 for purple,
then if Yi · f(Xi) > 0 for all i, f(X) = 0 defines a separating hyperplane.
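A short sketch of this separation check on a made-up toy data set (both the data and the candidate coefficients are assumptions for illustration): compute f(Xi) for every observation and verify that Yi · f(Xi) > 0 holds for all i.

```python
import numpy as np

# Toy labeled data (made up for illustration): two features, labels +1 / -1.
X = np.array([[3.0, 3.0], [4.0, 2.5], [3.5, 4.0],   # class +1
              [1.0, 1.0], [0.5, 2.0], [1.5, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A candidate hyperplane f(X) = beta0 + <X, beta>; coefficients chosen by hand.
beta0, beta = -4.5, np.array([1.0, 1.0])

f = beta0 + X @ beta
separates = np.all(y * f > 0)          # Yi * f(Xi) > 0 for all i?
print("y * f(x):", y * f)
print("separating hyperplane:", separates)
```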
Maximal Margin Classifier
Suppose it is possible to construct a hyperplane that separates the data perfectly
according to their labels. Label observations from the blue class with yi = 1 and
from the purple class with yi = −1, then
β0 + β1xi1 + β2xi2 + . . . + βpxip > 0, if yi = 1,
and
β0 + β1xi1 + β2xi2 + . . . + βpxip < 0, if yi = −1.
This means a separating hyperplane has the property that
yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) > 0
for all i = 1, . . . , n.
Separating Hyperplanes
If the data can be perfectly separated using a hyperplane, then there exist an
infinite number of such hyperplanes, as shown in the plot.
Which one should we choose?
Maximal Margin Classifier
A natural choice is the maximal margin hyperplane, which is the separating hyperplane
that is farthest from the training observations.
We compute the distance from each data point to a given separating hyperplane; the
smallest such distance is known as the margin.
The maximal margin hyperplane is the one for which the margin is largest.
Maximal Margin Classifier
Among all separating hyperplanes, find the one that makes the biggest gap or margin
between the two classes.
The constraint yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) ≥ M guarantees that each observation
will be on the correct side of the hyperplane, provided that M is positive.
M represents the margin of the hyperplane, and the optimization problem chooses
β0, β1, . . . , βp to maximize M.
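To make the margin concrete, here is a minimal NumPy sketch (data and hyperplanes are made-up assumptions) that computes the margin of a candidate separating hyperplane as the smallest distance from the points to the hyperplane, y·f(x)/‖β‖:

```python
import numpy as np

# Toy two-class data (made up for illustration).
X = np.array([[3.0, 3.0], [4.0, 2.5], [3.5, 4.0],
              [1.0, 1.0], [0.5, 2.0], [1.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

def margin(beta0, beta, X, y):
    """Smallest signed distance from the points to the hyperplane f(X) = 0.

    If the hyperplane separates the classes, every term is positive and the
    minimum is the margin M; with ||beta|| = 1 the distance is simply y * f(x).
    """
    beta = np.asarray(beta, dtype=float)
    f = beta0 + X @ beta
    return np.min(y * f / np.linalg.norm(beta))

# Two hand-picked candidate hyperplanes; the larger printed value is the larger margin.
print(margin(-4.5, [1.0, 1.0], X, y))
print(margin(-5.0, [1.0, 1.2], X, y))
```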
Maximal Margin Classifier
We see three data points are equidistant from the maximal margin hyperplane and
lie along the dashed lines indicating the width of the margin.
They are known as support vectors, since they ”support” the maximal margin
hyperplane in the sense that if they were moved slightly then the hyperplane would
move as well.
It is interesting that the maximal margin hyperplane depends directly on the
support vectors but not on the other observations.
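The lecture contains no code, but as an illustrative sketch, scikit-learn's SVC with a linear kernel and a very large penalty parameter approximates the maximal margin (hard-margin) classifier on separable data and exposes the support vectors directly. The synthetic data below are assumed, not from the slides.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, (almost surely) linearly separable two-class data -- an assumption.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)

# A very large penalty makes scikit-learn's soft-margin SVC behave (nearly)
# like the maximal margin classifier when the classes are separable.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
print("hyperplane: beta =", clf.coef_[0], ", beta0 =", clf.intercept_[0])
```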
Non-separable Data
The data on the left are not separable by a linear boundary.
We cannot exactly separate the two classes.
Noisy Data
Sometimes the data are separable, but noisy. This can lead to a poor solution for
the maximal margin classifier.
We want to consider a classifier that does not perfectly separate the two classes, in
the interest of
Greater robustness to individual observations
Better classification of most of the data.
It could be worthwhile to misclassify a few data points in order to do a better job
in classifying the remaining data.
Rather than seeking the largest possible margin so that every observation is on the
correct side of the hyperplane, we instead allow some observations to be on the
incorrect side of the margin, or even the hyperplane.
The support vector classifier maximizes a soft margin.
The margin is soft, because it can be violated by some data points.
Support Vector Classifier
maximize M over β0, β1, . . . , βp, ε1, . . . , εn
subject to
∑_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) ≥ M(1 − εi),
εi ≥ 0,   ∑_{i=1}^n εi ≤ C.
C is a nonnegative tuning parameter.
M is the width of the margin, and we seek to make it as large as possible.
ε1, . . . , εn are slack variables that allow individual observations to be on the wrong
side of the margin or the hyperplane.
If εi = 0, the i-th observation is on the correct side of the margin.
If 0 < εi < 1, then the i-th observation is on the wrong side of the margin,
and we say the i-th observation has violated the margin.
If εi > 1, it is on the wrong side of the hyperplane.
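As a hedged sketch of the soft margin in code: scikit-learn's SVC solves an equivalent formulation, but note that its C parameter penalizes violations, so a large scikit-learn C means few violations, roughly the opposite direction of the budget C in the slides. In scikit-learn's scaling the margin boundaries sit at decision values ±1, so margin violations can be counted from decision_function. The synthetic data are assumed for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (non-separable) synthetic data -- an assumption for illustration.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)
y = np.where(y == 0, -1, 1)

# NOTE: scikit-learn's C is a penalty on margin violations, not the budget C
# used in the slides; a SMALL sklearn C corresponds to a LARGE budget (more
# tolerance of violations), and vice versa.
for penalty in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    f = clf.decision_function(X)
    margin_violations = np.sum(y * f < 1)   # points on the wrong side of the margin
    misclassified = np.sum(y * f < 0)       # points on the wrong side of the hyperplane
    print(f"penalty C={penalty:>7}: support vectors={len(clf.support_)}, "
          f"margin violations={margin_violations}, misclassified={misclassified}")
```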
C is a regularization parameter.
If C = 0, there is no budget for violations of the margin, so ε1 = ε2 = . . . = εn = 0.
If C > 0, no more than C observations can be on the wrong side of the hyperplane.
As C increases, we become more tolerant of violations.
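In practice the tuning parameter is chosen by cross-validation. A small sketch of one way to do this with scikit-learn (assumed synthetic data; again, scikit-learn's penalty-style C runs in the opposite direction of the budget C above):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data (assumption); in practice use your own training set.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=2)

# 5-fold cross-validation over a grid of penalty values.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best penalty C:", grid.best_params_["C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```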
Linear Boundary Can Fail
Sometimes a linear boundary simply won’t work, no matter what the value of C.
What to do?
Feature Expansion
Enlarge the space of features by including transformations; e.g. X1², X1³, X1X2,
X1X2², . . . . Hence go from a p-dimensional space to a space with dimension
greater than p.
Fit a support vector classifier in the enlarged space.
This results in non-linear decision boundaries in the original space.
Example: Suppose we use (X1, X2, X1², X2², X1X2) instead of just (X1, X2). Then the
decision boundary would be of the form
β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 = 0
This leads to nonlinear decision boundaries in the original space (quadratic conic
sections).
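A minimal sketch of the feature-expansion idea (my own illustration with scikit-learn; the circular toy data are assumed): fit a linear support vector classifier on (X1, X2, X1², X2², X1X2), which gives a quadratic boundary in the original two features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data with a non-linear (circular) class boundary -- an assumed example.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

def expand(X):
    """Map (X1, X2) to (X1, X2, X1^2, X2^2, X1*X2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

svc_raw = SVC(kernel="linear", C=1.0).fit(X, y)               # linear boundary
svc_expanded = SVC(kernel="linear", C=1.0).fit(expand(X), y)  # quadratic boundary

print("training accuracy, original features:", svc_raw.score(X, y))
print("training accuracy, expanded features:", svc_expanded.score(expand(X), y))
```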
Cubic Polynomial
Feature Expansion
Polynomials (especially high-dimensional ones) get wild rather fast.
There is a more elegant and controlled way to introduce nonlinearities in support
vector classifiers - through the use of kernels.
Before we discuss these, we must understand the role of inner products in support
vector classifiers.
We have not discussed exactly how the support vector classifier is computed because
the details are quite technical. But it turns out that the solution involves only the
inner products of the data.
Inner Products and Support Vectors
Inner product between vectors:
⟨xi, xi′⟩ = ∑_{j=1}^p xij xi′j
The linear support vector classifier can be represented as
f(x) = β0 + ∑_{i=1}^n αi⟨x, xi⟩   (n parameters)   (1)
To estimate the parameters α1, . . . , αn and β0, all we need are the (n choose 2) = n(n − 1)/2
inner products ⟨xi, xi′⟩ between all pairs of training observations.
It turns out that most of the α̂i are zero; α̂i is nonzero for the support vectors only.
f(x) = β0 + ∑_{i∈S} α̂i⟨x, xi⟩   (2)
S is the support set of indices i such that α̂i > 0. (2) involves far fewer terms than (1).
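To make the "all we need are inner products" point concrete, here is a hedged sketch using scikit-learn's precomputed-kernel interface: the classifier is handed only the n × n matrix of inner products ⟨xi, xi′⟩, never the raw features. The synthetic data are an assumption for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic training and test data (assumption for illustration).
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.5, random_state=3)
X_train, y_train, X_test = X[:40], y[:40], X[40:]

# Gram matrix of inner products between all pairs of training observations.
gram_train = X_train @ X_train.T            # shape (n, n)
clf = SVC(kernel="precomputed", C=1.0).fit(gram_train, y_train)

# Only the support vectors receive nonzero coefficients alpha_i.
print("support vector indices:", clf.support_)
print("number of nonzero dual coefficients:", clf.dual_coef_.shape[1])

# Prediction also needs only inner products between test and training points.
gram_test = X_test @ X_train.T
print("test predictions:", clf.predict(gram_test))
```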
Kernels and Support Vector Machines
If we can compute inner products between observations, we can fit a SV classifier.
Here we replace the inner product with a generalization of the form
K(xi, xi′),
where K is some function that we will refer to as a kernel. A kernel is a function
that quantifies the similarity of two observations.
For example,
K(xi, xi′) = ∑_{j=1}^p xij xi′j,   (3)
which gives us back the support vector classifier. (3) is known as a linear kernel.
We can also choose the following kernel, which is called the polynomial kernel of degree d:
K(xi, xi′) = (1 + ∑_{j=1}^p xij xi′j)^d   (4)
It gives a much more flexible decision boundary, because it amounts to fitting a support
vector classifier in a higher-dimensional space.
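A short sketch of a degree-d polynomial kernel in scikit-learn (assumed synthetic data): with gamma=1 and coef0=1, SVC's polynomial kernel (gamma·⟨x, x′⟩ + coef0)^degree matches the form in (4).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic non-linear two-class data (assumption for illustration).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial kernel of degree d = 3: K(x, x') = (1 + <x, x'>)^3
# (scikit-learn computes (gamma*<x, x'> + coef0)^degree, so set gamma=1, coef0=1).
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X_train, y_train)
linear_svc = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

print("test accuracy, linear kernel:    ", round(linear_svc.score(X_test, y_test), 3))
print("test accuracy, polynomial kernel:", round(poly_svm.score(X_test, y_test), 3))
```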
Kernels and Support Vector Machines
When the support vector classifier is combined with a non-linear kernel such as
(4), the resulting classifier is known as a support vector machine. In this case, the
classifier can be written as
f(x) = β0 + ∑_{i∈S} αiK(x, xi).
Radial Kernel
K(xi, xi′) = exp(−γ ∑_{j=1}^p (xij − xi′j)²),   where γ > 0.
f(x) = β0 + ∑_{i∈S} α̂iK(x, xi)
Advantage: Implicit feature space; very high dimensional.
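A hedged sketch of the radial kernel in scikit-learn (synthetic data assumed): SVC's "rbf" kernel is exactly exp(−γ ∑j (xij − xi′j)²), with γ set via the gamma argument; larger γ gives a more local, wiggly boundary.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data with a circular class boundary (assumption for illustration).
X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=1)

# gamma controls how local the kernel is: large gamma -> very wiggly boundary.
for gamma in [0.1, 1.0, 10.0]:
    rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:>4}: training accuracy = {rbf_svm.score(X, y):.3f}, "
          f"support vectors = {len(rbf_svm.support_)}")

# Sanity check that the kernel really is exp(-gamma * ||x - x'||^2):
x1, x2, gamma = X[0], X[1], 1.0
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
print("manual RBF kernel value between two points:", round(manual, 4))
```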
Heart Test Data
We apply the support vector machines to the Heart data. These data contain a binary
outcome HD for 303 patients who presented with chest pain. An outcome value of Yes
indicates the presence of heart disease and No means no heart disease. The aim is to use
13 predictors such as Age, Sex, and Chol (a cholesterol measurement) in order to predict
whether an individual has heart disease.
The ROC curve is obtained by changing the threshold 0 to a threshold t in f̂(X) > t, and
recording the false positive and true positive rates as t varies. Here we see ROC curves on
test data.
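The Heart data set is not reproduced here, so as a hedged sketch the snippet below traces ROC curves for an SVM on a synthetic stand-in with 13 predictors: the decision values f̂(X) come from decision_function, and the false and true positive rates are recorded as the threshold t varies.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary outcome with 13 predictors, standing in for the Heart data
# (an assumption; the real data set is not reproduced here).
X, y = make_classification(n_samples=303, n_features=13, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)

# Decision values f_hat(X) on the test set; vary the threshold t to trace the ROC curve.
scores = svm.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("first few (FPR, TPR) points:", list(zip(fpr[:3].round(2), tpr[:3].round(2))))
print("test AUC:", round(roc_auc_score(y_test, scores), 3))
```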
Summary
Maximal margin classifier
Support vector classifier
Support vector machine