MIS 720 Electronic Business and Big Data
Infrastructures
Lecture 07: Support Vector Machines
Dr. Xialu Liu
Management Information Systems Department
San Diego State University
Classification Problems
Classification is the problem of identifying which of a set of categories
an observation belongs to.
Given a feature matrix X and a qualitative response Y taking values
in the set S, the classification task is to build a function C(X) that
takes as input the feature vector X and predicts a value for Y; i.e.
C(X) ∈ S.
Often we are more interested in estimating the probabilities that Y
belongs to each category in S.
For example, it is more valuable to have an estimate of the probability
that a credit cardholder will default than a bare prediction of whether or
not the cardholder defaults.
Classification Problems
Here the response variable Y is qualitative - e.g. email is one of S = {spam, ham}
(ham = good email), digit class is one of S = {0, 1, ..., 9}. Our goals are to:
Build a classifier C(X) that assigns a class label from S to a future
unlabeled observation.
Assess the uncertainty in each classification
Understand the roles of the different predictors among
X = (X1, X2, . . . , Xp).
Example: Credit Card Default
We are interested in predicting whether an individual will default on his or her credit
card payment, on the basis of annual income and monthly credit card balance. The Default
data set contains annual income and monthly credit card balance for a subset of
10,000 individuals.
Blue circles represent ”not default” and orange pluses represent ”default”.
Overview
We discuss support vector machine (SVM), an approach for classification that
was developed in the computer science community in the 1990s and that has
grown in popularity since then.
Maximal margin classifier: it is simple and elegant but it requires that the
classes be separable by a linear boundary.
Support vector classifier: it can be applied in a broader range of cases.
Support vector machine: it accommodates non-linear class boundaries.
Note: people often loosely refer to the maximal margin classifier, the support vector
classifier, and the support vector machine as ”support vector machines”.
To avoid confusion, we will carefully distinguish between these three notions in
this lecture.
Support Vector Machines
Here we approach the two-class classification problem in a direct way:
We try and find a hyperplane that separates the classes in feature space.
If we cannot, we get creative in two ways:
We soften what we mean by ”separates”, and
We enrich and enlarge the feature space so that separation is possible.
Hyperplane
A hyperplane in p dimensions is a flat affine subspace of dimension p − 1.
For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace;
in other words, a line. A hyperplane is
β0 + β1X1 + β2X2 = 0
If a point X = (X1, X2) satisfies β0 + β1X1 + β2X2 = 0, it lies on the line; if it satisfies
β0 + β1X1 + β2X2 > 0, it lies above the line; if it satisfies β0 + β1X1 + β2X2 < 0, it
lies below the line.
Hyperplane in 2 Dimensions
The following plot shows a hyperplane −6 + β1X1 + β2X2 = 0 in a 2-dimensional space
(the blue line).
At one point above the line, −6 + β1X1 + β2X2 = 1.6; at another point below the
line, −6 + β1X1 + β2X2 = −4.
To determine whether a point is above or below the line, we calculate the inner
product of X = (X1, X2) and β = (β1, β2):
⟨X, β⟩ = β1X1 + β2X2
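As a quick illustration (not part of the original slides), here is a minimal NumPy sketch that checks on which side of the line a point falls by looking at the sign of β0 + ⟨X, β⟩; the coefficient values β1 = 2 and β2 = 3 are assumed purely for the example, since the slide leaves them symbolic.

```python
import numpy as np

# Hypothetical coefficients: beta0 = -6, beta = (beta1, beta2) = (2, 3).
# These numbers are assumptions for illustration only.
beta0 = -6.0
beta = np.array([2.0, 3.0])

points = np.array([
    [2.0, 1.2],   # a point to test
    [0.5, 0.5],   # another point to test
])

# f(X) = beta0 + <X, beta>; the sign tells us on which side of the line X lies.
f = beta0 + points @ beta
for x, value in zip(points, f):
    side = "on the line" if value == 0 else ("above (f > 0)" if value > 0 else "below (f < 0)")
    print(x, f"f(X) = {value:.2f}", side)
```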
Hyperplane in 3 Dimensions
For three dimensions, a hyperplane is a flat two-dimensional plane. A hyperplane is
β0 + β1X1 + β2X2 + β3X3 = 0
If a point X = (X1, X2, X3) satisfies β0 + β1X1 + β2X2 + β3X3 = 0, it lies on
the plane; if it satisfies β0 + β1X1 + β2X2 + β3X3 > 0, it lies above the plane; if
it satisfies β0 + β1X1 + β2X2 + β3X3 < 0, it lies below the plane.
To determine whether a point is above or below the plane, we calculate
the inner product of X = (X1, X2, X3) and β = (β1, β2, β3):
⟨X, β⟩ = β1X1 + β2X2 + β3X3
Hyperplane
In general the equation for a hyperplane has the form
β0 + β1X1 + β2X2 + . . . + βpXp = 0
It is hard to visualize the hyperplane, but the notion of a (p − 1)-dimensional flat
subspace still applies.
It divides the p-dimensional space into two halves.
If f(X) = β0 + β1X1 + . . . + βpXp, then f(X) > 0 for points on one side of the
hyperplane, and f(X) < 0 for points on the other.
The vector β = (β1, β2, . . . , βp) is called the normal vector - it points in a
direction orthogonal to the surface of a hyperplane.
Separating Hyperplanes
If we code the colored points as Yi = +1 for blue, say, and Yi = −1 for purple,
then if Yi · f(Xi) > 0 for all i, f(X) = 0 defines a separating hyperplane.
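A short sketch of this separation check on a made-up toy data set (both the data and the candidate coefficients are assumptions for illustration): compute f(Xi) for every observation and verify that Yi · f(Xi) > 0 holds for all i.

```python
import numpy as np

# Toy labeled data (made up for illustration): two features, labels +1 / -1.
X = np.array([[3.0, 3.0], [4.0, 2.5], [3.5, 4.0],   # class +1
              [1.0, 1.0], [0.5, 2.0], [1.5, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A candidate hyperplane f(X) = beta0 + <X, beta>; coefficients chosen by hand.
beta0, beta = -4.5, np.array([1.0, 1.0])

f = beta0 + X @ beta
separates = np.all(y * f > 0)          # Yi * f(Xi) > 0 for all i?
print("y * f(x):", y * f)
print("separating hyperplane:", separates)
```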
Maximal Margin Classifier
Suppose it is possible to construct a hyperplane that separates the data perfectly
according to their labels. Label observations from the blue class with yi = 1 and
from the purple class with yi = −1, then
β0 + β1xi1 + β2xi2 + . . . + βpxip > 0, if yi = 1,
and
β0 + β1xi1 + β2xi2 + . . . + βpxip < 0, if yi = −1.
This means a separating hyperplane has the property that
yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) > 0
for all i = 1, . . . , n.
Separating Hyperplanes
If the data can be perfectly separated using a hyperplane, then there exist an
infinite number of such hyperplanes, as shown in the plot.
Which one should we choose?
Maximal Margin Classifier
A natural choice is the maximal margin hyperplane, which is the separating hyperplane
that is farthest from the training observations.
We compute the distance from each data point to a given separating hyperplane; the
smallest such distance is known as the margin.
The maximal margin hyperplane is the one for which the margin is largest.
Maximal Margin Classifier
Among all separating hyperplanes, find the one that makes the biggest gap or margin
between the two classes.
The constraint yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) ≥ M guarantees that each observation
will be on the correct side of the hyperplane, provided that M is positive.
M represents the margin of the hyperplane, and the optimization problem chooses
β0, β1, . . . , βp to maximize M.
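To make the margin concrete, here is a minimal NumPy sketch (data and hyperplanes are made-up assumptions) that computes the margin of a candidate separating hyperplane as the smallest distance from the points to the hyperplane, y·f(x)/‖β‖:

```python
import numpy as np

# Toy two-class data (made up for illustration).
X = np.array([[3.0, 3.0], [4.0, 2.5], [3.5, 4.0],
              [1.0, 1.0], [0.5, 2.0], [1.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

def margin(beta0, beta, X, y):
    """Smallest signed distance from the points to the hyperplane f(X) = 0.

    If the hyperplane separates the classes, every term is positive and the
    minimum is the margin M; with ||beta|| = 1 the distance is simply y * f(x).
    """
    beta = np.asarray(beta, dtype=float)
    f = beta0 + X @ beta
    return np.min(y * f / np.linalg.norm(beta))

# Two hand-picked candidate hyperplanes; the larger printed value is the larger margin.
print(margin(-4.5, [1.0, 1.0], X, y))
print(margin(-5.0, [1.0, 1.2], X, y))
```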
Maximal Margin Classifier
We see three data points are equidistant from the maximal margin hyperplane and
lie along the dashed lines indicating the width of the margin.
They are known as support vectors, since they ”support” the maximal margin
hyperplane in the sense that if they were moved slightly then the hyperplane would
move as well.
It is interesting that the maximal margin hyperplane depends directly on the
support vectors but not on the other observations.
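The lecture contains no code, but as an illustrative sketch, scikit-learn's SVC with a linear kernel and a very large penalty parameter approximates the maximal margin (hard-margin) classifier on separable data and exposes the support vectors directly. The synthetic data below are assumed, not from the slides.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, (almost surely) linearly separable two-class data -- an assumption.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)

# A very large penalty makes scikit-learn's soft-margin SVC behave (nearly)
# like the maximal margin classifier when the classes are separable.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
print("hyperplane: beta =", clf.coef_[0], ", beta0 =", clf.intercept_[0])
```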
Non-separable Data
The data on the left are not separable by a linear boundary.
We cannot exactly separate the two classes.
Noisy Data
Sometimes the data are separable, but noisy. This can lead to a poor solution for
the maximal margin classifier.
We want to consider a classifier that does not perfectly separate the two classes, in
the interest of
Greater robustness to individual observations
Better classification of most of the data.
It could be worthwhile to misclassify a few data points in order to do a better job
in classifying the remaining data.
Rather than seeking the largest possible margin so that every observation is on the
correct side of the hyperplane, we instead allow some observations to be on the
incorrect side of the margin, or even the hyperplane.
The support vector classifier maximizes a soft margin.
The margin is soft, because it can be violated by some data points.
Support Vector Classifier
maximize M over β0, β1, . . . , βp, ε1, . . . , εn
subject to
∑_{j=1}^p βj² = 1,
yi(β0 + β1xi1 + β2xi2 + . . . + βpxip) ≥ M(1 − εi),
εi ≥ 0,   ∑_{i=1}^n εi ≤ C.
C is a nonnegative tuning parameter.
M is the width of the margin, and we seek to make it as large as possible.
ε1, . . . , εn are slack variables that allow individual observations to be on the wrong
side of the margin or the hyperplane.
If εi = 0, the i-th observation is on the correct side of the margin.
If 0 < εi < 1, then the i-th observation is on the wrong side of the margin,
and we say the i-th observation has violated the margin.
If εi > 1, it is on the wrong side of the hyperplane.
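As a hedged sketch of the soft margin in code: scikit-learn's SVC solves an equivalent formulation, but note that its C parameter penalizes violations, so a large scikit-learn C means few violations, roughly the opposite direction of the budget C in the slides. In scikit-learn's scaling the margin boundaries sit at decision values ±1, so margin violations can be counted from decision_function. The synthetic data are assumed for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping (non-separable) synthetic data -- an assumption for illustration.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)
y = np.where(y == 0, -1, 1)

# NOTE: scikit-learn's C is a penalty on margin violations, not the budget C
# used in the slides; a SMALL sklearn C corresponds to a LARGE budget (more
# tolerance of violations), and vice versa.
for penalty in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    f = clf.decision_function(X)
    margin_violations = np.sum(y * f < 1)   # points on the wrong side of the margin
    misclassified = np.sum(y * f < 0)       # points on the wrong side of the hyperplane
    print(f"penalty C={penalty:>7}: support vectors={len(clf.support_)}, "
          f"margin violations={margin_violations}, misclassified={misclassified}")
```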
C is a regularization parameter.
If C = 0, there is no budget for violations of the margin, so ε1 = ε2 = . . . = εn = 0.
If C > 0, no more than C observations can be on the wrong side of the hyperplane.
As C increases, we become more tolerant of violations.
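In practice the tuning parameter is chosen by cross-validation. A small sketch of one way to do this with scikit-learn (assumed synthetic data; again, scikit-learn's penalty-style C runs in the opposite direction of the budget C above):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data (assumption); in practice use your own training set.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=2)

# 5-fold cross-validation over a grid of penalty values.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best penalty C:", grid.best_params_["C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```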
Linear Boundary Can Fail
Sometimes a linear boundary simply won’t work, no matter what the value of C.
What to do?
Feature Expansion
Enlarge the space of features by including transformations; e.g. X1², X1³, X1X2,
X1X2², . . . . Hence go from a p-dimensional space to a space with dimension
greater than p.
Fit a support vector classifier in the enlarged space.
This results in non-linear decision boundaries in the original space.
Example: Suppose we use (X1, X2, X1², X2², X1X2) instead of just (X1, X2). Then the
decision boundary would be of the form
β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 = 0
This leads to nonlinear decision boundaries in the original space (quadratic conic
sections).
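A minimal sketch of the feature-expansion idea (my own illustration with scikit-learn; the circular toy data are assumed): fit a linear support vector classifier on (X1, X2, X1², X2², X1X2), which gives a quadratic boundary in the original two features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data with a non-linear (circular) class boundary -- an assumed example.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

def expand(X):
    """Map (X1, X2) to (X1, X2, X1^2, X2^2, X1*X2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

svc_raw = SVC(kernel="linear", C=1.0).fit(X, y)               # linear boundary
svc_expanded = SVC(kernel="linear", C=1.0).fit(expand(X), y)  # quadratic boundary

print("training accuracy, original features:", svc_raw.score(X, y))
print("training accuracy, expanded features:", svc_expanded.score(expand(X), y))
```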
Cubic Polynomial
Feature Expansion
Polynomials (especially high-dimensional ones) get wild rather fast.
There is a more elegant and controlled way to introduce nonlinearities in support
vector classifiers - through the use of kernels.
Before we discuss these, we must understand the role of inner products in support
vector classifiers.
We have not discussed exactly how the support vector classifier is computed because
the details are quite technical. But it turns out that the solution involves only the
inner products of the data.
Inner Products and Support Vectors
Inner product between vectors:
⟨xi, xi′⟩ = ∑_{j=1}^p xij xi′j
The linear support vector classifier can be represented as
f(x) = β0 + ∑_{i=1}^n αi⟨x, xi⟩   (n parameters)   (1)
To estimate the parameters α1, . . . , αn and β0, all we need are the (n choose 2) = n(n − 1)/2
inner products ⟨xi, xi′⟩ between all pairs of training observations.
It turns out that most of the α̂i are zero; α̂i is nonzero for the support vectors only.
f(x) = β0 + ∑_{i∈S} α̂i⟨x, xi⟩   (2)
S is the support set of indices i such that α̂i > 0. (2) involves far fewer terms than (1).
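To make the "all we need are inner products" point concrete, here is a hedged sketch using scikit-learn's precomputed-kernel interface: the classifier is handed only the n × n matrix of inner products ⟨xi, xi′⟩, never the raw features. The synthetic data are an assumption for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic training and test data (assumption for illustration).
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.5, random_state=3)
X_train, y_train, X_test = X[:40], y[:40], X[40:]

# Gram matrix of inner products between all pairs of training observations.
gram_train = X_train @ X_train.T            # shape (n, n)
clf = SVC(kernel="precomputed", C=1.0).fit(gram_train, y_train)

# Only the support vectors receive nonzero coefficients alpha_i.
print("support vector indices:", clf.support_)
print("number of nonzero dual coefficients:", clf.dual_coef_.shape[1])

# Prediction also needs only inner products between test and training points.
gram_test = X_test @ X_train.T
print("test predictions:", clf.predict(gram_test))
```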
Kernels and Support Vector Machines
If we can compute inner products between observations, we can fit a SV classifier.
Here we replace the inner product with a generalization of the form
K(xi, xi′),
where K is some function that we will refer to as a kernel. A kernel is a function
that quantifies the similarity of two observations.
For example,
K(xi, xi′) = ∑_{j=1}^p xij xi′j,   (3)
which gives us back the support vector classifier. (3) is known as a linear kernel.
We can also choose the following kernel, which is called the polynomial kernel of degree d:
K(xi, xi′) = (1 + ∑_{j=1}^p xij xi′j)^d   (4)
It gives a much more flexible decision boundary, because it amounts to fitting a support
vector classifier in a higher-dimensional space.
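A short sketch of a degree-d polynomial kernel in scikit-learn (assumed synthetic data): with gamma=1 and coef0=1, SVC's polynomial kernel (gamma·⟨x, x′⟩ + coef0)^degree matches the form in (4).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic non-linear two-class data (assumption for illustration).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial kernel of degree d = 3: K(x, x') = (1 + <x, x'>)^3
# (scikit-learn computes (gamma*<x, x'> + coef0)^degree, so set gamma=1, coef0=1).
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X_train, y_train)
linear_svc = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

print("test accuracy, linear kernel:    ", round(linear_svc.score(X_test, y_test), 3))
print("test accuracy, polynomial kernel:", round(poly_svm.score(X_test, y_test), 3))
```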
Kernels and Support Vector Machines
When the support vector classifier is combined with a non-linear kernel such as
(4), the resulting classifier is known as a support vector machine. In this case, the
classifier can be written as
f(x) = β0 + ∑_{i∈S} αiK(x, xi).
Radial Kernel
K(xi, xi′) = exp(−γ ∑_{j=1}^p (xij − xi′j)²),   where γ > 0.
f(x) = β0 + ∑_{i∈S} α̂iK(x, xi)
Advantage: Implicit feature space; very high dimensional.
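A hedged sketch of the radial kernel in scikit-learn (synthetic data assumed): SVC's "rbf" kernel is exactly exp(−γ ∑j (xij − xi′j)²), with γ set via the gamma argument; larger γ gives a more local, wiggly boundary.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data with a circular class boundary (assumption for illustration).
X, y = make_circles(n_samples=300, factor=0.4, noise=0.15, random_state=1)

# gamma controls how local the kernel is: large gamma -> very wiggly boundary.
for gamma in [0.1, 1.0, 10.0]:
    rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:>4}: training accuracy = {rbf_svm.score(X, y):.3f}, "
          f"support vectors = {len(rbf_svm.support_)}")

# Sanity check that the kernel really is exp(-gamma * ||x - x'||^2):
x1, x2, gamma = X[0], X[1], 1.0
manual = np.exp(-gamma * np.sum((x1 - x2) ** 2))
print("manual RBF kernel value between two points:", round(manual, 4))
```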
Heart Test Data
We apply the support vector machines to the Heart data. These data contain a binary
outcome HD for 303 patients who presented with chest pain. An outcome value of Yes
indicates the presence of heart disease and No means no heart disease. The aim is to use
13 predictors such as Age, Sex, and Chol (a cholesterol measurement) in order to predict
whether an individual has heart disease.
The ROC curve is obtained by changing the threshold 0 to a threshold t in f̂(X) > t, and
recording the false positive and true positive rates as t varies. Here we see ROC curves on
test data.
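The Heart data set is not reproduced here, so as a hedged sketch the snippet below traces ROC curves for an SVM on a synthetic stand-in with 13 predictors: the decision values f̂(X) come from decision_function, and the false and true positive rates are recorded as the threshold t varies.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary outcome with 13 predictors, standing in for the Heart data
# (an assumption; the real data set is not reproduced here).
X, y = make_classification(n_samples=303, n_features=13, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)

# Decision values f_hat(X) on the test set; vary the threshold t to trace the ROC curve.
scores = svm.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("first few (FPR, TPR) points:", list(zip(fpr[:3].round(2), tpr[:3].round(2))))
print("test AUC:", round(roc_auc_score(y_test, scores), 3))
```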
Summary
Maximal margin classifier
Support vector classifier
Support vector machine