2. ECE 8443: Lecture 16, Slide 1
Generative Models
• Thus far we have essentially considered techniques that perform classification
indirectly: we model the training data, optimize the parameters of that model,
and then classify by choosing the closest model. This approach is known as a
generative model. Training such models with supervised learning assumes we
know the form of the underlying density function, which is often not true in
real applications.
• Convergence in maximum likelihood does
not guarantee optimal classification.
• Gaussian MLE modeling tends to
overfit data.
[Figure: class-conditional densities with the ML decision boundary (MLE Gaussian) compared to the optimal decision boundary.]
[Figure: class-dependent PCA illustrating discrimination.]
• Real data often not separable by hyperplanes.
• Goal: balance representation and discrimination
in a common framework (rather than alternating
between the two).
3. ECE 8443: Lecture 16, Slide 2
Risk Minimization
• The expected risk can be defined as:
)
,
(
)
,
(
( y
x
x
y dP
f
R
• Empirical risk is defined as:
$R_{emp}(f) = \frac{1}{l} \sum_{i=1}^{l} \frac{1}{2}\,\lvert f(\mathbf{x}_i) - y_i \rvert$
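As a concrete illustration (a minimal sketch with made-up labels and predictions), the empirical risk above is just the average of the per-sample loss $\frac{1}{2}|f(\mathbf{x}_i) - y_i|$, which for labels in {−1, +1} reduces to the error rate:

```python
# Empirical risk R_emp(f) = (1/l) * sum_i 0.5 * |f(x_i) - y_i|.
# For y_i in {-1, +1}, each term is 0 for a correct prediction
# and 1 for an error, so R_emp is simply the error rate.

def empirical_risk(predictions, labels):
    l = len(labels)
    return sum(0.5 * abs(f_x - y) for f_x, y in zip(predictions, labels)) / l

# Hypothetical predictions against true labels: one error out of four.
y_true = [+1, -1, +1, -1]
y_pred = [+1, -1, -1, -1]
print(empirical_risk(y_pred, y_true))  # 0.25
```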
• These are related by the Vapnik-Chervonenkis (VC) dimension:
$R(f) \le R_{emp}(f) + \phi\!\left(\frac{h}{l}\right), \quad \text{where} \quad \phi\!\left(\frac{h}{l}\right) = \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}$
$\phi(h/l)$ is referred to as the VC confidence, and $\eta$ is a confidence measure in
the range [0,1].
• The VC dimension, h, is a measure of the capacity of the learning machine.
• The principle of structural risk minimization (SRM) involves finding the subset
of functions that minimizes the bound on the actual risk.
• Optimal hyperplane classifiers achieve zero empirical risk for linearly
separable data.
• A Support Vector Machine is an approach that gives the least upper bound on
the risk.
[Figure: the bound on the expected risk is the sum of the empirical risk and the confidence in the risk, plotted against VC dimension; the optimum minimizes the bound.]
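A quick numerical sketch of the VC confidence term (the values of h, l, and η below are illustrative choices, not from the source):

```python
import math

def vc_confidence(h, l, eta):
    """VC confidence: sqrt((h*(log(2l/h) + 1) - log(eta/4)) / l)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# The confidence term shrinks as the sample size l grows
# relative to the capacity h -- the bound tightens with more data.
for l in (100, 1000, 10000):
    print(l, round(vc_confidence(h=10, l=l, eta=0.05), 3))
```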
4. ECE 8443: Lecture 16, Slide 3
Large-Margin Classification
• Hyperplanes C0 – C2 achieve perfect classification
(zero empirical risk):
C0 is optimal in terms of generalization.
The data points that define the boundary
are called support vectors.
[Figure: class 1 and class 2 separated by candidate hyperplanes C1, C0 (the optimal classifier), and C2; margin hyperplanes H1 and H2; normal vector w and offset b measured from the origin.]
A hyperplane can be defined by: $\mathbf{w} \cdot \mathbf{x} + b = 0$
We will impose the constraints: $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0$
The data points that satisfy the equality are
called support vectors.
• Support vectors are found using a constrained optimization:
$L_p = \frac{1}{2}\lVert \mathbf{w} \rVert^2 - \sum_{i=1}^{N} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{w} + b) + \sum_{i=1}^{N} \alpha_i$
• The final classifier is computed using the support vectors and the weights:
$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b$
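For a tiny 1-D problem with x₁ = −1 (y₁ = −1) and x₂ = +1 (y₂ = +1), the multipliers can be worked by hand: both points are support vectors, α₁ = α₂ = 0.5 and b = 0 (from w = Σᵢ αᵢyᵢxᵢ = 1 and Σᵢ αᵢyᵢ = 0). The sketch below simply evaluates the final classifier with those hand-derived values:

```python
# Hand-worked 1-D example: support vectors x1 = -1 (y = -1), x2 = +1 (y = +1).
# From w = sum_i alpha_i * y_i * x_i = 1 and sum_i alpha_i * y_i = 0,
# we get alpha_1 = alpha_2 = 0.5 and b = 0.

support_vectors = [-1.0, +1.0]
labels = [-1, +1]
alphas = [0.5, 0.5]
b = 0.0

def f(x):
    """Decision value f(x) = sum_i alpha_i y_i (x_i . x) + b (linear kernel)."""
    return sum(a * y * (sv * x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

print(f(-2.0), f(0.5))  # -2.0 0.5: negative => class -1, positive => class +1
```

Note that f(±1) = ±1, i.e., the support vectors sit exactly on the margin.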
5. ECE 8443: Lecture 16, Slide 4
[Figure: Class 1 and Class 2 with a soft margin and slack variables.]
Soft-Margin Classification
• In practice, the number of support vectors will grow unacceptably large for
real problems with large amounts of data.
• Also, the system will be very sensitive to mislabeled training data or outliers.
• Solution: introduce “slack variables” $\xi_i \ge 0$,
or a soft margin.
This gives the system the ability to
ignore data points near the boundary,
and effectively pushes the margin
towards the centroid of the training data.
• This is now a constrained optimization
with an additional constraint:
$y_i(\mathbf{x}_i \cdot \mathbf{w} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
• The solution to this problem can still be found using Lagrange multipliers.
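Once w and b are fixed, each slack has the closed form ξᵢ = max(0, 1 − yᵢ(w·xᵢ + b)). A minimal 1-D sketch (the data, w, b, and C below are hypothetical) computing the soft-margin objective ½‖w‖² + C Σᵢ ξᵢ:

```python
def soft_margin_objective(w, b, points, labels, C):
    """0.5*||w||^2 + C*sum_i xi_i, with xi_i = max(0, 1 - y_i*(w*x_i + b)) (1-D)."""
    slacks = [max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(points, labels)]
    return 0.5 * w * w + C * sum(slacks), slacks

# One outlier-like point (x = 0.2, y = -1) sits on the wrong side of the
# margin; its slack absorbs the violation instead of breaking the constraints.
points = [-1.0, -0.5, 0.2, 1.0]
labels = [-1, -1, -1, +1]
obj, slacks = soft_margin_objective(w=1.0, b=0.0, points=points,
                                    labels=labels, C=1.0)
print(slacks)  # only the points inside or beyond the margin get nonzero slack
```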
6. ECE 8443: Lecture 16, Slide 5
Nonlinear Decision Surfaces
[Figure: a mapping φ(·) transforms data from the input space to a higher-dimensional feature space.]
• Thus far we have only considered linear decision surfaces. How do we
generalize this to a nonlinear surface?
• Our approach will be to transform the data to a higher dimensional space
where the data can be separated by a linear surface.
• Define a kernel function:
$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$
Examples of kernel functions include polynomial:
$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^t \mathbf{x}_j + 1)^d$
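A small sketch verifying the kernel identity above for the polynomial kernel with d = 2 in two dimensions: the explicit 6-D feature map φ below gives the same value as the kernel computed directly in the input space (the test vectors are arbitrary):

```python
import math

def poly_kernel(x, z, d=2):
    """K(x, z) = (x^t z + 1)^d, computed in the input space."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** d

def phi(x):
    """Explicit feature map for d = 2 in two dimensions (6-D feature space)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
k = poly_kernel(x, z)                              # kernel in the input space
dot = sum(a * b for a, b in zip(phi(x), phi(z)))   # same value via phi
print(k, dot)  # both equal (x.z + 1)^2 = 4.0
```

This is the point of the "kernel trick": the 6-D dot product is never formed explicitly.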
7. ECE 8443: Lecture 16, Slide 6
Kernel Functions
Other popular kernels are a radial basis function (popular in neural networks):
$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma^2\right)$
and a sigmoid function:
$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\!\left(k\,\mathbf{x}_i^t \mathbf{x}_j - \delta\right)$
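A small sketch of the RBF kernel above, showing that it measures similarity through distance (the bandwidth σ values below are chosen only for illustration):

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x, z = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(x, x))  # 1.0: a point is maximally similar to itself
# A wider sigma makes similarity decay more slowly with distance.
print(rbf_kernel(x, z, sigma=1.0) > rbf_kernel(x, z, sigma=0.5))  # True
```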
• Our optimization does not change significantly:
$\max_{\boldsymbol\alpha} W(\boldsymbol\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
subject to: $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$
• The final classifier has a similar form:
$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$
• Let’s work some examples.
8. ECE 8443: Lecture 16, Slide 7
SVM Limitations
[Figure: training set error and open-loop (test) error versus model complexity; the optimum lies where the open-loop error is minimized.]
• Uses a binary (yes/no) decision rule
• Generates a distance from the hyperplane, but this distance is often not a
good measure of our “confidence” in the classification
• Can produce a “probability” as a function of the distance (e.g., using
sigmoid fits), but such estimates are often inadequate
• Number of support vectors grows linearly with the size of the data set
• Requires the estimation of trade-off parameter, C, via held-out sets
9. ECE 8443: Lecture 16, Slide 8
Evidence Maximization
• Build a fully specified probabilistic model – incorporate prior
information/beliefs as well as a notion of confidence in predictions.
• MacKay posed a special form for regularization in neural networks – sparsity.
• Evidence maximization: evaluate candidate models based on their
“evidence”, P(D|Hi).
• Evidence approximation:
$P(D|H_i) \approx P(D|\hat{\mathbf{w}}, H_i)\, P(\hat{\mathbf{w}}|H_i)\, \Delta\mathbf{w}$
• Likelihood of data given best-fit parameter set: $P(D|\hat{\mathbf{w}}, H_i)$
• Penalty that measures how well our posterior model
fits our prior assumptions: $P(\hat{\mathbf{w}}|H_i)\, \Delta\mathbf{w}$, where $\Delta\mathbf{w}$
is the width of the posterior $P(\mathbf{w}|D, H_i)$ relative to the
width of the prior $P(\mathbf{w}|H_i)$.
• We can set the prior in favor of sparse,
smooth models.
• Incorporates an automatic relevance
determination (ARD) prior over each weight:
$P(\mathbf{w}|\boldsymbol\alpha) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$
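A minimal sketch of the ARD prior above (the weights and precisions below are hypothetical), showing the mechanism behind sparsity: a large precision αᵢ concentrates the prior sharply at zero and heavily penalizes a nonzero weight:

```python
import math

def log_gaussian(w, mean, precision):
    """log N(w | mean, variance = 1/precision)."""
    var = 1.0 / precision
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (w - mean) ** 2 / var

def log_ard_prior(weights, alphas):
    """log P(w | alpha) = sum_i log N(w_i | 0, alpha_i^{-1})."""
    return sum(log_gaussian(w, 0.0, a) for w, a in zip(weights, alphas))

# The same nonzero weights are far less probable under large precisions:
# ARD drives irrelevant weights toward exactly zero.
w = [0.5, 0.5]
print(log_ard_prior(w, [1.0, 1.0]) > log_ard_prior(w, [100.0, 100.0]))  # True
```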
10. ECE 8443: Lecture 16, Slide 9
Relevance Vector Machines
• Still a kernel-based learning machine:
$y(\mathbf{x}; \mathbf{w}) = w_0 + \sum_{i=1}^{N} w_i K(\mathbf{x}, \mathbf{x}_i)$
$P(t_i = 1 \mid \mathbf{x}; \mathbf{w}) = \frac{1}{1 + e^{-y(\mathbf{x}_i; \mathbf{w})}}$
• Incorporates an automatic relevance determination (ARD) prior over each
weight (MacKay):
$P(\mathbf{w}|\boldsymbol\alpha) = \prod_{i=0}^{N} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})$
• A flat (non-informative) prior over $\boldsymbol\alpha$ completes the Bayesian specification.
• The goal in training becomes finding:
$\hat{\mathbf{w}}, \hat{\boldsymbol\alpha} = \arg\max_{\mathbf{w}, \boldsymbol\alpha} p(\mathbf{w}, \boldsymbol\alpha \mid \mathbf{t}, X), \quad \text{where} \quad p(\mathbf{w}, \boldsymbol\alpha \mid \mathbf{t}, X) = \frac{p(\mathbf{t} \mid \mathbf{w}, \boldsymbol\alpha, X)\, p(\mathbf{w}, \boldsymbol\alpha \mid X)}{p(\mathbf{t} \mid X)}$
• Estimation of the “sparsity” parameters $\boldsymbol\alpha$ is inherent in the optimization – no
need for a held-out set.
• A closed-form solution to this maximization problem is not available.
Rather, we iteratively reestimate $\hat{\mathbf{w}}$ and $\hat{\boldsymbol\alpha}$.
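The RVM's predictive step can be sketched directly from the two model equations above. The relevance vectors and weights below are hypothetical stand-ins (in a trained RVM, the ARD prior would have driven most weights exactly to zero):

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-(x - z)^2 / (2 * sigma^2)) for scalar inputs."""
    return math.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))

def rvm_predict(x, centers, weights, w0, sigma=1.0):
    """P(t = 1 | x; w) = sigmoid(y), with y(x; w) = w0 + sum_i w_i K(x, x_i).

    Only the "relevant" points (nonzero w_i) contribute to the sum."""
    y = w0 + sum(w * rbf_kernel(x, c, sigma) for w, c in zip(weights, centers))
    return 1.0 / (1.0 + math.exp(-y))

centers = [-1.0, +1.0]   # hypothetical relevance vectors
weights = [-2.0, +2.0]   # hypothetical learned weights
print(rvm_predict(-1.0, centers, weights, w0=0.0))  # < 0.5 => class t = 0
print(rvm_predict(+1.0, centers, weights, w0=0.0))  # > 0.5 => class t = 1
```

Unlike the SVM's raw distance, this output is a genuine posterior probability.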
11. ECE 8443: Lecture 16, Slide 10
Summary
• Support Vector Machines are one example of a kernel-based learning machine
that is trained in a discriminative fashion.
• Integrates notions of risk minimization, large-margin and soft-margin
classification.
• Two fundamental innovations:
maximize the margin between the classes using actual data points,
transform the data into a higher-dimensional space in which the data is
linearly separable.
• Training can be computationally expensive but classification is very fast.
• Note that SVMs are inherently non-probabilistic (e.g., non-Bayesian).
• SVMs can be used to estimate posteriors by mapping the SVM output to a
likelihood-like quantity using a nonlinear function (e.g., sigmoid).
• SVMs are not inherently suited to an N-way classification problem. Typical
approaches include a pairwise (one vs. one) comparison or a “one vs. world”
(one vs. all) approach.
12. ECE 8443: Lecture 16, Slide 11
Summary
• Many alternate forms include Transductive SVMs, Sequential SVMs, Support
Vector Regression, Relevance Vector Machines, and data-driven kernels.
• Key lesson learned: a linear algorithm in the feature space is equivalent to a
nonlinear algorithm in the input space. Standard linear algorithms can be
generalized (e.g., kernel principal component analysis, kernel independent
component analysis, kernel canonical correlation analysis, kernel k-means).
• What we didn’t discuss:
How do you train SVMs?
Computational complexity?
How to deal with large amounts of data?
See Ganapathiraju for an excellent, easy-to-understand discourse on SVMs
and Hamaker (Chapter 3) for a nice overview of RVMs. There are many other
tutorials available online (see the links on the title slide) as well.
Other methods based on kernels – more to follow.