This document provides an introduction to machine learning and empirical inference. It discusses how machine learning draws conclusions from empirical data, using examples such as scientific inference and perception. It also covers hard inference problems that involve processing large, complex datasets with little prior knowledge. The document explains how machine learning can solve problems that humans cannot, by generalizing from data, and how support vector machines provide a unique solution to classification problems using kernels.
Introduction to Machine Learning
1. Introduction to Machine Learning
Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
http://www.tuebingen.mpg.de/bs
2. Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
y = Σ_i a_i k(x, x_i) + b
[Figure: scatter plot of observations (x, y) with a fitted straight line y = a·x]
Leibniz, Weyl, Chaitin
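As a minimal sketch of this kind of inference (my illustration, not part of the original slides): estimating the slope a of an assumed law y = a·x from noisy observations by least squares. The data and noise level below are invented.

```python
import numpy as np

# Hypothetical noisy observations of an underlying law y = a * x (here a = 2.0)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 * x + rng.normal(scale=1.0, size=x.shape)

# Least-squares estimate of the slope: a_hat = <x, y> / <x, x>
a_hat = np.dot(x, y) / np.dot(x, x)
print(f"estimated slope a ≈ {a_hat:.3f}")
```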
3. Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
“If your experiment needs statistics [inference],
you ought to have done a better experiment.” (Rutherford)
3
4. Empirical Inference, II
• Example 2: perception
“The brain is nothing but a statistical decision organ”
(H. Barlow)
4
5. Hard Inference Problems
Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research
Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and a multiple-kernel Support Vector Machine.
PRC = Precision-Recall Curve (precision: the fraction of correct positive predictions among all positively predicted cases)
• High dimensionality – consider many factors simultaneously to find the regularity
• Complex regularities – nonlinear, nonstationary, etc.
• Little prior knowledge – e.g., no mechanistic models for the data
• Need large data sets – processing requires computers and automatic inference methods
5
6. Hard Inference Problems, II
• We can solve scientific inference problems that humans can’t solve
• Even if it’s just because of data set size / dimensionality, this is a
quantum leap
6
7. Generalization (thanks to O. Bousquet)
• observe 1, 2, 4, 7,..
• What’s next?
(differences so far: +1, +2, +3)
• 1, 2, 4, 7, 11, 16, …: a_{n+1} = a_n + n (“lazy caterer’s sequence”)
• 1, 2, 4, 7, 12, 20, …: a_{n+2} = a_{n+1} + a_n + 1
• 1, 2, 4, 7, 13, 24, …: “Tribonacci” sequence
• 1, 2, 4, 7, 14, 28: set of divisors of 28
• 1, 2, 4, 7, 1, 1, 5, …: decimal expansions of π = 3.14159… and e = 2.718… interleaved
• The On-Line Encyclopedia of Integer Sequences: >600 hits…
7
8. Generalization, II
• Question: which continuation is correct (“generalizes”)?
• Answer: there’s no way to tell (“induction problem”)
• Question of statistical learning theory: how to come up
with a law that is (probably) correct (“demarcation problem”)
(more accurately: a law that is probably as correct on the test data as it is on the training data)
8
9. 2-class classification
Learn f : X → {±1} based on m observations (x_1, y_1), …, (x_m, y_m) generated i.i.d. from some unknown distribution P(x, y).
Goal: minimize the expected error (“risk”) R[f] = ∫ ½ |f(x) − y| dP(x, y).
V. Vapnik
Problem: P is unknown.
Induction principle: minimize the training error (“empirical risk”) R_emp[f] = (1/m) Σ_{i=1}^{m} ½ |f(x_i) − y_i| over some class of functions. Q: is this “consistent”?
9
10. The law of large numbers
For each fixed f and all ε > 0, R_emp[f] → R[f] in probability as m → ∞.
Does this imply “consistency” of empirical risk minimization (optimality in the limit)?
No – we need a uniform law of large numbers:
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } → 0 as m → ∞, for all ε > 0.
10
12. -> LaTeX
12
13. Support Vector Machines
[Figure: two classes of training points (class 1, class 2) are mapped by Φ into a feature space, where the kernel k(x, x′) = ⟨Φ(x), Φ(x′)⟩ computes dot products and a separating hyperplane is found]
• sparse expansion of the solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)
• unique solution found by convex QP
15. Applications in Computational Geometry / Graphics
Steinke, Walder, Blanz et al.,
Eurographics’05, ’06, ‘08, ICML ’05,‘08,
NIPS ’07
15
17. Kernel Quiz
17
18. Kernel Methods
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems
19. Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic
regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity
of the function classes that a learning machine can implement.
5. support vector machines use a particular type of function class:
classifiers with large “margins” in a feature space induced by a
kernel.
[47, 48]
20. Example: Regression Estimation
[Figure: noisy input-output data points (x, y) together with a fitted regression function]
• Data: input-output pairs (xi, yi) ∈ R × R
• Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)
• Learning: choose a function f : R → R such that the error,
averaged over P, is minimized.
• Problem: P is unknown, so the average cannot be computed
— need an “induction principle”
21. Pattern Recognition
Learn f : X → {±1} from examples
(x1, y1), . . . , (xm, ym) ∈ X×{±1}, generated i.i.d. from P(x, y),
such that the expected misclassification error on a test set, also
drawn from P(x, y),
R[f] = ∫ ½ |f(x) − y| dP(x, y),
is minimal (Risk Minimization (RM)).
Problem: P is unknown. −→ need an induction principle.
Empirical risk minimization (ERM): replace the average over P(x, y) by an average over the training sample, i.e. minimize the training error
R_emp[f] = (1/m) Σ_{i=1}^{m} ½ |f(x_i) − y_i|
22. Convergence of Means to Expectations
Law of large numbers:
Remp[f ] → R[f ]
as m → ∞.
Does this imply that empirical risk minimization will give us the
optimal result in the limit of infinite sample size (“consistency”
of empirical risk minimization)?
No.
Need a uniform version of the law of large numbers. Uniform over
all functions that the learning machine can implement.
23. Consistency and Uniform Convergence
[Figure: risk R[f] and empirical risk R_emp[f] plotted over the function class; consistency requires that the empirical minimizer f_m approaches the true minimizer f_opt]
24. The Importance of the Set of Functions
What about allowing all functions from X to {±1}?
Training set (x1, y1), . . . , (xm, ym) ∈ X × {±1}
Test patterns x̄_1, …, x̄_m ∈ X, such that {x̄_1, …, x̄_m} ∩ {x_1, …, x_m} = {}.
For any f there exists an f* s.t.:
1. f*(x_i) = f(x_i) for all i
2. f*(x̄_j) = −f(x̄_j) for all j.
Based on the training set alone, there is no means of choosing
which one is better. On the test set, however, they give opposite
results. There is ’no free lunch’ [24, 56].
−→ a restriction must be placed on the functions that we allow
25. Restricting the Class of Functions
Two views:
1. Statistical Learning (VC) Theory: take into account the ca-
pacity of the class of functions that the learning machine can
implement
2. The Bayesian Way: place Prior distributions P(f ) over the
class of functions
26. Detailed Analysis
• loss ξ_i := ½ |f(x_i) − y_i| ∈ {0, 1}
• the ξ_i are independent Bernoulli trials
• empirical mean (1/m) Σ_{i=1}^{m} ξ_i (by definition: equals R_emp[f])
• expected value E[ξ] (equals R[f])
27. Chernoff ’s Bound
P{ |(1/m) Σ_{i=1}^{m} ξ_i − E[ξ]| ≥ ε } ≤ 2 exp(−2mε²)
• here, P refers to the probability of getting a sample ξ_1, …, ξ_m with the property |(1/m) Σ_{i=1}^{m} ξ_i − E[ξ]| ≥ ε (it is a product measure)
Useful corollary: Given a 2m-sample of Bernoulli trials, we have
P{ |(1/m) Σ_{i=1}^{m} ξ_i − (1/m) Σ_{i=m+1}^{2m} ξ_i| ≥ ε } ≤ 4 exp(−mε²/2).
28. Chernoff ’s Bound, II
Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than ε > 0 is bounded by
P{ |R_emp[f] − R[f]| ≥ ε } ≤ 2 exp(−2mε²).
• refers to one fixed f
• not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization
29. Uniform Convergence (Vapnik & Chervonenkis)
Necessary and sufficient conditions for nontrivial consistency of
empirical risk minimization (ERM):
One-sided convergence, uniformly over all functions that can be
implemented by the learning machine.
lim_{m→∞} P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = 0 for all ε > 0.
• note that this takes into account the whole set of functions that
can be implemented by the learning machine
• this is hard to check for a learning machine
Are there properties of learning machines (≡ sets of functions)
which ensure uniform convergence of risk?
30. How to Prove a VC Bound
Take a closer look at P{ sup_{f∈F} (R[f] − R_emp[f]) > ε }.
Plan:
• if the function class F contains only one function, then Chernoff’s bound suffices:
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 2 exp(−2mε²).
• if there are finitely many functions, we use the ’union bound’
• even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts)
31. The Case of Two Functions
Suppose F = {f_1, f_2}. Rewrite
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = P(C_ε^1 ∪ C_ε^2),
where
C_ε^i := {(x_1, y_1), …, (x_m, y_m) | (R[f_i] − R_emp[f_i]) > ε}
denotes the event that the risks of f_i differ by more than ε.
The RHS equals
P(C_ε^1 ∪ C_ε^2) = P(C_ε^1) + P(C_ε^2) − P(C_ε^1 ∩ C_ε^2) ≤ P(C_ε^1) + P(C_ε^2).
Hence, by Chernoff’s bound,
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ P(C_ε^1) + P(C_ε^2) ≤ 2 · 2 exp(−2mε²).
32. The Union Bound
Similarly, if F = {f_1, …, f_n}, we have
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } = P(C_ε^1 ∪ · · · ∪ C_ε^n),
and
P(C_ε^1 ∪ · · · ∪ C_ε^n) ≤ Σ_{i=1}^{n} P(C_ε^i).
Use Chernoff for each summand, to get an extra factor n in the bound.
Note: this becomes an equality if and only if all the events C_ε^i involved are disjoint.
33. Infinite Function Classes
• Note: the empirical risk only refers to m points. On these points, the functions of F can take at most 2^m different values
• for R_emp, the function class thus “looks” finite
• how about R?
• need to use a trick
34. Symmetrization
Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12]))
For mε² ≥ 2 we have
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 2 P{ sup_{f∈F} (R_emp[f] − R′_emp[f]) > ε/2 }.
Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, R_emp measures the loss on the first half of the sample, and R′_emp on the second half.
35. Shattering Coefficient
• Hence, we only need to consider the maximum size of F on 2m points. Call it N(F, 2m).
• N(F, 2m) = max. number of different outputs (y_1, …, y_{2m}) that the function class can generate on 2m points — in other words, the max. number of different ways the function class can separate 2m points into two classes.
• N(F, 2m) ≤ 2^{2m}
• if N(F, 2m) = 2^{2m}, then the function class is said to shatter 2m points.
36. Putting Everything Together
We now use (1) symmetrization, (2) the shattering coefficient, and
(3) the union bound, to get
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε }
≤ 2 P{ sup_{f∈F} (R_emp[f] − R′_emp[f]) > ε/2 }
= 2 P{ (R_emp[f_1] − R′_emp[f_1]) > ε/2 ∨ … ∨ (R_emp[f_{N(F,2m)}] − R′_emp[f_{N(F,2m)}]) > ε/2 }
≤ 2 Σ_{n=1}^{N(F,2m)} P{ (R_emp[f_n] − R′_emp[f_n]) > ε/2 }.
37. ctd.
Use Chernoff’s bound for each term:*
P{ |(1/m) Σ_{i=1}^{m} ξ_i − (1/m) Σ_{i=m+1}^{2m} ξ_i| ≥ ε } ≤ 2 exp(−mε²/2).
This yields
P{ sup_{f∈F} (R[f] − R_emp[f]) > ε } ≤ 4 N(F, 2m) exp(−mε²/8).
• provided that N(F, 2m) does not grow exponentially in m, this is nontrivial
• such bounds are called VC type inequalities
• two types of randomness: (1) the P refers to the drawing of the training examples, and (2) R[f] is an expectation over the drawing of test examples.
* Note that the f_i depend on the 2m-sample. A rigorous treatment would need to use a second randomization over permutations of the 2m-sample, see [36].
38. Confidence Intervals
Rewrite the bound: specify the probability with which we want R to be close to R_emp, and solve for ε:
With a probability of at least 1 − δ,
R[f] ≤ R_emp[f] + √( (8/m) ( ln N(F, 2m) + ln(4/δ) ) ).
This bound holds independent of f; in particular, it holds for the function f_m minimizing the empirical risk.
39. Discussion
• tighter bounds are available (better constants etc.)
• cannot minimize the bound over f
• other capacity concepts can be used
40. VC Entropy
On an example (x, y), f causes a loss
ξ(x, y, f(x)) = ½ |f(x) − y| ∈ {0, 1}.
For a larger sample (x_1, y_1), …, (x_m, y_m), the different functions f ∈ F lead to a set of loss vectors
ξ_f = (ξ(x_1, y_1, f(x_1)), …, ξ(x_m, y_m, f(x_m))),
whose cardinality we denote by
N(F, (x_1, y_1), …, (x_m, y_m)).
The VC entropy is defined as
H_F(m) = E[ ln N(F, (x_1, y_1), …, (x_m, y_m)) ],
where the expectation is taken over the random generation of the m-sample (x_1, y_1), …, (x_m, y_m) from P.
H_F(m)/m → 0 ⇐⇒ uniform convergence of risks (hence consistency)
41. Further PR Capacity Concepts
• exchange ’E’ and ’ln’: annealed entropy.
H_F^{ann}(m)/m → 0 ⇐⇒ exponentially fast uniform convergence
• take ’max’ instead of ’E’: growth function.
Note that G_F(m) = ln N(F, m).
G_F(m)/m → 0 ⇐⇒ exponential convergence for all underlying distributions P.
G_F(m) = m · ln(2) for all m ⇐⇒ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that, by using functions of the learning machine, they can be separated in all 2^m possible ways (shattered).
42. Structure of the Growth Function
Either G_F(m) = m · ln(2) for all m ∈ N,
or there exists some maximal m for which the above is possible. Call this number the VC-dimension, and denote it by h. For m > h,
G_F(m) ≤ h ( ln(m/h) + 1 ).
Nothing “in between” linear growth and logarithmic growth is possible.
43. VC-Dimension: Example
Half-spaces in R2:
f (x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R
• Clearly, we can shatter three non-collinear points.
• But we can never shatter four points.
• Hence the VC dimension is h = 3 (in this case, equal to the
number of parameters)
[Figure: half-spaces in R² realizing all labelings of three non-collinear points, illustrating that three points can be shattered]
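The claim h = 3 can be checked by brute force. The sketch below (the two point sets and the use of a small feasibility LP are illustrative choices, not part of the lecture) enumerates all labelings and tests whether each can be realized by a half-space f(x) = sgn(a + b·x_1 + c·x_2):

import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # feasibility LP: find (a, b, c) with y_i * (a + b*x_1 + c*x_2) >= 1 for all i
    A_ub = -labels[:, None] * np.hstack([np.ones((len(points), 1)), points])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    # try every labeling in {-1, +1}^n
    return all(separable(points, np.array(y)) for y in product([-1.0, 1.0], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # non-collinear
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # four points (a square)
print("3 points shattered:", shattered(three))   # True
print("4 points shattered:", shattered(four))    # False (the XOR labeling cannot be realized)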
44. A Typical Bound for Pattern Recognition
For any f ∈ F and m > h, with a probability of at least 1 − δ,
R[f] ≤ R_emp[f] + √( ( h ( log(2m/h) + 1 ) − log(δ/4) ) / m )
holds.
• does this mean that we can learn anything?
• The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization)
45. SRM
[Figure: structural risk minimization — over a structure of nested function classes S_{n−1} ⊂ S_n ⊂ S_{n+1}, the training error decreases with capacity h while the capacity term grows; the bound on the test error, and the error R(f*), are minimized in between]
46. Finding a Good Function Class
• recall: separating hyperplanes in R2 have a VC dimension of 3.
• more generally: separating hyperplanes in RN have a VC di-
mension of N + 1.
• hence: separating hyperplanes in high-dimensional feature
spaces have extremely large VC dimension, and may not gener-
alize well
• however, margin hyperplanes can still have a small VC dimen-
sion
47. Kernels and Feature Spaces
Preprocess the data with
Φ:X → H
x → Φ(x),
where H is a dot product space, and learn the mapping from Φ(x)
to y [6].
• usually, dim(X) ≪ dim(H)
• “Curse of Dimensionality”?
• crucial issue: capacity, not dimensionality
48. Example: All Degree 2 Monomials
Φ : R² → R³
(x_1, x_2) → (z_1, z_2, z_3) := (x_1², √2 x_1 x_2, x_2²)
[Figure: data that are not linearly separable in the input space (x_1, x_2) become linearly separable in the feature space (z_1, z_2, z_3)]
49. General Product Feature Space
How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d.
E.g. N = 16 × 16, and d = 5 −→ dimension 10^{10}
50. The Kernel Trick, N = d = 2
⟨Φ(x), Φ(x′)⟩ = (x_1², √2 x_1 x_2, x_2²) (x′_1², √2 x′_1 x′_2, x′_2²)^⊤
= ⟨x, x′⟩²
=: k(x, x′)
−→ the dot product in H can be computed in R²
51. The Kernel Trick, II
More generally: for x, x′ ∈ R^N and d ∈ N:
⟨x, x′⟩^d = ( Σ_{j=1}^{N} x_j · x′_j )^d
= Σ_{j_1,…,j_d=1}^{N} x_{j_1} · … · x_{j_d} · x′_{j_1} · … · x′_{j_d} = ⟨Φ(x), Φ(x′)⟩,
where Φ maps into the space spanned by all ordered products of d input directions
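A quick numerical check of the kernel trick for the case N = d = 2 above (the random inputs are made up for this sketch): the explicit monomial map and the squared dot product agree.

import numpy as np

def phi(x):
    # explicit degree-2 monomial feature map for x in R^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(2)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(xp)        # dot product computed in the feature space H
rhs = (x @ xp) ** 2           # kernel k(x, x') = <x, x'>^2 computed in R^2
print(lhs, rhs, np.isclose(lhs, rhs))   # the two values agree up to rounding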
52. Mercer’s Theorem
If k is a continuous kernel of a positive definite integral operator on L₂(X) (where X is some compact space), i.e.,
∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L₂(X),
then it can be expanded as
k(x, x′) = Σ_{i=1}^{∞} λ_i ψ_i(x) ψ_i(x′)
using eigenfunctions ψ_i and eigenvalues λ_i ≥ 0 [30].
53. The Mercer Feature Map
In that case
Φ(x) := (√λ_1 ψ_1(x), √λ_2 ψ_2(x), …)^⊤
satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).
Proof:
⟨Φ(x), Φ(x′)⟩ = ⟨(√λ_1 ψ_1(x), √λ_2 ψ_2(x), …)^⊤, (√λ_1 ψ_1(x′), √λ_2 ψ_2(x′), …)^⊤⟩
= Σ_{i=1}^{∞} λ_i ψ_i(x) ψ_i(x′) = k(x, x′)
54. Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)), and for
• any set of training points x_1, …, x_m ∈ X and
• any a_1, …, a_m ∈ R
satisfy
Σ_{i,j} a_i a_j K_{ij} ≥ 0, where K_{ij} := k(x_i, x_j).
K is called the Gram matrix or kernel matrix.
If, for pairwise distinct points, Σ_{i,j} a_i a_j K_{ij} = 0 =⇒ a = 0, the kernel is called strictly positive definite.
55. The Kernel Trick — Summary
• any algorithm that only depends on dot products can benefit
from the kernel trick
• this way, we can apply linear methods to vectorial as well as
non-vectorial data
• think of the kernel as a nonlinear similarity measure
• examples of common kernels:
Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^d
Gaussian: k(x, x′) = exp(−‖x − x′‖² / (2σ²))
• Kernels are also known as covariance functions [54, 52, 55, 29]
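To make the two examples concrete, here is a small sketch (the parameter values c, d, σ and the random data are arbitrary choices) that builds the corresponding Gram matrices and checks numerically that they are positive semidefinite:

import numpy as np

def poly_kernel(X, Y, c=1.0, d=3):
    return (X @ Y.T + c) ** d

def gauss_kernel(X, Y, sigma=1.0):
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))

for name, K in [("polynomial", poly_kernel(X, X)), ("gaussian", gauss_kernel(X, X))]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(name, "Gram matrix, smallest eigenvalue:", min_eig)   # >= 0 up to rounding error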
56. Properties of PD Kernels, 1
Assumption: Φ maps X into a dot product space H; x, x′ ∈ X
Kernels from Feature Maps.
k(x, x′) := ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.
Kernels from Feature Maps, II.
K(A, B) := Σ_{x∈A, x′∈B} k(x, x′), where A, B are finite subsets of X, is also a pd kernel.
(Hint: use the feature map Φ̃(A) := Σ_{x∈A} Φ(x))
57. Properties of PD Kernels, 2 [36, 39]
Assumption: k, k1, k2, . . . are pd; x, x′ ∈ X
k(x, x) ≥ 0 for all x (positivity on the diagonal)
k(x, x′)² ≤ k(x, x) k(x′, x′) (Cauchy-Schwarz inequality)
(Hint: compute the determinant of the 2×2 Gram matrix)
k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (vanishing diagonals)
The following kernels are pd:
• αk, provided α ≥ 0
• k_1 + k_2
• k(x, x′) := lim_{n→∞} k_n(x, x′), provided it exists
• k_1 · k_2
• tensor products, direct sums, convolutions [22]
58. The Feature Space for PD Kernels [4, 1, 35]
• define a feature map
Φ : X → R^X
x → k(·, x).
[Figure: for the Gaussian kernel, each input x is mapped to the bump function Φ(x) = k(·, x) centred at x]
Next steps:
• turn Φ(X) into a linear space
• endow it with a dot product satisfying ⟨Φ(x), Φ(x′)⟩ = k(x, x′), i.e., ⟨k(·, x), k(·, x′)⟩ = k(x, x′)
• complete the space to get a reproducing kernel Hilbert space
59. Turn it Into a Linear Space
Form linear combinations
f(·) = Σ_{i=1}^{m} α_i k(·, x_i),
g(·) = Σ_{j=1}^{m′} β_j k(·, x′_j)
(m, m′ ∈ N, α_i, β_j ∈ R, x_i, x′_j ∈ X).
60. Endow it With a Dot Product
⟨f, g⟩ := Σ_{i=1}^{m} Σ_{j=1}^{m′} α_i β_j k(x_i, x′_j)
= Σ_{i=1}^{m} α_i g(x_i) = Σ_{j=1}^{m′} β_j f(x′_j)
• This is well-defined, symmetric, and bilinear (more later).
• So far, it also works for non-pd kernels
61. The Reproducing Kernel Property
Two special cases:
• Assume
f(·) = k(·, x).
In this case, we have
⟨k(·, x), g⟩ = g(x).
• If moreover
g(·) = k(·, x′),
we have
⟨k(·, x), k(·, x′)⟩ = k(x, x′).
k is called a reproducing kernel
(up to here, we have not used positive definiteness)
62. Endow it With a Dot Product, II
• It can be shown that ⟨·, ·⟩ is a p.d. kernel on the set of functions {f(·) = Σ_{i=1}^{m} α_i k(·, x_i) | α_i ∈ R, x_i ∈ X}:
Σ_{ij} γ_i γ_j ⟨f_i, f_j⟩ = ⟨Σ_i γ_i f_i, Σ_j γ_j f_j⟩ =: ⟨f, f⟩
= ⟨Σ_i α_i k(·, x_i), Σ_i α_i k(·, x_i)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) ≥ 0
• furthermore, it is strictly positive definite:
f(x)² = ⟨f, k(·, x)⟩² ≤ ⟨f, f⟩ ⟨k(·, x), k(·, x)⟩,
hence ⟨f, f⟩ = 0 implies f = 0.
• Complete the space in the corresponding norm to get a Hilbert space H_k.
63. The Empirical Kernel Map
Recall the feature map
Φ : X → R^X
x → k(·, x).
• each point is represented by its similarity to all other points
• how about representing it by its similarity to a sample of points?
Consider
Φ_m : X → R^m
x → k(·, x)|_{(x_1,…,x_m)} = (k(x_1, x), …, k(x_m, x))^⊤
64. ctd.
• Φ_m(x_1), …, Φ_m(x_m) contain all necessary information about Φ(x_1), …, Φ(x_m)
• the Gram matrix G_{ij} := ⟨Φ_m(x_i), Φ_m(x_j)⟩ satisfies G = K², where K_{ij} = k(x_i, x_j)
• modify Φ_m to
Φ_m^w : X → R^m
x → K^{−1/2} (k(x_1, x), …, k(x_m, x))^⊤
• this “whitened” map (“kernel PCA map”) satisfies
⟨Φ_m^w(x_i), Φ_m^w(x_j)⟩ = k(x_i, x_j)
for all i, j = 1, …, m.
65. An Example of a Kernel Algorithm
Idea: classify points x := Φ(x) in feature space according to which
of the two class means is closer.
c_+ := (1/m_+) Σ_{i: y_i=+1} Φ(x_i),  c_− := (1/m_−) Σ_{i: y_i=−1} Φ(x_i)
[Figure: the two class means c_+ and c_− in feature space, the vector w := c_+ − c_−, their midpoint c, and a test point x]
Compute the sign of the dot product between w := c_+ − c_− and x − c.
66. An Example of a Kernel Algorithm, ctd. [36]
f(x) = sgn( (1/m_+) Σ_{i: y_i=+1} ⟨Φ(x), Φ(x_i)⟩ − (1/m_−) Σ_{i: y_i=−1} ⟨Φ(x), Φ(x_i)⟩ + b )
= sgn( (1/m_+) Σ_{i: y_i=+1} k(x, x_i) − (1/m_−) Σ_{i: y_i=−1} k(x, x_i) + b ),
where
b = ½ ( (1/m_−²) Σ_{(i,j): y_i=y_j=−1} k(x_i, x_j) − (1/m_+²) Σ_{(i,j): y_i=y_j=+1} k(x_i, x_j) ).
• provides a geometric interpretation of Parzen windows
67. An Example of a Kernel Algorithm, ctd.
• Demo
• Exercise: derive the Parzen windows classifier by computing the
distance criterion directly
• SVMs (ppt)
68. An example of a kernel algorithm, revisited
[Figure: positive and negative samples in feature space with their RKHS means µ(X), µ(Y) and the difference vector w between them]
X compact subset of a separable metric space, m, n ∈ N.
Positive class X := {x_1, …, x_m} ⊂ X
Negative class Y := {y_1, …, y_n} ⊂ X
RKHS means µ(X) = (1/m) Σ_{i=1}^{m} k(x_i, ·), µ(Y) = (1/n) Σ_{i=1}^{n} k(y_i, ·).
We get a problem if µ(X) = µ(Y)!
69. When do the means coincide?
k(x, x′) = ⟨x, x′⟩: the means coincide
k(x, x′) = (⟨x, x′⟩ + 1)^d: all empirical moments up to order d coincide
k strictly pd: X = Y.
The mean “remembers” each point that contributed to it.
70. Proposition 2 Assume X, Y are defined as above, k is strictly pd, and the x_i are pairwise distinct, as are the y_j.
If for some α_i, β_j ∈ R − {0} we have
Σ_{i=1}^{m} α_i k(x_i, ·) = Σ_{j=1}^{n} β_j k(y_j, ·),   (1)
then X = Y.
71. Proof (by contradiction)
W.l.o.g., assume that x_1 ∉ Y. Subtract Σ_{j=1}^{n} β_j k(y_j, ·) from (1), and make it a sum over pairwise distinct points, to get
0 = Σ_i γ_i k(z_i, ·),
where z_1 = x_1, γ_1 = α_1 ≠ 0, and
z_2, · · · ∈ X ∪ Y − {x_1}, γ_2, · · · ∈ R.
Take the RKHS dot product with Σ_j γ_j k(z_j, ·) to get
0 = Σ_{ij} γ_i γ_j k(z_i, z_j),
with γ ≠ 0, hence k cannot be strictly pd.
72. The mean map
µ : X = (x_1, …, x_m) → (1/m) Σ_{i=1}^{m} k(x_i, ·)
satisfies
⟨µ(X), f⟩ = (1/m) Σ_{i=1}^{m} ⟨k(x_i, ·), f⟩ = (1/m) Σ_{i=1}^{m} f(x_i)
and
‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} |⟨µ(X) − µ(Y), f⟩| = sup_{‖f‖≤1} | (1/m) Σ_{i=1}^{m} f(x_i) − (1/n) Σ_{i=1}^{n} f(y_i) |.
Note: a large distance means we can find a function distinguishing the samples.
73. Witness function
f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, thus f(x) ∝ ⟨µ(X) − µ(Y), k(x, ·)⟩:
[Figure: witness function f for Gauss and Laplace data — the two probability densities and f plotted over X]
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.
74. The mean map for measures
p, q Borel probability measures,
E_{x,x′∼p}[k(x, x′)] < ∞, E_{x,x′∼q}[k(x, x′)] < ∞ (‖k(x, ·)‖ ≤ M < ∞ is sufficient)
Define
µ : p → E_{x∼p}[k(x, ·)].
Note
⟨µ(p), f⟩ = E_{x∼p}[f(x)]
and
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Recall that in the finite sample case, for strictly p.d. kernels, µ was injective — how about now?
[43, 17]
75. Theorem 3 [15, 13]
p = q ⇐⇒ sup_{f∈C(X)} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] | = 0,
where C(X) is the space of continuous bounded functions on X.
Combine this with
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Replace C(X) by the unit ball in an RKHS that is dense in C(X) — universal kernel [45], e.g., Gaussian.
Theorem 4 [19] If k is universal, then
p = q ⇐⇒ ‖µ(p) − µ(q)‖ = 0.
76. • µ is invertible on its image
M = {µ(p) | p is a probability distribution}
(the “marginal polytope”, [53])
• generalization of the moment generating function of a RV x with distribution p:
M_p(·) = E_{x∼p}[ e^{⟨x, ·⟩} ].
This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different — provided that µ is invertible.
77. Fourier Criterion
Assume we have densities, the kernel is shift invariant (k(x, y) = k(x − y)), and all Fourier transforms below exist.
Note that µ is invertible iff
∫ k(x − y) p(y) dy = ∫ k(x − y) q(y) dy =⇒ p = q,
i.e.,
k̂ · (p̂ − q̂) = 0 =⇒ p = q
(Sriperumbudur et al., 2008)
E.g., µ is invertible if k̂ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if the support of k̂ has non-empty interior, µ is invertible for all distributions with compact support).
78. Fourier Optics
Application: p source of incoherent light, I indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is ∝ p ∗ |Î|².
Set k = |Î|²; then this equals µ(p).
This k̂ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).
79. Application 1: Two-sample problem [19]
X, Y i.i.d. m-samples from p, q, respectively.
‖µ(p) − µ(q)‖² = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)]
= E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))]
with
h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′).
Define
D(p, q)² := E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))]
D̂(X, Y)² := 1/(m(m−1)) Σ_{i≠j} h((x_i, y_i), (x_j, y_j)).
D̂(X, Y)² is an unbiased estimator of D(p, q)².
It’s easy to compute, and works on structured data.
80. Theorem 5 Assume k is bounded. D̂(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}).
This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D̂(X, Y)² − D(p, q)²) converges in distribution to a zero-mean Gaussian with variance
σ_u² = 4 ( E_z[(E_{z′} h(z, z′))²] − (E_{z,z′} h(z, z′))² ).
When p = q, then m (D̂(X, Y)² − D(p, q)²) = m D̂(X, Y)² converges in distribution to
Σ_{l=1}^{∞} λ_l (q_l² − 2),   (2)
where q_l ∼ N(0, 2) i.i.d., the λ_i are the solutions to the eigenvalue equation
∫_X k̃(x, x′) ψ_i(x) dp(x) = λ_i ψ_i(x′),
and k̃(x_i, x_j) := k(x_i, x_j) − E_x[k(x_i, x)] − E_x[k(x, x_j)] + E_{x,x′}[k(x, x′)] is the centred RKHS kernel.
81. Application 2: Dependence Measures
Assume that (x, y) are drawn from pxy , with marginals px, py .
Want to know whether pxy factorizes.
[2, 16]: kernel generalized variance
[20, 21]: kernel constrained covariance, HSIC
Main idea [25, 34]:
x and y independent ⇐⇒ ∀ bounded continuous functions f, g,
we have Cov(f (x), g(y)) = 0.
82. k kernel on X × Y.
µ(p_{xy}) := E_{(x,y)∼p_{xy}}[k((x, y), ·)]
µ(p_x × p_y) := E_{x∼p_x, y∼p_y}[k((x, y), ·)].
Use ∆ := ‖µ(p_{xy}) − µ(p_x × p_y)‖ as a measure of dependence.
For k((x, y), (x′, y′)) = k_x(x, x′) k_y(y, y′):
∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m^{−2} tr(H K_x H K_y), where H = I − (1/m) 1 1^⊤ [20, 44].
83. Witness function of the equivalent optimisation problem:
[Figure: dependence witness and sample — contour plot of the witness function over the (X, Y) plane, with the sample points overlaid]
Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007)
84. Application 3: Covariate Shift Correction and Local Learning
training set X = {(x_1, y_1), …, (x_m, y_m)} drawn from p,
test set X′ = {(x′_1, y′_1), …, (x′_n, y′_n)} drawn from p′ ≠ p.
Assume p_{y|x} = p′_{y|x}.
[40]: reweight training set
85. Minimize
‖ Σ_{i=1}^{m} β_i k(x_i, ·) − µ(X′) ‖² + λ Σ_i β_i²  subject to β_i ≥ 0, Σ_i β_i = 1.
Equivalent QP:
minimize_β  ½ β^⊤ (K + λ1) β − β^⊤ l
subject to β_i ≥ 0 and Σ_i β_i = 1,
where K_{ij} := k(x_i, x_j), l_i = ⟨k(x_i, ·), µ(X′)⟩.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23].
X′ = {x′} leads to a local sample weighting scheme.
86. The Representer Theorem
Theorem 7 Given: a p.d. kernel k on X × X, a training set (x_1, y_1), …, (x_m, y_m) ∈ X × R, a strictly monotonically increasing real-valued function Ω on [0, ∞[, and an arbitrary cost function c : (X × R²)^m → R ∪ {∞}.
Any f ∈ H_k minimizing the regularized risk functional
c((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) + Ω(‖f‖)   (3)
admits a representation of the form
f(·) = Σ_{i=1}^{m} α_i k(x_i, ·).
87. Remarks
• significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples
• original form, with mean squared loss
c((x_1, y_1, f(x_1)), …, (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^{m} (y_i − f(x_i))²,
and Ω(‖f‖) = λ ‖f‖² (λ > 0): [27]
• generalization to non-quadratic cost functions: [10]
• present form: [36]
88. Proof
Decompose f ∈ H into a part in the span of the k(x_i, ·) and an orthogonal one:
f = Σ_i α_i k(x_i, ·) + f⊥,
where for all j
⟨f⊥, k(x_j, ·)⟩ = 0.
Application of f to an arbitrary training point x_j yields
f(x_j) = ⟨f, k(x_j, ·)⟩
= ⟨Σ_i α_i k(x_i, ·) + f⊥, k(x_j, ·)⟩
= Σ_i α_i ⟨k(x_i, ·), k(x_j, ·)⟩,
independent of f⊥.
89. Proof: second part of (3)
Since f⊥ is orthogonal to Σ_i α_i k(x_i, ·), and Ω is strictly monotonic, we get
Ω(‖f‖) = Ω( ‖Σ_i α_i k(x_i, ·) + f⊥‖ )
= Ω( √( ‖Σ_i α_i k(x_i, ·)‖² + ‖f⊥‖² ) )
≥ Ω( ‖Σ_i α_i k(x_i, ·)‖ ),   (4)
with equality occurring if and only if f⊥ = 0.
Hence, any minimizer must have f⊥ = 0. Consequently, any solution takes the form
f = Σ_i α_i k(x_i, ·).
90. Application: Support Vector Classification
Here, y_i ∈ {±1}. Use
c((x_i, y_i, f(x_i))_i) = (1/λ) Σ_i max(0, 1 − y_i f(x_i)),
and the regularizer Ω(‖f‖) = ‖f‖².
λ → 0 leads to the hard margin SVM
91. Further Applications
Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e.
• exp(−c((x_i, y_i, f(x_i))_i)) — likelihood of the data
• exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k
• minimizer of (3) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
c((x_i, y_i, f(x_i))_{i=1,…,m}) = 0 if (1/m) Σ_i ( f(x_i) − (1/m) Σ_j f(x_j) )² = 1, and ∞ otherwise,
with Ω an arbitrary strictly monotonically increasing function.
92. Conclusion
• the kernel corresponds to
– a similarity measure for the data, or
– a (linear) representation of the data, or
– a hypothesis space for learning,
• kernels allow the formulation of a multitude of geometrical algo-
rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...)
93. Kernel PCA [37]
[Figure: linear PCA with k(x, y) = ⟨x, y⟩ extracts straight-line principal components in R²; kernel PCA with k(x, y) = ⟨x, y⟩^d extracts components that are nonlinear in input space, corresponding to linear PCA in the feature space H reached via Φ]
94. Kernel PCA, II
x_1, …, x_m ∈ X,  Φ : X → H,  C = (1/m) Σ_{j=1}^{m} Φ(x_j) Φ(x_j)^⊤
Eigenvalue problem
λV = CV = (1/m) Σ_{j=1}^{m} ⟨Φ(x_j), V⟩ Φ(x_j).
For λ ≠ 0, V ∈ span{Φ(x_1), …, Φ(x_m)}, thus
V = Σ_{i=1}^{m} α_i Φ(x_i),
and the eigenvalue problem can be written as
λ ⟨Φ(x_n), V⟩ = ⟨Φ(x_n), CV⟩ for all n = 1, …, m
95. Kernel PCA in Dual Variables
In terms of the m × m Gram matrix
K_{ij} := ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j),
this leads to
mλKα = K²α,
where α = (α_1, …, α_m)^⊤.
Solve
mλα = Kα
−→ (λ_n, α^n)
⟨V^n, V^n⟩ = 1 ⇐⇒ λ_n ⟨α^n, α^n⟩ = 1,
thus divide α^n by √λ_n
96. Feature extraction
Compute projections on the eigenvectors
V^n = Σ_{i=1}^{m} α_i^n Φ(x_i)
in H:
for a test point x with image Φ(x) in H we get the features
⟨V^n, Φ(x)⟩ = Σ_{i=1}^{m} α_i^n ⟨Φ(x_i), Φ(x)⟩
= Σ_{i=1}^{m} α_i^n k(x_i, x)
97. The Kernel PCA Map
Recall
Φ_m^w : X → R^m
x → K^{−1/2} (k(x_1, x), …, k(x_m, x))^⊤
If K = U D U^⊤ is K’s diagonalization, then K^{−1/2} = U D^{−1/2} U^⊤. Thus we have
Φ_m^w(x) = U D^{−1/2} U^⊤ (k(x_1, x), …, k(x_m, x))^⊤.
We can drop the leading U (since it leaves the dot product invariant) to get a map
Φ_{KPCA}^w(x) = D^{−1/2} U^⊤ (k(x_1, x), …, k(x_m, x))^⊤.
The rows of U^⊤ are the eigenvectors α^n of K, and the entries of the diagonal matrix D^{−1/2} equal λ_i^{−1/2}.
98. Toy Example with Gaussian Kernel
k(x, x′) = exp(−‖x − x′‖²)
99. Super-Resolution (Kim, Franz, Schölkopf, 2004)
[Figure: comparison between different super-resolution methods — a. original image of resolution 528 × 396; b. low-resolution image (264 × 198) stretched to the original scale; c. bicubic interpolation; d. supervised example-based learning based on a nearest-neighbor classifier; f. unsupervised KPCA reconstruction; g. enlarged portions of a-d and f (from left to right)]
100. Support Vector Classifiers
[Figure: two classes of points in input space are mapped by Φ into a feature space, where they are separated by a hyperplane] [6]
101. Separating Hyperplane
[Figure: a separating hyperplane {x | ⟨w, x⟩ + b = 0} with normal vector w; points with ⟨w, x⟩ + b > 0 lie on one side, points with ⟨w, x⟩ + b < 0 on the other]
103. Eliminating the Scaling Freedom [47]
Note: if c ≠ 0, then
{x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, …, x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1.
104. Canonical Optimal Hyperplane
[Figure: canonical optimal hyperplane {x | ⟨w, x⟩ + b = 0} together with the margin hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1}]
Note: for points x_1 (with y_i = +1) and x_2 (with y_i = −1) on the margin hyperplanes,
⟨w, x_1⟩ + b = +1 and ⟨w, x_2⟩ + b = −1,
hence ⟨w, (x_1 − x_2)⟩ = 2 and ⟨ w/‖w‖, (x_1 − x_2) ⟩ = 2/‖w‖.
105. Canonical Hyperplanes [47]
Note: if c ≠ 0, then
{x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x_1, …, x_r} if min_{x_i∈X*} |⟨w, x_i⟩ + b| = 1.
Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (“margin”) is 1/‖w‖:
min_{x_i∈X*} | ⟨ w/‖w‖, x_i ⟩ + b/‖w‖ | = 1/‖w‖.
106. Theorem 8 (Vapnik [46]) Consider hyperplanes ⟨w, x⟩ = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x_1, …, x_r}, i.e.,
min_{i=1,…,r} |⟨w, x_i⟩| = 1.
The set of decision functions f_w(x) = sgn⟨x, w⟩ defined on X* and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
h ≤ R²Λ².
Here, R is the radius of the smallest sphere around the origin containing X*.
107. [Figure: points contained in a sphere of radius R, with two separating hyperplanes of margins γ_1 and γ_2]
108. Proof Strategy (Gurvits, 1997)
Assume that x_1, …, x_r are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y_1, …, y_r ∈ {±1},
y_i ⟨w, x_i⟩ ≥ 1 for all i = 1, …, r.   (5)
Two steps:
• prove that the more points we want to shatter (5), the larger ‖Σ_{i=1}^{r} y_i x_i‖ must be
• upper bound the size of ‖Σ_{i=1}^{r} y_i x_i‖ in terms of R
Combining the two tells us how many points we can at most shatter.
109. Part I
Summing (5) over i = 1, …, r yields
⟨w, Σ_{i=1}^{r} y_i x_i⟩ ≥ r.
By the Cauchy-Schwarz inequality, on the other hand, we have
⟨w, Σ_{i=1}^{r} y_i x_i⟩ ≤ ‖w‖ ‖Σ_{i=1}^{r} y_i x_i‖ ≤ Λ ‖Σ_{i=1}^{r} y_i x_i‖.
Combine both:
r/Λ ≤ ‖Σ_{i=1}^{r} y_i x_i‖.   (6)
110. Part II
Consider independent random labels y_i ∈ {±1}, uniformly distributed (Rademacher variables).
E ‖ Σ_{i=1}^{r} y_i x_i ‖² = E ⟨ Σ_{i=1}^{r} y_i x_i, Σ_{j=1}^{r} y_j x_j ⟩
= E Σ_{i=1}^{r} ( ⟨ y_i x_i, Σ_{j≠i} y_j x_j ⟩ + ⟨ y_i x_i, y_i x_i ⟩ )
= Σ_{i=1}^{r} E ⟨ y_i x_i, Σ_{j≠i} y_j x_j ⟩ + Σ_{i=1}^{r} E[ ⟨ y_i x_i, y_i x_i ⟩ ]
= Σ_{i=1}^{r} E ‖ y_i x_i ‖² = Σ_{i=1}^{r} ‖x_i‖²
(the cross terms vanish in expectation since the labels are independent with zero mean)
111. Part II, ctd.
Since ‖x_i‖ ≤ R, we get
E ‖ Σ_{i=1}^{r} y_i x_i ‖² ≤ rR².
• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
Hence
‖ Σ_{i=1}^{r} y_i x_i ‖² ≤ rR².
112. Part I and II Combined
Part I: (r/Λ)² ≤ ‖ Σ_{i=1}^{r} y_i x_i ‖²
Part II: ‖ Σ_{i=1}^{r} y_i x_i ‖² ≤ rR²
Hence
r²/Λ² ≤ rR²,
i.e.,
r ≤ R²Λ²,
completing the proof.