This document provides an introduction to statistical machine learning and statistical learning theory. It begins by acknowledging the invitation to present and then outlines topics to be covered, including input/output spaces, loss functions, risk functionals, generalization error, and regularization. Examples of applications like handwritten digit recognition and accent recognition are presented. It discusses challenges in classification problems like imbalanced data and complex decision boundaries. The goal of statistical learning theory is to minimize theoretical risk by finding the best predictive function, while accounting for limitations like an unknown data distribution.
1. Foundations of Statistical Learning Theory
Quintessential Pillar of Modern Data Science
Ernest Fokoué
School of Mathematical Sciences
Rochester Institute of Technology
Rochester, New York, USA
Delivered by invitation of the
Statistical and Mathematical Sciences Institute (SAMSI)
Modern Mathematics Workshop (MMW 2018)
San Antonio, Texas, USA
October 10, 2018
DZ Ý Data Science MMW 2018 October 10, 2018 1 / 127
2. Acknowledgments
I wish to express my grateful thanks
and sincere gratitude to the Director
of SAMSI, Prof. Dr. David Banks,
for kindly inviting me and granting
me the golden opportunity to present
at the 2018 Modern Mathematics
Workshop in San Antonio.
I hope and pray that my modest
contribution will inspire and empower
all the attendees of my mini course.
DZ Ý Data Science MMW 2018 October 10, 2018 2 / 127
3. Basic Introduction to Statistical Machine Learning
Roadmap: This lecture will provide you with the basic elements of an
introduction to the foundational concepts of statistical machine learning.
Among other things, we’ll touch on foundational concepts such as:
Input space, output space, function space, hypothesis space, loss
function, risk functional, theoretical risk, empirical risk, Bayes risk,
training set, test set, model complexity, generalization error,
approximation error, estimation error, bounds on the generalization
error, regularization, etc.
Relevant websites
http://www.econ.upf.edu/~lugosi/mlss_slt.pdf
https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
Kernel Machines http://www.kernel-machines.org/
R Software project website: http://www.r-project.org
DZ Ý Data Science MMW 2018 October 10, 2018 3 / 127
4. Traditional Pattern Recognition Applications
Statistical Machine Learning methods and techniques have been
successfully applied to a wide variety of important fields. Amongst others:
1 The famous and somewhat ubiquitous handwritten digit recognition.
This data set is also known as MNIST, and is usually the first task in
some data analytics competitions. This data set is from USPS and
was first made popular by Yann LeCun, one of the pioneers of deep learning.
2 More recently, text mining, and specifically the topic of text
categorization/classification, has made successful use of statistical
machine learning.
3 Credit scoring is another application that has been connected with
statistical machine learning.
4 Disease diagnostics has also been tackled using statistical machine
learning.
Other applications include: audio processing, speaker recognition and
speaker identification.
DZ Ý Data Science MMW 2018 October 10, 2018 4 / 127
5. Handwritten Digit Recognition
Handwritten digit recognition is a fascinating problem that captured the
attention of the machine learning and neural network community for many
years, and has remained a benchmark problem in the field.
[Figure: sample images of the handwritten digits 0 through 9, each on a 28 × 28 pixel grid.]
DZ Ý Data Science MMW 2018 October 10, 2018 5 / 127
6. Handwritten Digit Recognition
Below is a portion of the benchmark training set
Note: The challenge here is building classification techniques that
accurately classify handwritten digits taken from the test set.
DZ Ý Data Science MMW 2018 October 10, 2018 6 / 127
11. Pattern Recognition (Classification) data set
Class X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
+ g c c t t c t c c a a a a c
+ a t g c a a t t t t t t a g
+ c c g t t t a t t t t t t c
+ t c t c a a c g t a a c a c
+ t a g g c a c c c c a g g c
+ a t a t a a a a a a g t t c
+ c a a g g t a g a a t g c t
+ t t a g c g g a t c c t a c
+ c t g c a a t t t t t c t a
+ t g t a a a c t a a t g c c
+ c a c t a a t t t a t t c c
+ a g g g g c a a g g a g g a
+ c c a t c a a a a a a a t a
+ a t g c a t t t t t c c g c
+ t c a g a a a t a t t a t g
What are the indicators that control promoter genes in the DNA?
library(kernlab); data(promotergene)
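A minimal sketch (not from the slides): one way to attack this question in R is to fit a kernel classifier to the promotergene data with kernlab; the train/test split size and kernel choice below are assumptions made purely for illustration.
## Sketch: an SVM classifier for the promoter gene data with kernlab
library(kernlab)
data(promotergene)
set.seed(1)
idx <- sample(nrow(promotergene), 80)                 # assumed training indices
fit <- ksvm(Class ~ ., data = promotergene[idx, ], kernel = "rbfdot")
mean(predict(fit, promotergene[-idx, ]) == promotergene[-idx, "Class"])  # test accuracy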
DZ Ý Data Science MMW 2018 October 10, 2018 11 / 127
12. Statistical Speaker Accent Recognition
Consider $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{-1, +1\}$, and the set
$$\mathcal{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$$
where
$$Y_i = \begin{cases} +1 & \text{if person } i \text{ is a Native US speaker} \\ -1 & \text{if person } i \text{ is a Non-Native US speaker} \end{cases}$$
and $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ is the time-domain representation of
his/her reading of an English sentence. The design matrix is
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix}$$
DZ Ý Data Science MMW 2018 October 10, 2018 12 / 127
13. Statistical Speaker Accent Recognition
Consider this design matrix
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix}$$
At RIT, we recently collected voices from n = 117 people.
Each sentence required about 11 seconds to be read.
At a sampling rate of 44,100 Hz, each sentence requires a vector of
dimension roughly p = 540,000 in the time domain.
We therefore have a gravely underdetermined system with $X \in \mathbb{R}^{n\times p}$
where $n \ll p$. Here, n = 117 and p = 540,000.
DZ Ý Data Science MMW 2018 October 10, 2018 13 / 127
14. Binary Classification in the Plane, X ⊂ R2
Given {(x1, y1), · · · , (xn, yn)}, with xi ∈ X ⊂ R2 and yi ∈ {−1, +1}
[Figure: scatter plot of the two classes (red and green points) in the (x1, x2) plane.]
What is the ”best” classifier f∗ that separates the red from the green?
DZ Ý Data Science MMW 2018 October 10, 2018 14 / 127
15. Motivating Binary Classification in the Plane
For the binary classification problem introduced earlier:
– A collection $\{(x_1, y_1), \cdots, (x_n, y_n)\}$ of i.i.d. observations is given, with
$x_i \in \mathcal{X} \subset \mathbb{R}^p$, $i = 1, \cdots, n$. $\mathcal{X}$ is the input space.
$y_i \in \{-1, +1\}$. $\mathcal{Y} = \{-1, +1\}$ is the output space.
– What is the probability law that governs the (xi, yi)’s?
– What is the functional relationship between x and y? Namely, one
considers mappings
$$f : \mathcal{X} \to \mathcal{Y}, \quad x \mapsto f(x).$$
– What is the ”best” approach to determining from the available
observations, the relationship f between x and y in such a way that,
given a new (unseen) observation xnew, its class ynew can be
predicted by f(xnew) as accurately and precisely as possible, that is,
with the smallest possible discrepancy.
DZ Ý Data Science MMW 2018 October 10, 2018 15 / 127
16. Basic Remarks on Classification
While some points clearly belong to one of the classes, there are other
points that are either strangers in a foreign land, or are positioned in
such a way that no automatic classification rule can clearly determine
their class membership.
One can construct a classification rule that puts all the points in their
corresponding classes. Such a rule would prove disastrous in
classifying new observations not present in the current collection of
observations.
Indeed, we have a collection of pairs (xi, yi) of observations coming
from some unknown distribution P(x, y).
DZ Ý Data Science MMW 2018 October 10, 2018 16 / 127
17. Basic Remarks on Classification
Finding an automatic classification rule that achieves the absolute
very best on the present data is not enough since infinitely many more
observations can be generated by P(x, y) for which good classification
will be required.
Even the universally best classifier will make mistakes.
Of all the functions in $\mathcal{Y}^{\mathcal{X}}$, it is reasonable to assume that there is a
function f* that maps any $x \in \mathcal{X}$ to its corresponding $y \in \mathcal{Y}$, i.e.,
$$f^* : \mathcal{X} \to \mathcal{Y}, \quad x \mapsto f^*(x),$$
with the minimum number of mistakes.
DZ Ý Data Science MMW 2018 October 10, 2018 17 / 127
18. Theoretical Risk Minimization
Let f denote any generic function mapping an element x of X to its
corresponding image f(x) in Y.
Each time x is drawn from P(x), the disagreement between the image
f(x) and the true image y is called the loss, denoted by ℓ(y, f(x)).
The expected value of this loss function with respect to the
distribution P(x, y) is called the risk functional of f. Generically, we
shall denote the risk functional of f by R(f), so that
$$R(f) = E[\ell(Y, f(X))] = \int \ell(y, f(x))\, dP(x, y).$$
The best function f* over the space $\mathcal{Y}^{\mathcal{X}}$ of all measurable functions
from $\mathcal{X}$ to $\mathcal{Y}$ is therefore
$$f^* = \arg\inf_{f} R(f),$$
so that
$$R(f^*) = R^* = \inf_{f} R(f).$$
DZ Ý Data Science MMW 2018 October 10, 2018 18 / 127
19. On the need to reduce the search space
Unfortunately, f∗ can only be found if P(x, y) is known. Therefore,
since we do not know P(x, y) in practice, it is hopeless to determine
f∗.
Besides, trying to find f∗ without the knowledge of P(x, y) implies
having to search the infinite dimensional function space YX of all
mappings from X to Y, which is an ill-posed and computationally
nasty problem.
Throughout this lecture, we will seek to solve the more reasonable
problem of choosing, from a function space $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$, the one function
$f \in \mathcal{F}$ that best estimates the dependencies between x and y.
It is therefore important to define what is meant by best estimates.
For that, the concepts of loss function and risk functional need to be
defined.
DZ Ý Data Science MMW 2018 October 10, 2018 19 / 127
20. Loss and Risk in Pattern Recognition
For classification/pattern recognition, the so-called 0-1 loss function
defined below is used. More specifically,
$$\ell(y, f(x)) = 1_{\{Y \neq f(X)\}} = \begin{cases} 0 & \text{if } y = f(x), \\ 1 & \text{if } y \neq f(x). \end{cases} \qquad (1)$$
The corresponding risk functional is
$$R(f) = \int \ell(y, f(x))\, dP(x, y) = E\big[1_{\{Y \neq f(X)\}}\big] = \Pr_{(X,Y)\sim P}[Y \neq f(X)].$$
The minimizer of the 0-1 risk functional over all possible classifiers is the
so-called Bayes classifier, which we shall denote here by f*, given by
$$f^* = \arg\inf_{f}\ \Pr_{(X,Y)\sim P}[Y \neq f(X)].$$
Specifically, the Bayes classifier f* is given by the maximizer of the posterior probability of
class membership, namely
$$f^*(x) = \arg\max_{y \in \mathcal{Y}}\ \Pr[Y = y \mid x].$$
DZ Ý Data Science MMW 2018 October 10, 2018 20 / 127
21. Bayes Learner for known situations
If $p(x\mid y=+1) = \mathrm{MVN}(x; \mu_{+1}, \Sigma)$ and $p(x\mid y=-1) = \mathrm{MVN}(x; \mu_{-1}, \Sigma)$, the
Bayes classifier f*, the classifier that achieves the Bayes risk, coincides
with the population Linear Discriminant Analysis (LDA) classifier $f_{\mathrm{LDA}}$, which, for
any new point x, yields the predicted class
$$f^*(x) = f_{\mathrm{LDA}}(x) = \mathrm{sign}\big(\beta_0 + \beta^\top x\big),$$
where
$$\beta = \Sigma^{-1}(\mu_{+1} - \mu_{-1}),$$
and
$$\beta_0 = -\frac{1}{2}(\mu_{+1} + \mu_{-1})^\top \Sigma^{-1}(\mu_{+1} - \mu_{-1}) + \log\frac{\pi_{+1}}{\pi_{-1}},$$
with $\pi_{+1} = \Pr[Y = +1]$ and $\pi_{-1} = 1 - \pi_{+1}$ representing the prior
probabilities of class membership.
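A minimal sketch (not from the slides) of this population LDA rule in R, with illustrative (assumed) class means, common covariance, and priors:
## Sketch: population LDA / Bayes classifier under Gaussian class-conditionals
mu_pos <- c(2, 1); mu_neg <- c(0, 0)           # class means (assumed)
Sigma  <- matrix(c(1, 0.3, 0.3, 1), 2, 2)      # common covariance (assumed)
pi_pos <- 0.5; pi_neg <- 1 - pi_pos            # class priors (assumed)
beta  <- solve(Sigma, mu_pos - mu_neg)
beta0 <- -0.5 * sum((mu_pos + mu_neg) * beta) + log(pi_pos / pi_neg)
f_lda <- function(x) sign(beta0 + sum(beta * x))   # predicted class in {-1, +1}
f_lda(c(1.5, 0.5))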
DZ Ý Data Science MMW 2018 October 10, 2018 21 / 127
22. Bayes Risk for known situations
Bayes Risk in Binary Classification under Gaussian Class-Conditional
Densities with common covariance matrix: Let $x = (x_1, x_2, \cdots, x_p)^\top$ be a
p-dimensional vector coming from either class +1 or class −1. Let f be a
function (classifier) that seeks to map x to $y \in \{-1, +1\}$ as accurately as
possible. Let $R^* = \min_f \{\Pr[f(X) \neq Y]\}$ be the Bayes risk, i.e. the
smallest error rate among all possible f. If $p(x\mid y=+1) = \mathrm{MVN}(x; \mu_{+1}, \Sigma)$
and $p(x\mid y=-1) = \mathrm{MVN}(x; \mu_{-1}, \Sigma)$, then
$$R^* = R(f^*) = \Phi\!\left(-\frac{\sqrt{\Delta}}{2}\right) = \int_{-\infty}^{-\sqrt{\Delta}/2} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz,$$
with
$$\Delta = (\mu_{+1} - \mu_{-1})^\top \Sigma^{-1}(\mu_{+1} - \mu_{-1}).$$
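A minimal sketch (not from the slides) evaluating this Bayes risk in R, continuing the illustrative parameters assumed above:
## Sketch: Bayes risk R* = Phi(-sqrt(Delta)/2) under the Gaussian LDA setting
mu_pos <- c(2, 1); mu_neg <- c(0, 0)
Sigma  <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
Delta  <- drop(t(mu_pos - mu_neg) %*% solve(Sigma) %*% (mu_pos - mu_neg))  # squared Mahalanobis distance
pnorm(-sqrt(Delta) / 2)    # the Bayes risk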
DZ Ý Data Science MMW 2018 October 10, 2018 22 / 127
23. Loss Functions for Classification
With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x))
Zero-one (0/1) loss: $\ell(y, f(x)) = 1(y \neq f(x)) = 1(yh(x) < 0)$
Hinge loss: $\ell(y, f(x)) = \max(1 - yh(x), 0) = (1 - yh(x))_+$
Logistic loss: $\ell(y, f(x)) = \log(1 + \exp(-yh(x)))$
Exponential loss: $\ell(y, f(x)) = \exp(-yh(x))$
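A minimal sketch (not from the slides) plotting these four losses as functions of the margin yh(x):
## Sketch: classification losses as functions of the margin m = y*h(x)
m <- seq(-3, 3, length.out = 200)
zero_one    <- as.numeric(m < 0)
hinge       <- pmax(1 - m, 0)
logistic    <- log(1 + exp(-m))
exponential <- exp(-m)
plot(m, exponential, type = "l", ylim = c(0, 4), xlab = "y h(x)", ylab = "loss", col = "orange")
lines(m, hinge, col = "blue"); lines(m, logistic, col = "red")
lines(m, zero_one, col = "black", lty = 2)
legend("topright", c("exponential", "hinge", "logistic", "zero-one"),
       col = c("orange", "blue", "red", "black"), lty = c(1, 1, 1, 2))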
DZ Ý Data Science MMW 2018 October 10, 2018 23 / 127
24. Loss Functions for Classification
With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x))
Zero-one (0/1) loss: $\ell(y, f(x)) = 1(yh(x) < 0)$
Hinge loss: $\ell(y, f(x)) = \max(1 - yh(x), 0)$
Logistic loss: $\ell(y, f(x)) = \log(1 + \exp(-yh(x)))$
Exponential loss: $\ell(y, f(x)) = \exp(-yh(x))$
[Figure: the hinge, squared, logistic, exponential, and zero-one losses plotted as functions of the margin yh(x).]
DZ Ý Data Science MMW 2018 October 10, 2018 24 / 127
25. Loss Functions for Regression
With f : X −→ IR, and f ∈ H.
ℓ1 loss: $\ell(y, f(x)) = |y - f(x)|$
ℓ2 loss: $\ell(y, f(x)) = |y - f(x)|^2$
ε-insensitive ℓ1 loss: $\ell(y, f(x)) = \big(|y - f(x)| - \varepsilon\big)_+$
ε-insensitive ℓ2 loss: $\ell(y, f(x)) = \big(|y - f(x)|^2 - \varepsilon\big)_+$
[Figure: the ε-insensitive ℓ1, ε-insensitive ℓ2, squared, and absolute losses plotted as functions of y − f(x).]
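A minimal sketch (not from the slides) plotting the regression losses as functions of the residual, with an assumed ε = 0.5:
## Sketch: regression losses as functions of r = y - f(x)
r   <- seq(-3, 3, length.out = 200)
eps <- 0.5
l1     <- abs(r)
l2     <- r^2
eps_l1 <- pmax(abs(r) - eps, 0)
eps_l2 <- pmax(r^2 - eps, 0)
plot(r, l2, type = "l", ylim = c(0, 2), xlab = "y - f(x)", ylab = "loss")
lines(r, l1, col = "blue"); lines(r, eps_l1, col = "red"); lines(r, eps_l2, col = "darkgreen")
legend("top", c("squared", "absolute", "eps-l1", "eps-l2"),
       col = c("black", "blue", "red", "darkgreen"), lty = 1)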
DZ Ý Data Science MMW 2018 October 10, 2018 25 / 127
26. Function Class in Pattern Recognition
As stated earlier, trying to find f* is hopeless. One needs to select a
function space $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$, and then choose the best estimator $f^+$ from $\mathcal{F}$,
i.e.,
$$f^+ = \arg\inf_{f \in \mathcal{F}} R(f),$$
so that
$$R(f^+) = R^+ = \inf_{f \in \mathcal{F}} R(f).$$
For the binary pattern recognition problem, one may consider finding the
best linear separating hyperplane, i.e.
$$\mathcal{F} = \Big\{ f : \mathcal{X} \to \{-1, +1\} \;\Big|\; \exists\, \alpha_0 \in \mathbb{R},\ \alpha = (\alpha_1, \cdots, \alpha_p)^\top \in \mathbb{R}^p \text{ such that } f(x) = \mathrm{sign}\big(\alpha^\top x + \alpha_0\big),\ \forall x \in \mathcal{X} \Big\}$$
DZ Ý Data Science MMW 2018 October 10, 2018 26 / 127
27. Empirical Risk Minimization
Let $\mathcal{D} = \{(X_1, Y_1), \cdots, (X_n, Y_n)\}$ be an iid sample from P(x, y).
The empirical version of the risk functional is
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{Y_i \neq f(X_i)\}}$$
We therefore seek the best function by the empirical standard,
$$\widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} 1_{\{Y_i \neq f(X_i)\}}$$
Since it is impossible to search all possible functions, it is usually
crucial to choose the ”right” function space F.
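A minimal sketch (not from the slides) of the empirical 0/1 risk of a candidate linear classifier on simulated toy data (all values assumed):
## Sketch: empirical risk (misclassification rate) of f(x) = sign(a0 + a'x)
set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)
y <- ifelse(x[, 1] + x[, 2] + rnorm(n, 0, 0.5) > 0, +1, -1)
empirical_risk <- function(a0, a, x, y) {
  pred <- sign(a0 + x %*% a)
  mean(pred != y)                 # proportion of misclassified points
}
empirical_risk(0, c(1, 1), x, y)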
DZ Ý Data Science MMW 2018 October 10, 2018 27 / 127
28. Bias-Variance Trade-Off
In traditional statistical estimation, one needs to address at the very least
issues like: (a) the bias of the estimator; (b) the variance of the
estimator; (c) the consistency of the estimator. Recall from elementary
point estimation that, if θ is the true value of the parameter to be
estimated, and $\hat\theta$ is a point estimator of θ, then one can decompose the
total error as follows:
$$\hat\theta - \theta = \underbrace{\hat\theta - E[\hat\theta]}_{\text{Estimation error}} + \underbrace{E[\hat\theta] - \theta}_{\text{Bias}} \qquad (2)$$
Under the squared error loss, one seeks $\hat\theta$ that minimizes the mean squared
error,
$$\hat\theta = \arg\min_{\theta \in \Theta} E[(\hat\theta - \theta)^2] = \arg\min_{\theta \in \Theta} \mathrm{MSE}(\hat\theta),$$
rather than trying to find the minimum variance unbiased estimator
(MVUE).
DZ Ý Data Science MMW 2018 October 10, 2018 28 / 127
29. Bias-Variance Trade-off
Clearly, the traditional so-called bias-variance decomposition of the MSE
reveals the need for bias-variance trade-off. Indeed,
$$\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = E[(\hat\theta - E[\hat\theta])^2] + (E[\hat\theta] - \theta)^2 = \text{variance} + \text{bias}^2$$
If the estimator $\hat\theta$ were to be sought from among all possible estimators of θ, then it
might make sense to hope for the MVUE. Unfortunately, and especially in
function estimation as we argued earlier, there will be some bias,
so that the error one gets has a bias component along with the variance
component in the squared error loss case. If the bias is made too small, then an
estimator with a larger variance is obtained. Similarly, a small variance will
tend to come from estimators with a relatively large bias. The best
compromise is then to trade off bias and variance, which in functional
terms translates into a trade-off between approximation error and estimation
error.
DZ Ý Data Science MMW 2018 October 10, 2018 29 / 127
30. Bias-Variance Trade-off
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h. For small values (less smoothing) the variability is too high; for large values (more smoothing) the squared bias gets large; the true risk is minimized at an optimal amount of smoothing.
DZ Ý Data Science MMW 2018 October 10, 2018 30 / 127
31. Structural risk minimization principle
Since making the estimator of the function arbitrarily complex causes the
problems mentioned earlier, the intuition for a trade-off reveals that instead
of minimizing the empirical risk Rn(f) one should do the following:
Choose a collection of function spaces {Fk : k = 1, 2, · · · }, maybe a
collection of nested spaces (increasing in size)
Minimize the empirical risk in each class
Minimize the penalized empirical risk
$$\min_{k}\ \min_{f \in \mathcal{F}_k} \Big\{ R_n(f) + \mathrm{penalty}(k, n) \Big\}$$
where penalty(k, n) gives preference to models with small estimation error.
It is important to note that penalty(k, n) measures the capacity of the
function class Fk. The widely used technique of regularization for solving
ill-posed problems is a particular instance of structural risk minimization.
DZ Ý Data Science MMW 2018 October 10, 2018 31 / 127
32. Regularization for Complexity Control
Tikhonov's Variational Approach to Regularization [Tikhonov, 1963]
Find f that minimizes the functional
$$R_n^{(\mathrm{reg})}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda\,\Omega(f)$$
where λ > 0 is some predefined constant.
Ivanov's Quasi-solution Approach to Regularization [Ivanov, 1962]
Find f that minimizes the functional
$$R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i))$$
subject to the constraint
$$\Omega(f) \leq C$$
where C > 0 is some predefined constant.
DZ Ý Data Science MMW 2018 October 10, 2018 32 / 127
33. Regularization for Complexity Control
Phillips' Residual Approach to Regularization [Phillips, 1962]
Find f that minimizes the functional
$$\Omega(f)$$
subject to the constraint
$$\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i)) \leq \mu$$
where μ > 0 is some predefined constant.
In all the above, the functional Ω(f) is called the regularization functional.
Ω(f) is defined in such a way that it controls the complexity of the
function f. For instance,
$$\Omega(f) = \|f''\|^2 = \int_a^b \big(f''(t)\big)^2\, dt$$
is a regularization functional used in spline smoothing.
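A minimal sketch (not from the slides): the smoothing spline fitted by R's stats::smooth.spline minimizes exactly such a Tikhonov-type criterion, a squared-error data-fit term plus λ∫(f''(t))² dt, with λ chosen by cross-validation; the data below are simulated purely for illustration.
## Sketch: Tikhonov-style regularization via a smoothing spline, which
## minimizes sum_i (y_i - f(x_i))^2 + lambda * integral (f''(t))^2 dt
set.seed(1)
n <- 100
x <- seq(0, 2*pi, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.3)             # noisy data, assumed for illustration
fit <- smooth.spline(x, y)                    # lambda chosen by (generalized) CV
plot(x, y); lines(x, predict(fit, x)$y, col = "red", lwd = 2)
fit$lambda                                    # the selected regularization constant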
DZ Ý Data Science MMW 2018 October 10, 2018 33 / 127
34. Support Vector Machines and the Hinge Loss
Let's consider $h(x) = w^\top x + b$, with $w \in \mathbb{R}^p$, $b \in \mathbb{R}$, and the classifier
$$f(x) = \mathrm{sign}(h(x)) = \mathrm{sign}(w^\top x + b).$$
Recall the hinge loss defined as
$$\ell(y, f(x)) = (1 - yh(x))_+ = \begin{cases} 0 & \text{if } yh(x) \geq 1 \ \ \text{(confident correct prediction)} \\ 1 - yh(x) & \text{if } yh(x) < 1 \ \ \text{(margin violation or wrong prediction)} \end{cases}$$
[Figure: the hinge loss plotted as a function of yh(x).]
DZ Ý Data Science MMW 2018 October 10, 2018 34 / 127
35. Support Vector Machines and the Hinge Loss
The Support Vector Machine classifier can be formulated as
$$\text{Minimize } \ E(w, b) = \frac{1}{n}\sum_{i=1}^{n} \big(1 - y_i(w^\top x_i + b)\big)_+ \quad \text{subject to} \quad \|w\|_2^2 < \tau,$$
which is equivalent in regularized (Lagrangian) form to
$$(\widehat{w}, \widehat{b}) = \arg\min_{w \in \mathbb{R}^p,\ b \in \mathbb{R}} \left\{ \frac{1}{n}\sum_{i=1}^{n} \big(1 - y_i(w^\top x_i + b)\big)_+ + \lambda \|w\|_2^2 \right\}$$
The SVM linear binary classification estimator is given by
$$\widehat{f}_n(x) = \mathrm{sign}(\widehat{h}(x)) = \mathrm{sign}(\widehat{w}^\top x + \widehat{b})$$
where $\widehat{w}$ and $\widehat{b}$ are estimators of w and b respectively.
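A minimal sketch (not from the slides): fitting a linear SVM with kernlab::ksvm on simulated two-class data; the data and the cost parameter C are assumptions for illustration.
## Sketch: linear SVM via the hinge-loss formulation implemented in ksvm
library(kernlab)
set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)
y <- factor(ifelse(x[, 1] + 2 * x[, 2] + rnorm(n, 0, 0.3) > 0, +1, -1))
fit <- ksvm(x, y, kernel = "vanilladot", C = 1)   # vanilladot = linear kernel
table(predict(fit, x), y)                          # training confusion matrix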
DZ Ý Data Science MMW 2018 October 10, 2018 35 / 127
36. Classification realized with Linear Boundary
SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1
Figure: Linear SVM classifier with a relatively small margin
DZ Ý Data Science MMW 2018 October 10, 2018 36 / 127
37. Classification realized with Linear Boundary
SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1
Figure: Linear SVM classifier with a relatively large margin
DZ Ý Data Science MMW 2018 October 10, 2018 37 / 127
38. SVM Learning via Quadratic Programming
When the decision boundary is nonlinear, the αi’s in the expression of
the support vector machine classifier ˆf are determined by solving the
following quadratic programming problem
$$\text{Maximize } \ E(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to
$$0 \leq \alpha_i \leq C \ \ (i = 1, \cdots, n) \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
The above formulation is an instance of the general QP
$$\text{Maximize } \ -\frac{1}{2}\alpha^\top Q\alpha + \mathbf{1}^\top \alpha$$
subject to
$$\alpha^\top y = 0 \quad \text{and} \quad \alpha_i \in [0, C], \ \forall i \in [n],$$
where $Q \in \mathbb{R}^{n \times n}$ has entries $Q_{ij} = y_i y_j K(x_i, x_j)$.
Data Science MMW 2018 October 10, 2018 38 / 127
39. SVM Learning via Quadratic Programming in R
The quadratic programming problem
$$\text{Maximize } \ -\frac{1}{2}\alpha^\top Q\alpha + \mathbf{1}^\top \alpha \quad \text{subject to} \quad \alpha^\top y = 0 \ \text{ and } \ \alpha_i \in [0, C], \ \forall i \in [n],$$
is equivalent to
$$\text{Minimize } \ \frac{1}{2}\alpha^\top Q\alpha - \mathbf{1}^\top \alpha \quad \text{subject to} \quad \alpha^\top y = 0 \ \text{ and } \ \alpha_i \in [0, C], \ \forall i \in [n],$$
which is solved with the R package kernlab via the function ipop(), whose general form is
$$\text{Minimize } \ c^\top \alpha + \frac{1}{2}\alpha^\top H\alpha \quad \text{subject to} \quad b \leq A\alpha \leq b + r \ \text{ and } \ l \leq \alpha \leq u.$$
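A minimal sketch (not from the slides) of the SVM dual solved directly with kernlab::ipop, mapping the dual into ipop's form (H = Q, c = −1, A = yᵀ, b = 0, r = 0, l = 0, u = C); the toy data and the values of C and σ are assumptions.
## Sketch: solving the SVM dual QP with ipop()
library(kernlab)
set.seed(1)
n <- 40
x <- matrix(rnorm(2 * n), n, 2)
y <- ifelse(x[, 1] + x[, 2] > 0, +1, -1)
K <- as.matrix(kernelMatrix(rbfdot(sigma = 1), x))   # kernel matrix K(x_i, x_j)
Q <- (y %*% t(y)) * K                                # Q_ij = y_i y_j K(x_i, x_j)
C <- 1
sol <- ipop(c = rep(-1, n), H = Q, A = t(y), b = 0,
            l = rep(0, n), u = rep(C, n), r = 0)
alpha <- primal(sol)                                 # the alpha_i's
sum(alpha > 1e-6)                                    # number of support vectors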
DZ Ý Data Science MMW 2018 October 10, 2018 39 / 127
40. Support Vector Machines and Kernels
As a result of the kernelization, the SVM classifier delivers, for each x,
the estimated response
$$\widehat{f}_n(x) = \mathrm{sign}\left( \sum_{j=1}^{|s|} \widehat{\alpha}_{s_j}\, y_{s_j}\, K(x_{s_j}, x) + \widehat{b} \right)$$
where $s_j \in \{1, 2, \cdots, n\}$, $s = \{s_1, s_2, \cdots, s_{|s|}\}$ and $|s| \ll n$.
The kernel $K(\cdot, \cdot)$ is a bivariate function $K : \mathcal{X}\times\mathcal{X} \longrightarrow \mathbb{R}_+$ such
that, given $x_l, x_m \in \mathcal{X}$, the value of
$$K(x_l, x_m) = \langle \Phi(x_l), \Phi(x_m)\rangle = \Phi(x_l)^\top \Phi(x_m)$$
represents the similarity between $x_l$ and $x_m$, and corresponds to an
implicit inner product in some feature space $\mathcal{Z}$ of dimension higher
than $\dim(\mathcal{X})$, where the decision boundary is conveniently a large-margin
separating hyperplane.
Trick: There is never any need in practice to explicitly manipulate
the higher-dimensional feature map $\Phi : \mathcal{X} \longrightarrow \mathcal{Z}$.
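A minimal sketch (not from the slides) of the kernel trick in kernlab: the similarity K(x_l, x_m) is computed directly, without ever forming the feature map Φ explicitly.
## Sketch: kernel matrices without explicit feature maps
library(kernlab)
set.seed(1)
x <- matrix(rnorm(10 * 3), 10, 3)          # 10 toy points in R^3 (assumed)
rbf  <- rbfdot(sigma = 0.5)                # Gaussian (RBF) kernel
poly <- polydot(degree = 2)                # polynomial kernel
K_rbf  <- kernelMatrix(rbf, x)             # 10 x 10 similarity matrix
K_poly <- kernelMatrix(poly, x)
K_rbf[1, 2]                                # similarity between points 1 and 2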
DZ Ý Data Science MMW 2018 October 10, 2018 40 / 127
41. Classification realized with Nonlinear Boundary
SVM Optimal Separating and Margin Hyperplanes
Figure: Nonlinear SVM classifier with a relatively small margin
DZ Ý Data Science MMW 2018 October 10, 2018 41 / 127
42. Interplay between the aspects of statistical learning
DZ Ý Data Science MMW 2018 October 10, 2018 42 / 127
43. Statistical Consistency
Definition: Let $\hat\theta_n$ be an estimator of some scalar quantity θ based
on an i.i.d. sample $X_1, X_2, \cdots, X_n$ from the distribution with
parameter θ. Then $\hat\theta_n$ is said to be a consistent estimator of θ if $\hat\theta_n$
converges in probability to θ, i.e.,
$$\hat\theta_n \xrightarrow[n\to\infty]{P} \theta.$$
In other words, $\hat\theta_n$ is a consistent estimator of θ if, $\forall\epsilon > 0$,
$$\lim_{n\to\infty} \Pr\big[|\hat\theta_n - \theta| > \epsilon\big] = 0.$$
It turns out that for unbiased estimators $\hat\theta_n$, consistency follows
directly from a basic probabilistic inequality like Chebyshev's
inequality. However, for biased estimators, one has to be more careful.
DZ Ý Data Science MMW 2018 October 10, 2018 43 / 127
44. A Basic Important Inequality
(Bienaymé–Chebyshev inequality) Let X be a random variable with finite
mean $\mu_X = E[X]$, i.e. $|E[X]| < +\infty$, and finite variance $\sigma_X^2 = V(X)$, i.e.,
$|V(X)| < +\infty$. Then, $\forall\epsilon > 0$,
$$\Pr\big[|X - E[X]| > \epsilon\big] \leq \frac{V(X)}{\epsilon^2}.$$
It is therefore easy to see here that, with unbiased $\hat\theta_n$, one has $E[\hat\theta_n] = \theta$,
and the consistency result is immediate. For the sake of clarity, let's recall here the
elementary weak law of large numbers.
DZ Ý Data Science MMW 2018 October 10, 2018 44 / 127
45. Weak Law of Large Numbers
Let X be a random variable with finite mean $\mu_X = E[X]$, i.e.
$|E[X]| < +\infty$, and finite variance $\sigma_X^2 = V(X)$, i.e., $|V(X)| < +\infty$. Let
$X_1, X_2, \cdots, X_n$ be a random sample of n observations drawn
independently from the distribution of X, so that for $i = 1, \cdots, n$, we
have $E[X_i] = \mu$ and $V[X_i] = \sigma^2$. Let $\bar{X}_n$ be the sample mean, i.e.,
$$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then, clearly, $E[\bar{X}_n] = \mu$, and, $\forall\epsilon > 0$,
$$\lim_{n\to\infty} \Pr\big[|\bar{X}_n - \mu| > \epsilon\big] = 0. \qquad (3)$$
This essentially expresses the fact that the empirical mean $\bar{X}_n$ converges
in probability to the theoretical mean μ in the limit of very large samples.
DZ Ý Data Science MMW 2018 October 10, 2018 45 / 127
46. Weak Law of Large Numbers
We therefore have
$$\bar{X}_n \xrightarrow[n\to\infty]{P} \mu.$$
With $\mu_{\bar{X}} = E[\bar{X}_n] = \mu$ and $\sigma^2_{\bar{X}} = \sigma^2/n$, one applies the
Bienaymé–Chebyshev inequality and gets: $\forall\epsilon > 0$,
$$\Pr\big[|\bar{X}_n - \mu| > \epsilon\big] \leq \frac{\sigma^2}{n\epsilon^2}, \qquad (4)$$
which, by inversion, is the same as
$$|\bar{X}_n - \mu| < \sqrt{\frac{1}{\delta}\cdot\frac{\sigma^2}{n}} \qquad (5)$$
with probability at least 1 − δ.
Why is all the above of any interest to statistical learning theory?
DZ Ý Data Science MMW 2018 October 10, 2018 46 / 127
47. Weak Law of Large Numbers
Why is all the above of any interest to statistical learning theory?
Equation (3) states the much-needed consistency of $\bar{X}_n$ as an
estimator of μ.
Equation (4), by showing the dependence of the bound on n and ε, helps assess
the rate at which $\bar{X}_n$ converges to μ.
Equation (5), by providing a confidence interval, helps compute bounds
on the unknown true mean μ as a function of the empirical mean $\bar{X}_n$
and the confidence level 1 − δ.
Finally, how does one go about constructing estimators with all the above
properties?
DZ Ý Data Science MMW 2018 October 10, 2018 47 / 127
48. Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
DZ Ý Data Science MMW 2018 October 10, 2018 48 / 127
49. Theoretical Aspects of Statistical Learning
For binary classification using the so-called 0/1 loss function, the
Vapnik-Chervonenkis inequality takes the form
$$\Pr\left[\sup_{f\in\mathcal{F}} \big|\widehat{R}_n(f) - R(f)\big| > \varepsilon\right] \leq 8\, S(\mathcal{F}, n)\, e^{-n\varepsilon^2/32} \qquad (6)$$
which is also expressed in terms of an expectation as
$$E\left[\sup_{f\in\mathcal{F}} \big|\widehat{R}_n(f) - R(f)\big|\right] \leq 2\sqrt{\frac{\log S(\mathcal{F}, n) + \log 2}{n}} \qquad (7)$$
The quantity $S(\mathcal{F}, n)$ plays an important role in VC theory and
will be explored in greater detail later.
Note that these bounds, including the one presented earlier in the VC
fundamental machine learning theorem, are not asymptotic bounds.
They hold for any n.
The bounds are nice and easy to use if h or $S(\mathcal{F}, n)$ is known.
Unfortunately the bound may exceed 1, making it useless.
DZ Ý Data Science MMW 2018 October 10, 2018 49 / 127
50. Components of Statistical Machine Learning
Interestingly, all those 4 components of classical estimation theory, will be
encountered again in statistical learning theory. Essentially, the 4
components of statistical learning theory consist of finding the answers to
the following questions:
(a) What are the necessary and sufficient conditions for the
consistency of a learning process based on the ERM principle? This
leads to the Theory of consistency of learning processes.
(b) How fast is the rate of convergence of the learning process? This
leads to the Nonasymptotic theory of the rate of convergence of
learning processes;
(c) How can one control the rate of convergence (the generalization
ability) of the learning process? This leads to the Theory of
controlling the generalization ability of learning processes;
(d) How can one construct algorithms that can control the
generalization ability of the learning process? This leads to the Theory of
constructing learning algorithms.
DZ Ý Data Science MMW 2018 October 10, 2018 50 / 127
51. Error Decomposition revisited
A reasoning on error decomposition and consistency of estimators, along
with rates, bounds and algorithms, applies to function spaces: indeed, the
difference between the true risk R(f_n) associated with f_n and the overall
minimum risk R* can be decomposed to explore in greater detail the
sources of error in the function estimation process:
$$R(f_n) - R^* = \underbrace{R(f_n) - R(f^+)}_{\text{Estimation error}} + \underbrace{R(f^+) - R^*}_{\text{Approximation error}} \qquad (8)$$
A reasoning similar to the bias-variance trade-off and consistency can be
made, with the added complication brought by the need to distinguish
between the true risk functional and the empirical risk functional, and also
by the need to assess both pointwise behaviors and uniform behaviors. In
a sense, one needs to generalize the decomposition and the law of large
numbers to function spaces.
DZ Ý Data Science MMW 2018 October 10, 2018 51 / 127
52. Approximation-Estimation Trade-Off
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h. For small values (less smoothing) the variability is too high; for large values (more smoothing) the squared bias gets large; the true risk is minimized at an optimal amount of smoothing.
DZ Ý Data Science MMW 2018 October 10, 2018 52 / 127
53. Consistency of the Empirical Risk Minimization principle
The ERM principle is consistent if it provides a sequence of functions
$\widehat{f}_n$, $n = 1, 2, \cdots$, for which both the expected risk $R(\widehat{f}_n)$ and the
empirical risk $R_n(\widehat{f}_n)$ converge to the minimal possible value of the
risk $R(f^+)$ in the function class under consideration, i.e.,
$$R(\widehat{f}_n) \xrightarrow[n\to\infty]{P} \inf_{f\in\mathcal{F}} R(f) = R(f^+)$$
and
$$R_n(\widehat{f}_n) \xrightarrow[n\to\infty]{P} \inf_{f\in\mathcal{F}} R(f) = R(f^+).$$
Vapnik discusses the details of this theorem at length, and extends
the exploration to include the difference between what he calls trivial
consistency and non-trivial consistency.
DZ Ý Data Science MMW 2018 October 10, 2018 53 / 127
54. Consistency of the Empirical Risk Minimization principle
To better understand consistency in function spaces, consider the
sequence of random variables
$$\xi^n = \sup_{f\in\mathcal{F}} \big( R(f) - R_n(f) \big), \qquad (9)$$
and consider studying
$$\lim_{n\to\infty} \Pr\left[\sup_{f\in\mathcal{F}} \big( R(f) - R_n(f) \big) > \varepsilon\right] = 0, \quad \forall\varepsilon > 0.$$
Vapnik shows that the sequence of the means of the random variable
ξn converges to zero as the number n of observations increases.
He also remarks that the sequence of random variables ξn converges
in probability to zero if the set of functions F, contains a finite
number m of elements. We will show that later in the case of pattern
recognition.
DZ Ý Data Science MMW 2018 October 10, 2018 54 / 127
55. Consistency of the Empirical Risk Minimization principle
It remains then to describe the properties of the set of functions F,
and probability measure P(x, y) under which the sequence of random
variables ξn converges in probability to zero.
$$\lim_{n\to\infty} \Pr\left[\ \sup_{f\in\mathcal{F}} [R(f) - R_n(f)] > \varepsilon \ \text{ or } \ \sup_{f\in\mathcal{F}} [R_n(f) - R(f)] > \varepsilon\ \right] = 0.$$
Recall that Rn(f) is the realized disagreement between classifier f
and the truth about the label y of x based on information contained
in the sample D.
It is easy to see that, for a given (fixed) function (classifier) f,
$$E[R_n(f)] = R(f). \qquad (10)$$
Note that while this pointwise unbiasedness of the empirical risk is a
good bottom-line property to have, it is not enough. More is needed,
as the comparison is against $R(f^+)$ or, even better, $R(f^*)$.
DZ Ý Data Science MMW 2018 October 10, 2018 55 / 127
56. Consistency of the Empirical Risk
Remember that the goal of statistical function estimation is to devise
a technique (strategy) that chooses from the function class F, the
one function whose true risk is as close as possible to the lowest risk
in class F.
The question arises: since one cannot calculate the true error, how
can one devise a learning strategy for choosing classifiers based on it?
Tentative answer: At least devise strategies that yield functions for
which the upper bound on the theoretical risk is as tight as possible,
so that one can make confidence statements of the form:
With probability 1 − δ over an i.i.d. draw of some sample according
to the distribution P, the expected future error rate of some classifier
is bounded by a function g(δ, error rate on sample) of δ and the error
rate on sample.
$$\Pr\Big[\text{TestError} \leq \text{TrainError} + \phi(n, \delta, \kappa(\mathcal{F}))\Big] \geq 1 - \delta$$
DZ Ý Data Science MMW 2018 October 10, 2018 56 / 127
57. Foundation Result in Statistical Learning Theory
Theorem (Vapnik and Chervonenkis, 1971): Let $\mathcal{F}$ be a class of
functions implementing some learning machines, and let $\zeta = \mathrm{VCdim}(\mathcal{F})$ be
the VC dimension of $\mathcal{F}$. Let the theoretical and the empirical risks be
defined as earlier, and consider any data distribution in the population of
interest. Then $\forall f \in \mathcal{F}$, the prediction error (theoretical risk) is bounded by
$$R(f) \leq \widehat{R}_n(f) + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}} \qquad (11)$$
with probability at least 1 − η, or
$$\Pr\left[\text{TestError} \leq \text{TrainError} + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}}\ \right] \geq 1 - \eta$$
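A minimal sketch (not from the slides): evaluating the VC bound (11) numerically, using the VC dimension of linear classifiers in R^p, ζ = p + 1, and an assumed empirical risk value.
## Sketch: the VC bound (11) as a function of n, zeta and eta
vc_bound <- function(emp_risk, n, zeta, eta = 0.05) {
  emp_risk + sqrt((zeta * (log(2 * n / zeta) + 1) - log(eta / 4)) / n)
}
vc_bound(emp_risk = 0.10, n = 10000, zeta = 11)  # p = 10 features: informative bound
vc_bound(emp_risk = 0.10, n = 30,    zeta = 11)  # exceeds 1, hence uninformative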
DZ Ý Data Science MMW 2018 October 10, 2018 57 / 127
58. Optimism of the Training Error
[Figure: E[Training Error] and E[Test Error] plotted against model complexity, illustrating the optimism of the training error.]
DZ Ý Data Science MMW 2018 October 10, 2018 58 / 127
59. Bounds on the Generalization Error
For instance, using Chebyshev's inequality and the fact that
$E[R_n(f)] = R(f)$, it is easy to see that, for a given classifier f and a sample
$\mathcal{D} = \{(x_1, y_1), \cdots, (x_n, y_n)\}$,
$$\Pr\big[|R_n(f) - R(f)| > \epsilon\big] \leq \frac{R(f)(1 - R(f))}{n\epsilon^2}.$$
To estimate the true but unknown error R(f) with a probability of at least
1 − δ, it makes sense to use inversion, i.e., set
$$\delta = \frac{R(f)(1 - R(f))}{n\epsilon^2}, \quad \text{so that} \quad \epsilon = \sqrt{\frac{R(f)(1 - R(f))}{n\delta}}.$$
Owing to the fact that $\max_{R(f)\in[0,1]} R(f)(1 - R(f)) = \frac{1}{4}$, we have
$$\sqrt{\frac{R(f)(1 - R(f))}{n\delta}} \leq \sqrt{\frac{1}{4n\delta}} = \left(\frac{1}{4n\delta}\right)^{1/2}.$$
DZ Ý Data Science MMW 2018 October 10, 2018 59 / 127
60. Bounds on the Generalization Error
Based on Chebyshev's inequality, for a given classifier f, with a
probability of at least 1 − δ, the bound on the difference between the
true risk R(f) and the empirical risk $R_n(f)$ is given by
$$|R_n(f) - R(f)| < \left(\frac{1}{4n\delta}\right)^{1/2}.$$
Recall that one of the goals of statistical learning theory is to assess
the rate of convergence of the empirical risk to the true risk, which
translates into assessing how tight the corresponding bounds on the
true risk are.
In fact, it turns out many bounds can be so loose as to become
useless. It turns out that the above Chebyshev-based bound is not a
good one, at least compared to bounds obtained using the so-called
Hoeffding inequality.
DZ Ý Data Science MMW 2018 October 10, 2018 60 / 127
61. Bounds on the Generalization Error
Theorem (Hoeffding's inequality): Let $Z_1, Z_2, \cdots, Z_n$ be a collection
of i.i.d. random variables with $Z_i \in [a, b]$. Then, $\forall\epsilon > 0$,
$$\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - E[Z]\right| > \epsilon\right] \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)$$
Corollary (Hoeffding's inequality for sample proportions): Let
$Z_1, Z_2, \cdots, Z_n$ be a collection of i.i.d. random variables from a
Bernoulli distribution with "success" probability p. Let
$p_n = \frac{1}{n}\sum_{i=1}^{n} Z_i$. Clearly, $p_n \in [0, 1]$ and $E[p_n] = p$.
Therefore, as a direct consequence of the above theorem, we have,
$\forall\epsilon > 0$,
$$\Pr\big[|p_n - p| > \epsilon\big] \leq 2\exp(-2n\epsilon^2)$$
DZ Ý Data Science MMW 2018 October 10, 2018 61 / 127
62. Bounds on the Generalization Error
So we have, $\forall\epsilon > 0$,
$$\Pr\big[|p_n - p| > \epsilon\big] \leq 2\exp(-2n\epsilon^2).$$
Now, setting $\delta = 2\exp(-2n\epsilon^2)$, it is straightforward to see that the
Hoeffding-based 1 − δ level confidence bound on the difference
between R(f) and $R_n(f)$ for a fixed classifier f is given by
$$|R_n(f) - R(f)| < \left(\frac{\ln\frac{2}{\delta}}{2n}\right)^{1/2}.$$
Which of the two bounds is tighter? Clearly, we need to find out
which of $\ln\frac{2}{\delta}$ or $\frac{1}{2\delta}$ is larger. Since $\ln\frac{2}{\delta}$ grows only logarithmically in $1/\delta$
while $\frac{1}{2\delta}$ grows linearly, Hoeffding's bound is tighter for small δ. The graphs also confirm this.
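A minimal sketch (not from the slides) comparing the two bounds numerically; the value δ = 0.05 is assumed, matching the right panel of the figure below.
## Sketch: Chebyshev-based vs Hoeffding-based bounds on |Rn(f) - R(f)|
delta <- 0.05
n     <- seq(100, 12000, by = 100)
chebyshev <- sqrt(1 / (4 * n * delta))
hoeffding <- sqrt(log(2 / delta) / (2 * n))
plot(n, chebyshev, type = "l", col = "blue", xlab = "n = sample size",
     ylab = "bound on |Rn(f) - R(f)|")
lines(n, hoeffding, col = "red")
legend("topright", c("Chebyshev", "Hoeffding"), col = c("blue", "red"), lty = 1)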
DZ Ý Data Science MMW 2018 October 10, 2018 62 / 127
63. Bounds on the Generalization Error
[Figure: Chernoff/Hoeffding versus Chebyshev bounds for proportions as functions of the sample size n, for delta = 0.01 and delta = 0.05.]
DZ Ý Data Science MMW 2018 October 10, 2018 63 / 127
64. Beyond Chernoff and Hoeffding
In all the above, we only addressed pointwise convergence of
$R_n(f)$ to R(f), i.e., for a fixed machine $f \in \mathcal{F}$, we studied the
convergence of $R_n(f)$ to R(f).
Needless to mention that pointwise convergence is of very little
use here.
A more interesting issue to address is uniform convergence. That is,
over all machines $f \in \mathcal{F}$, determine the necessary and sufficient
conditions for the convergence of
$$\sup_{f\in\mathcal{F}} |R_n(f) - R(f)|$$
to 0 in probability.
Clearly, such a study extends the Law of Large Numbers to function
spaces, thereby providing tools for the construction of bounds on the
theoretical errors of learning machines.
DZ Ý Data Science MMW 2018 October 10, 2018 64 / 127
65. Beyond Chernoff and Hoeffding
Since uniform convergence requires the consideration of the entirety
of the function space of interest, care needs to be taken regarding the
dimensionality of the function space.
Uniform convergence will prove substantially easier to handle for finite
function classes than for infinite-dimensional function spaces.
Indeed, for infinite-dimensional spaces, one will need to introduce such
concepts as the capacity of the function space, measured through
devices such as the VC dimension and covering numbers.
DZ Ý Data Science MMW 2018 October 10, 2018 65 / 127
66. Beyond Chernoff and Hoeffding
Theorem: If $R_n(f)$ and R(f) are close for all $f \in \mathcal{F}$, i.e., for some $\epsilon > 0$,
$$\sup_{f\in\mathcal{F}} |R_n(f) - R(f)| \leq \epsilon,$$
then
$$R(f_n) - R(f^+) \leq 2\epsilon.$$
Proof: Recall that we defined $f_n$ as the best function yielded by
the empirical risk $R_n(f)$ in the function class $\mathcal{F}$. Recall also that $R_n(f_n)$
can be made as small as possible, as we saw earlier. Therefore, with $f^+$
being the minimizer of the true risk in class $\mathcal{F}$, we always have
$$R_n(f^+) - R_n(f_n) \geq 0.$$
As a result,
$$R(f_n) = R(f_n) - R(f^+) + R(f^+) \leq R_n(f^+) - R_n(f_n) + R(f_n) - R(f^+) + R(f^+) \leq 2\sup_{f\in\mathcal{F}} |R(f) - R_n(f)| + R(f^+).$$
DZ Ý Data Science MMW 2018 October 10, 2018 66 / 127
67. Beyond Chernoff and Hoeffding
Proof (continued): Since $f_n$ is the empirical risk minimizer in $\mathcal{F}$ and $f^+$
the true risk minimizer in $\mathcal{F}$, we always have $R_n(f^+) - R_n(f_n) \geq 0$, and therefore
$$R(f_n) \leq R_n(f^+) - R_n(f_n) + R(f_n) - R(f^+) + R(f^+) \leq 2\sup_{f\in\mathcal{F}} |R(f) - R_n(f)| + R(f^+).$$
Consequently,
$$R(f_n) - R(f^+) \leq 2\sup_{f\in\mathcal{F}} |R(f) - R_n(f)|,$$
as required.
DZ Ý Data Science MMW 2018 October 10, 2018 67 / 127
68. Beyond Chernoff and Hoeffding
Corollary: A direct consequence of the above theorem is the following:
For a given machine $f \in \mathcal{F}$,
$$R(f) \leq R_n(f) + \left(\frac{\ln\frac{2}{\delta}}{2n}\right)^{1/2}$$
with probability at least 1 − δ, $\forall\delta > 0$.
If the function class $\mathcal{F}$ is finite, i.e.
$$\mathcal{F} = \{f_1, f_2, \cdots, f_m\}$$
where $m = |\mathcal{F}| = \#\mathcal{F}$ = the number of functions in the class $\mathcal{F}$, then it
can be shown that, for all $f \in \mathcal{F}$,
$$R(f) \leq R_n(f) + \left(\frac{\ln m + \ln\frac{2}{\delta}}{2n}\right)^{1/2}$$
with probability at least 1 − δ, $\forall\delta > 0$.
DZ Ý Data Science MMW 2018 October 10, 2018 68 / 127
69. Beyond Chernoff and Hoeffding
It can also be shown that
$$R(\widehat{f}_n) \leq R_n(f^+) + 2\left(\frac{\ln m + \ln\frac{2}{\delta}}{2n}\right)^{1/2} \qquad (12)$$
with probability at least 1 − δ, $\forall\delta > 0$, where, as before,
$$f^+ = \arg\inf_{f\in\mathcal{F}} R(f) \quad \text{and} \quad \widehat{f}_n = \arg\min_{f\in\mathcal{F}} R_n(f).$$
Equation (12) is of foundational importance, because it reveals clearly
that the size of the function class controls the uniform bound on the
crucial generalization error: indeed, as the size m of the function class
$\mathcal{F}$ increases, the best-in-class risk $R(f^+)$ decreases while the complexity
term involving ln m increases, so that the trade-off between the two is
controlled by the size m of the function class.
DZ Ý Data Science MMW 2018 October 10, 2018 69 / 127
70. Vapnik-Chervonenkis Dimension
Definition (Shattering): Let $\mathcal{X} \neq \emptyset$ be any non-empty domain. Let
$\mathcal{F} \subseteq 2^{\mathcal{X}}$ be any non-empty class of concepts (subsets of $\mathcal{X}$). Let $S \subseteq \mathcal{X}$ be
any finite subset of the domain $\mathcal{X}$. Then S is said to be shattered by $\mathcal{F}$ iff
$$\{S \cap f \mid f \in \mathcal{F}\} = 2^S.$$
In other words, $\mathcal{F}$ shatters S if any subset of S can be obtained by
intersecting S with some set from $\mathcal{F}$.
Example: A class $\mathcal{F}$ of classifiers is said to shatter a set
$\{x_1, x_2, \cdots, x_n\}$ of n points if, for any possible configuration of labels
$y_1, y_2, \cdots, y_n$, we can find a classifier $f \in \mathcal{F}$ that reproduces those
labels.
DZ Ý Data Science MMW 2018 October 10, 2018 70 / 127
71. Vapnik-Chervonenkis Dimension
Definition (VC dimension): Let $\mathcal{X} \neq \emptyset$ be any non-empty learning
domain. Let $\mathcal{F} \subseteq 2^{\mathcal{X}}$ be any non-empty class of concepts (subsets of $\mathcal{X}$).
The VC dimension of $\mathcal{F}$ is the cardinality of the largest finite set
$S \subseteq \mathcal{X}$ that is shattered by $\mathcal{F}$, i.e.
$$\mathrm{VCdim}(\mathcal{F}) := \max\big\{|S| : S \subseteq \mathcal{X} \text{ is shattered by } \mathcal{F}\big\}.$$
Note: If arbitrarily large finite sets are shattered by $\mathcal{F}$, i.e. if no finite
bound can be placed on the cardinality of shattered sets, then
$\mathrm{VCdim}(\mathcal{F}) = \infty$.
Example: The VC dimension of a class $\mathcal{F}$ of classifiers is the
largest number of points that $\mathcal{F}$ can shatter.
DZ Ý Data Science MMW 2018 October 10, 2018 71 / 127
72. Vapnik-Chervonenkis Dimension
Remarks: If $\mathrm{VCdim}(\mathcal{F}) = d$, then there exists a finite set $S \subseteq \mathcal{X}$
such that $|S| = d$ and S is shattered by $\mathcal{F}$. Importantly, every set
$S \subseteq \mathcal{X}$ such that $|S| > d$ is not shattered by $\mathcal{F}$. Clearly, we do not
expect to learn anything until we have at least d training points.
Intuitively, this means that an infinite VC dimension is not desirable,
as it could imply the impossibility of learning the concept underlying any
data from the population under consideration. However, a finite VC
dimension does not guarantee the learnability of the concept
underlying the data either.
Fact: Let $\mathcal{F}$ be any finite function (concept) class. Since it
requires $2^d$ distinct concepts to shatter a set of cardinality d, no set of
cardinality greater than $\log_2|\mathcal{F}|$ can be shattered. Therefore, $\log_2|\mathcal{F}|$ is
always an upper bound on the VC dimension of a finite concept class.
DZ Ý Data Science MMW 2018 October 10, 2018 72 / 127
73. Vapnik-Chervonenkis Dimension
To gain insights into the central concept of VC dimension, we herein
consider a few examples of practical interest for which the VC
dimension can be found.
VC dimension of the space of separating hyperplanes: Let
X = Rp be the domain for the binary Y ∈ {−1, +1} classification
task, and consider using hyperplanes to separate the points of X. Let
F denote the class of all such separating hyperplanes. Then,
V Cdim(F) = p + 1
Intuitively, the following pictures for the case of X = R2 help see why
the VC dimension is p + 1.
DZ Ý Data Science MMW 2018 October 10, 2018 73 / 127
74. Foundation Result in Statistical Learning Theory
Theorem (Vapnik and Chervonenkis, 1971): Let $\mathcal{F}$ be a class of
functions implementing some learning machines, and let $\zeta = \mathrm{VCdim}(\mathcal{F})$ be
the VC dimension of $\mathcal{F}$. Let the theoretical and the empirical risks be
defined as earlier, and consider any data distribution in the population of
interest. Then $\forall f \in \mathcal{F}$, the prediction error (theoretical risk) is bounded by
$$R(f) \leq \widehat{R}_n(f) + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}} \qquad (13)$$
with probability at least 1 − η, or
$$\Pr\left[\text{TestError} \leq \text{TrainError} + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}}\ \right] \geq 1 - \eta$$
DZ Ý Data Science MMW 2018 October 10, 2018 74 / 127
75. Confidence Interval for a proportion
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 95% confidence intervals for p built from 100 simulated samples; here 98 of the 100 intervals contain p.]
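A minimal sketch (not from the slides) reproducing this kind of coverage experiment; p, n, the confidence level, and the number of replications are all assumed values.
## Sketch: empirical coverage of the normal-approximation CI for a proportion
set.seed(1)
p <- 0.4; n <- 200; alpha <- 0.05; B <- 100
covered <- logical(B)
for (b in 1:B) {
  phat <- mean(rbinom(n, 1, p))
  half <- qnorm(1 - alpha / 2) * sqrt(phat * (1 - phat) / n)
  covered[b] <- (p >= phat - half) && (p <= phat + half)
}
mean(covered)   # empirical coverage, close to the nominal 95%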
DZ Ý Data Science MMW 2018 October 10, 2018 75 / 127
76. Confidence Interval for a proportion
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 95% confidence intervals for p built from 100 simulated samples; here 94 of the 100 intervals contain p.]
DZ Ý Data Science MMW 2018 October 10, 2018 76 / 127
77. Confidence Interval for a proportion
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 90% confidence intervals for p built from 100 simulated samples; here 92 of the 100 intervals contain p.]
DZ Ý Data Science MMW 2018 October 10, 2018 77 / 127
78. Confidence Interval for a population mean
$$\mu \in \left(\bar{x} - z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \ \bar{x} + z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 95% confidence intervals for μ built from 100 simulated samples; here 98 of the 100 intervals contain μ.]
DZ Ý Data Science MMW 2018 October 10, 2018 78 / 127
79. Confidence Interval for a population mean
$$\mu \in \left(\bar{x} - z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \ \bar{x} + z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 85% confidence intervals for μ built from 100 simulated samples; here 90 of the 100 intervals contain μ.]
DZ Ý Data Science MMW 2018 October 10, 2018 79 / 127
80. Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
DZ Ý Data Science MMW 2018 October 10, 2018 80 / 127
81. VC Bound for Separating Hyperplanes
Let $\mathcal{L}$ represent the function class of linear binary classifiers in q dimensions, i.e.
$$\mathcal{L} = \left\{ f : \exists\, w \in \mathbb{R}^q,\ w_0 \in \mathbb{R},\ f(x) = \mathrm{sign}(w^\top x + w_0),\ \forall x \in \mathcal{X} \right\},$$
then $\mathrm{VCdim}(\mathcal{L}) = h = q + 1$.
With labels taken from {−1, +1}, and using the 0/1 loss function, we
have the fundamental theorem of Vapnik and Chervonenkis, namely:
for every $f \in \mathcal{L}$ and n > h, with probability at least 1 − η, we have
$$R(f) \leq R_n(f) + \sqrt{\frac{h\left(\log\frac{2n}{h} + 1\right) + \log\frac{4}{\eta}}{n}}$$
The above result holds true for LDA.
Data Science MMW 2018 October 10, 2018 81 / 127
82. Appeal of the VC Bound
Note: One of the greatest appeals of the VC bound is that, though
applicable to function classes of infinite dimension, it preserves the
same intuitive form as the bound derived for finite dimensional F.
Essentially, using the VC dimension concept, the number L of
possible labeling configurations obtainable from $\mathcal{F}$ with
$\mathrm{VCdim}(\mathcal{F}) = \zeta$ over n points satisfies
$$L \leq \left(\frac{en}{\zeta}\right)^{\zeta}. \qquad (14)$$
The VC bound is essentially obtained by replacing $\log|\mathcal{F}|$ with $\log L$ in the
expression of the risk bound for finite $\mathcal{F}$.
The most important part of the above theorem is the fact that the
generalization ability of a learning machine depends on both the
empirical risk and the complexity of the class of functions used, which
is measured here by the VC dimension (Vapnik and Chervonenkis, 1971).
DZ Ý Data Science MMW 2018 October 10, 2018 82 / 127
83. Appeal of the VC Bound
Also, the bounds offered here are distribution-free, since no
assumption is made about the distribution of the population.
The details of this important result will be discussed again in chapters
6 and 7, where we will present other measures of the capacity of a
class of functions.
Remark: From the expression of the VC bound, it is clear that an
intuitively appealing way to improve the predictive performance
(reduce prediction error) of a class of machines is to achieve a
trade-off (compromise) between small VC dimension and
minimization of the empirical risk.
At first, it may seem as if the VC dimension acts in a way similar to
the number of parameters, since it serves as a measure of the
complexity of $\mathcal{F}$. In this spirit, the following is a possible guiding
principle.
DZ Ý Data Science MMW 2018 October 10, 2018 83 / 127
84. Appeal of the VC Bound
At first, it may seem as if the VC dimension acts in a way similar
to the number of parameters, since it serves as a measure of the
complexity of $\mathcal{F}$. In this spirit, the following is a possible guiding
principle.
Intuition: One should seek to construct a classifier that
achieves the best trade-off (balance, compromise) between the
complexity of the function class, measured by the VC dimension, and the fit
to the training data, measured by the empirical risk.
Now equipped with this sound theoretical foundation, one can
then go on to the implementation of various learning machines.
We shall use R to discover some of the most commonly used learning
machines.
DZ Ý Data Science MMW 2018 October 10, 2018 84 / 127
88. Motivating Example Regression Analysis
Consider the univariate function $f \in C([0, 2\pi])$ given by
$$f(x) = \frac{\pi}{2}\,x + \frac{3\pi}{4}\cos\!\left(\frac{\pi}{2}(1 + x)\right) \qquad (15)$$
Simulate an artificial iid data set $\mathcal{D} = \{(x_i, y_i),\ i = 1, \cdots, n\}$, with
n = 99 and σ = π/3:
$x_i \in [0, 2\pi]$ drawn deterministically and equally spaced
$Y_i = f(x_i) + \varepsilon_i$ where $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
The R code is
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
n <- 99
x <- seq(0, 2*pi, length=n)
y <- f(x) + rnorm(n, 0, pi/3)
DZ Ý Data Science MMW 2018 October 10, 2018 88 / 127
89. Motivating Example Regression Analysis
Noisy data generated with function (15):
[Figure: scatter plot of the n = 99 noisy (x, y) pairs for x ∈ [0, 2π].]
Question: What is the best hypothesis space to learn the underlying
function?
DZ Ý Data Science MMW 2018 October 10, 2018 89 / 127
90. Bias-Variance Tradeoff in Action
[Figure: fits to the simulated data illustrating the bias-variance trade-off: (a) an underfit and (b) an optimal fit.]
DZ Ý Data Science MMW 2018 October 10, 2018 90 / 127
91. Introduction to Regression Analysis
We have $x_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$, and the data set
$$\mathcal{D} = \{(x_1, Y_1), (x_2, Y_2), \cdots, (x_n, Y_n)\}$$
We assume that the response variable $Y_i$ is related to the explanatory
vector $x_i$ through a function f via the model
$$Y_i = f(x_i) + \xi_i, \quad i = 1, \cdots, n \qquad (16)$$
The explanatory vectors $x_i$ are fixed (non-random)
The regression function $f : \mathbb{R}^p \to \mathbb{R}$ is unknown
The error terms $\xi_i$ are iid Gaussian, i.e. $\xi_i \overset{iid}{\sim} N(0, \sigma^2)$
Goal: We seek to estimate the function f using the data in D.
DZ Ý Data Science MMW 2018 October 10, 2018 91 / 127
92. Formulation of the regression problem
Let X and Y be two random variables such that
$$E[Y] = \mu \quad \text{and} \quad E[Y^2] < \infty.$$
Goal: Find the best predictor f(X) of Y given X.
Important Questions
How does one define ”best”?
Is the very best attainable in practice?
What does the function f look like? (Function class)
How do we select a candidate from the chosen class of functions?
How hard is it computationally to find the desired function?
DZ Ý Data Science MMW 2018 October 10, 2018 92 / 127
93. Loss functions
1 When f(X) is used to predict Y , a loss is incurred.
Question: How is such a loss quantified?
Answer: Define a suitable loss function.
2 Common loss functions in regression
Squared error loss or (ℓ2) loss
ℓ(Y, f(X)) = (Y − f(X))2
ℓ2 is by far the most used (prevalent) because of its differentiability.
Unfortunately, not very robust to outliers.
Absolute error loss or (ℓ1) loss
ℓ(Y, f(X)) = |Y − f(X)|
ℓ1 is more robust to outliers, but not differentiable at zero.
3 Note that ℓ(Y, f(X)) is a random variable.
DZ Ý Data Science MMW 2018 October 10, 2018 93 / 127
94. Risk Functionals and Cost Functions
1 Definition of a risk functional:
$$R(f) = E[\ell(Y, f(X))] = \int_{\mathcal{X}\times\mathcal{Y}} \ell(y, f(x))\, p_{XY}(x, y)\, dx\, dy$$
R(f) is the expected loss over all pairs of the cross space $\mathcal{X}\times\mathcal{Y}$.
2 Ideally, one seeks the best out of all possible functions, i.e.,
$$f^*(X) = \arg\min_{f} R(f) = \arg\min_{f} E[\ell(Y, f(X))]$$
$f^*(\cdot)$ is such that
$$R^* = R(f^*) = \min_{f} R(f)$$
3 This ideal function cannot be found in practice, because the fact that
the distributions are unknown makes it impossible to form an
expression for R(f).
DZ Ý Data Science MMW 2018 October 10, 2018 94 / 127
95. Cost Functions and Risk Functionals
Theorem: Under regularity conditions,
$$f^*(X) = E[Y \mid X] = \arg\min_{f} E[(Y - f(X))^2]$$
Under the squared error loss, the optimal function f* that yields the
best prediction of Y given X is none other than the conditional expectation of
Y given X.
Since we know neither $p_{XY}(x, y)$ nor $p_X(x)$, the conditional
expectation
$$E[Y \mid X = x] = \int_{\mathcal{Y}} y\, p_{Y|X}(y \mid x)\, dy = \int_{\mathcal{Y}} y\, \frac{p_{XY}(x, y)}{p_X(x)}\, dy$$
cannot be directly computed.
DZ Ý Data Science MMW 2018 October 10, 2018 95 / 127
96. Empirical Risk Minimization
Let $\mathcal{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$ represent an iid sample.
The empirical version of the risk functional is
$$\widehat{R}(f) = \widehat{\mathrm{MSE}}(f) = \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2$$
It turns out that $\widehat{R}(f)$ provides an unbiased estimator of R(f).
We therefore seek the best by the empirical standard,
$$\widehat{f}^*(X) = \arg\min_{f} \widehat{\mathrm{MSE}}(f) = \arg\min_{f} \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2$$
Since it is impossible to search all possible functions, it is usually
crucial to choose the "right" function space.
DZ Ý Data Science MMW 2018 October 10, 2018 96 / 127
97. Function spaces
For the function estimation task for instance, one could assume that the
input space X is a closed and bounded interval of IR, i.e. X = [a, b], and
then consider estimating the dependencies between x and y from within
the space $\mathcal{F}$ of all bounded functions on $\mathcal{X} = [a, b]$, i.e.,
$$\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R} \mid \exists B \geq 0 \text{ such that } |f(x)| \leq B \text{ for all } x \in \mathcal{X}\}.$$
One could even be more specific and make the functions of the above F
continuous, so that the space to search becomes
F = {f : [a, b] → IR| f is continuous} = C([a, b]),
which is the well-known space of all continuous functions on a closed and
bounded interval [a, b]. This is indeed a very important function space.
DZ Ý Data Science MMW 2018 October 10, 2018 97 / 127
98. Space of Univariate Polynomials
In fact, polynomial regression consists of searching a function space
that is a subspace of C([a, b]). In other words, when we are doing the very
common polynomial regression, we are searching the space
$$\mathcal{P}([a, b]) = \{f \in C([a, b]) \mid f \text{ is a polynomial with real coefficients}\}.$$
It is interesting to note that Weierstrass proved that $\mathcal{P}([a, b])$ is dense in
C([a, b]). One typically considers the space of all polynomials of some degree p, i.e.,
$$\mathcal{F} = \mathcal{P}^p([a, b]) = \left\{ f \in C([a, b]) \ \middle|\ \exists\, \beta_0, \beta_1, \cdots, \beta_p \in \mathbb{R} \text{ such that } f(x) = \sum_{j=0}^{p} \beta_j x^j, \ \forall x \in [a, b] \right\}$$
DZ Ý Data Science MMW 2018 October 10, 2018 98 / 127
99. Empirical Risk Minimization in F
Having chosen a class $\mathcal{F}$ of functions, we can now seek
$$\widehat{f}(X) = \arg\min_{f\in\mathcal{F}} \widehat{\mathrm{MSE}}(f) = \arg\min_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2$$
We are seeking the best function in the function space chosen.
For instance, if the function space is the space of all polynomials of
degree p on some interval [a, b], finding $\widehat{f}$ boils down to estimating the
coefficients of the polynomial using the data, namely
$$\widehat{f}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x + \widehat{\beta}_2 x^2 + \cdots + \widehat{\beta}_p x^p$$
where, using $\beta = (\beta_0, \beta_1, \cdots, \beta_p)^\top$, we have
$$\widehat{\beta} = \arg\min_{\beta\in\mathbb{R}^{p+1}} \frac{1}{n}\sum_{i=1}^{n}\left( Y_i - \sum_{j=0}^{p} \beta_j x_i^j \right)^2$$
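A minimal sketch (not from the slides) of this polynomial ERM step in R, reusing the simulated data from slide 88 and an assumed degree p = 6; least squares via lm() is exactly empirical risk minimization under the squared error loss.
## Sketch: ERM over degree-p polynomials using lm()
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
set.seed(1)
n <- 99
x <- seq(0, 2*pi, length = n)
y <- f(x) + rnorm(n, 0, pi/3)
p   <- 6                                         # polynomial degree (assumed)
fit <- lm(y ~ poly(x, degree = p, raw = TRUE))    # least squares = ERM under l2 loss
mean(residuals(fit)^2)                            # empirical risk (training MSE)
plot(x, y); lines(x, fitted(fit), col = "red", lwd = 2)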
DZ Ý Data Science MMW 2018 October 10, 2018 99 / 127
100. Important Aspects of Statistical Learning
It is very tempting at first to use the data at hand to find/build the $\widehat{f}$
that makes $\widehat{\mathrm{MSE}}(\widehat{f})$ the smallest. For instance, the higher the value
of p, the smaller $\widehat{\mathrm{MSE}}(\widehat{f}(\cdot))$ will get.
The estimate $\widehat{\beta} = (\widehat{\beta}_0, \widehat{\beta}_1, \cdots, \widehat{\beta}_p)^\top$ of $\beta = (\beta_0, \beta_1, \cdots, \beta_p)^\top$ is a
random variable, and as a result the estimate
$\widehat{f}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x + \widehat{\beta}_2 x^2 + \cdots + \widehat{\beta}_p x^p$ of f(x) is also a random
variable.
Since $\widehat{f}(x)$ is a random variable, we must compute important aspects
like its bias $B[\widehat{f}(x)] = E[\widehat{f}(x)] - f(x)$ and its variance $V[\widehat{f}(x)]$.
We have a dilemma: If we make $\widehat{f}$ complex (large p), we make the
bias small but the variance is increased. If we make $\widehat{f}$ simple (small
p), we make the bias large but the variance is decreased.
Most of Modern Statistical Learning is rich with model selection
techniques that seek to achieve a trade-off between bias and variance
to get the optimal model. Principle of parsimony (sparsity),
Ockham’s razor principle.
DZ Ý Data Science MMW 2018 October 10, 2018 100 / 127
101. Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
DZ Ý Data Science MMW 2018 October 10, 2018 101 / 127
102. Theoretical Aspects of Statistical Regression Learning
Just as we have a VC bound for classification, there is one for
regression, i.e. when $\mathcal{Y} = \mathbb{R}$ and
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} |y_i - f(x_i)|^2 \ = \text{ empirical squared error loss}.$$
Indeed, for every $f \in \mathcal{F}$, with probability at least 1 − η, we have
$$R(f) \leq \frac{\widehat{R}_n(f)}{\big(1 - c\sqrt{\delta}\big)_+}$$
where
$$\delta = \frac{a}{n}\left( v + v\log\frac{bn}{v} - \log\frac{\eta}{4} \right).$$
Note once again, as before, that these bounds are not asymptotic.
Unfortunately these bounds are known to be very loose in practice.
DZ Ý Data Science MMW 2018 October 10, 2018 102 / 127
103. The pitfalls of memorization and overfitting
The trouble, or limitation, with naively using a criterion on the whole
sample lies in the fact that, given a sample $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, the
function $\widehat{f}_{\mathrm{memory}}$ defined by
$$\widehat{f}_{\mathrm{memory}}(x_i) = y_i, \quad i = 1, \cdots, n$$
always achieves the best apparent performance, since $\widehat{\mathrm{MSE}}(\widehat{f}_{\mathrm{memory}}) = 0$, which is the
minimum achievable.
Where does the limitation of $\widehat{f}_{\mathrm{memory}}$ come from? Well, $\widehat{f}_{\mathrm{memory}}$
does not really learn the dependency between X and Y. While it may
capture some of it, it also grabs a lot of the noise in the data, and ends
up overfitting the data. As a result of not really learning the structure of the
relationship between X and Y and merely memorizing the present
sample values, $\widehat{f}_{\mathrm{memory}}$ will predict very poorly when presented
with observations that were not in the sample.
DZ Ý Data Science MMW 2018 October 10, 2018 103 / 127
104. Training Set Test Set Split
Splitting the data into training set and test set: It makes
sense to judge models (functions) not on how they perform on in-sample
observations, but instead on how they perform on out-of-sample
cases. Given a collection $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$ of pairs,
Randomly split $\mathcal{D}$ into a training set of size $n_{tr}$ and a test set of size
$n_{te}$, such that $n_{tr} + n_{te} = n$
Training set:
$$\mathcal{T}r = \Big\{ \big(x_1^{(tr)}, y_1^{(tr)}\big), \big(x_2^{(tr)}, y_2^{(tr)}\big), \cdots, \big(x_{n_{tr}}^{(tr)}, y_{n_{tr}}^{(tr)}\big) \Big\}$$
Test set:
$$\mathcal{T}e = \Big\{ \big(x_1^{(te)}, y_1^{(te)}\big), \big(x_2^{(te)}, y_2^{(te)}\big), \cdots, \big(x_{n_{te}}^{(te)}, y_{n_{te}}^{(te)}\big) \Big\}$$
DZ Ý Data Science MMW 2018 October 10, 2018 104 / 127
105. Training Set Test Set Split
For each function class F (linear models, nonparametrics, etc ...)
Find the best in its class based on the training set Tr
For all the estimated functions ˆf1, ˆf2, · · · , ˆfm, compute the training
error
MSETr( ˆfj) = (1/ntr) Σ_{i=1}^{ntr} ( y_i^(tr) − ˆfj(x_i^(tr)) )²
For all the estimated functions ˆf1, ˆf2, · · · , ˆfm, compute the test error
MSETe( ˆfj) = (1/nte) Σ_{i=1}^{nte} ( y_i^(te) − ˆfj(x_i^(te)) )²
Compute the averages of both MSETr and MSETe over many random
splits of the data, and tabulate (if necessary) those averages.
Select ˆfj∗ such that
mean[MSETe( ˆfj∗ )] < mean[MSETe( ˆfj)], j = 1, 2, · · · , m, j ≠ j∗
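A hedged sketch of the whole procedure, with two polynomial degrees standing in for the function classes (any other estimators could be slotted into the inner loop in the same way; the data are the simulated example from before):

# Average MSE_Tr and MSE_Te over many random splits for two candidate classes
set.seed(4)
dat     <- data.frame(x = x, y = y)
degrees <- c(3, 10)                            # two illustrative function classes
nsplits <- 50
res <- array(NA, dim = c(nsplits, length(degrees), 2),
             dimnames = list(NULL, paste0("poly", degrees), c("train", "test")))
for (r in 1:nsplits) {
  idx <- sample(nrow(dat), floor(2 * nrow(dat) / 3))
  for (j in seq_along(degrees)) {
    fit <- lm(y ~ poly(x, degrees[j]), data = dat[idx, ])
    res[r, j, "train"] <- mean((dat$y[idx]  - fitted(fit))^2)
    res[r, j, "test"]  <- mean((dat$y[-idx] - predict(fit, newdata = dat[-idx, ]))^2)
  }
}
apply(res, c(2, 3), mean)    # pick the class with the smallest mean test MSE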
DZ Ý Data Science MMW 2018 October 10, 2018 105 / 127
106. Computational Comparisons
Ideally, we would like to compare the true theoretical performances
measured by the risk functional
R(f) = E[ℓ(Y, f(X))] = ∫_{X×Y} ℓ(y, f(x)) dP(x, y).   (17)
Instead, we build the estimators using other optimality criteria, and
then compare their predictive performances using the average test
error AVTE(·), namely
AVTE(f) = (1/R) Σ_{r=1}^{R} [ (1/m) Σ_{t=1}^{m} ℓ( y_{it}^(r), f_r(x_{it}^(r)) ) ],   (18)
where f_r(·) is the r-th realization of the estimator f(·) built using the
training portion of the split of D into training set and test set, and
(x_{it}^(r), y_{it}^(r)) is the t-th observation from the test set at the r-th
random replication of the split of D.
DZ Ý Data Science MMW 2018 October 10, 2018 106 / 127
107. Learning Machines when n ≪ p
Machines inherently designed to handle p larger than n problems
Classification and Regression Trees
Support Vector Machines
Relevance Vector Machines (n < 500)
Gaussian Process Learning Machines (n < 500)
k-Nearest Neighbors Learning Machines (Watch for the curse of
dimensionality)
Kernel Machines in general
Machines that cannot inherently handle p larger than n problems, but
can do so if regularized with suitable constraints (see the sketch below)
Multiple Linear Regression Models
Generalized Linear Models
Discriminant Analysis
Ensemble Learning Machines
Random Subspace Learning Ensembles (Random Forest)
Boosting and its extensions
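For instance (a sketch on simulated data; glmnet, loaded later in these slides, fits a regularized linear model even though p is ten times larger than n, where ordinary least squares would not be identifiable):

# A regularized linear model when p >> n (lasso via glmnet)
library(glmnet)
set.seed(5)
n <- 50; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))            # only three active predictors
y <- drop(X %*% beta + rnorm(n))
cvfit <- cv.glmnet(X, y, alpha = 1)             # alpha = 1: lasso (l1) constraint
coef(cvfit, s = "lambda.min")[1:6, ]            # intercept + first five coefficients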
DZ Ý Data Science MMW 2018 October 10, 2018 107 / 127
108. Motivating Example: Regression Analysis
Consider the univariate function f ∈ C([−1, +1]) given by
f(x) = −x + √2 sin(π^{3/2} x²).   (19)
Simulate an artificial iid data set D = {(xi, yi), i = 1, · · · , n}, with
n = 99 and σ = 3/10:
xi ∈ [−1, +1] drawn deterministically and equally spaced
Yi = f(xi) + εi, with εi iid∼ N(0, σ²)
The R code is
n <- 99
f <- function(x){-x + sqrt(2)*sin(pi^(3/2)*x^2)}
x <- seq(-1, +1, length=n)
y <- f(x) + rnorm(n, 0, 3/10)
DZ Ý Data Science MMW 2018 October 10, 2018 108 / 127
109. Estimation Error and Prediction Error
Figure: Predictive regression with confidence and prediction bands. Simple
orthogonal polynomial regression with both confidence bands and prediction
bands on the test set; the plot shows the data points (drawn as "1"), the fitted
curve, and the lower/upper confidence and prediction bands of f(xnew) against
xnew. The true function is f(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1].
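A figure of this kind can be produced along the following lines (a sketch; the polynomial degree 10 and the plotting choices are assumptions made for illustration, not the exact settings behind the slide):

# Orthogonal polynomial regression with confidence and prediction bands
set.seed(6)
n <- 99
f <- function(x) -x + sqrt(2)*sin(pi^(3/2)*x^2)
x <- seq(-1, 1, length = n)
y <- f(x) + rnorm(n, 0, 0.3)
fit  <- lm(y ~ poly(x, 10))                          # degree chosen for illustration
grid <- data.frame(x = seq(-1, 1, length = 200))
conf <- predict(fit, grid, interval = "confidence")
pred <- predict(fit, grid, interval = "prediction")
plot(x, y, pch = "1", col = "grey40", xlab = "xnew", ylab = "f(xnew)",
     main = "Predictive Regression with confidence and prediction bands")
lines(grid$x, conf[, "fit"], lwd = 2)                               # the fit
matlines(grid$x, conf[, c("lwr", "upr")], lty = 2, col = "blue")    # confidence bands
matlines(grid$x, pred[, c("lwr", "upr")], lty = 3, col = "red")     # prediction bands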
DZ Ý Data Science MMW 2018 October 10, 2018 109 / 127
110. Training Error and Test Error
Table: Average training error and average test error over m = 10 random splits
of n = 300 observations generated from a population with true function
f(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1]. The noise variance in this case is
σ² = 0.3². Each split has ntr = 2n/3.

                          Approximating Function Class
                          Poly     SVM      RVM      GPR
Average Training Error    0.0998   0.0335   0.0295   0.1861
Average Test Error        0.3866   0.1465   0.1481   0.1556
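The exact numbers above depend on tuning details not shown here, but the three kernel-based classes can be fit in a few lines with kernlab (loaded later in these slides); this is only a sketch with default kernels and settings, not the configuration behind the table:

# Fitting SVM, RVM and GPR regressions with kernlab (default settings)
library(kernlab)
set.seed(10)
n <- 300
x <- matrix(seq(-1, 1, length = n), ncol = 1)
y <- drop(-x + sqrt(2)*sin(pi^(3/2)*x^2)) + rnorm(n, 0, 0.3)
svm_fit <- ksvm(x, y)          # support vector regression (eps-svr by default)
rvm_fit <- rvm(x, y)           # relevance vector machine
gpr_fit <- gausspr(x, y)       # Gaussian process regression
sapply(list(SVM = svm_fit, RVM = rvm_fit, GPR = gpr_fit),
       function(m) mean((y - predict(m, x))^2))   # in-sample MSE; test MSE needs a held-out split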
DZ Ý Data Science MMW 2018 October 10, 2018 110 / 127
112. Finding Patterns in Job Sector Allocations in Europe
Example 1: Consider the following portion of observations on job sectors
distribution in Europe in the 1990s.
Agr Min Man PS Con SI Fin SPS TC
Italy 15.9 0.6 27.6 0.5 10.0 18.1 1.6 20.1 5.7
Poland 31.1 2.5 25.7 0.9 8.4 7.5 0.9 16.1 6.9
Rumania 34.7 2.1 30.1 0.6 8.7 5.9 1.3 11.7 5.0
USSR 23.7 1.4 25.8 0.6 9.2 6.1 0.5 23.6 9.3
Denmark 9.2 0.1 21.8 0.6 8.3 14.6 6.5 32.2 7.1
France 10.8 0.8 27.5 0.9 8.9 16.8 6.0 22.6 5.7
1 Can European countries be divided into meaningful groups (clusters)?
2 How many concepts? How many clusters (groups) of countries?
Analogy: Clustering in such an example can be thought of as unsupervised
classification (pattern recognition)
DZ Ý Data Science MMW 2018 October 10, 2018 112 / 127
113. Hierarchical Clustering for European Job Sector Data
One solution: Mining Job Sectors in Europe in the 1990s via Hierarchical
Clustering with Manhattan distance and ward linkage.
Figure: Cluster dendrogram of the European countries, obtained from
hclust (*, "ward") applied to dist(europe, method = "manhattan"); the
vertical axis is the dendrogram height.
How does the distance affect the
clustering?
How does the linkage affect the
clustering?
What makes a clustering
satisfactory? How does one compare
two clusterings?
Some interesting tasks:
1 Investigate different distances with same linkage
2 Investigate different linkages with same distance
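A sketch of how such a dendrogram is obtained in R, assuming europe is a data frame of the sector percentages with the countries as row names (note that newer versions of R spell this linkage "ward.D"):

# Hierarchical clustering of the European job-sector data
d  <- dist(europe, method = "manhattan")   # try "euclidean", "maximum", ... as variations
hc <- hclust(d, method = "ward.D")         # called simply "ward" in older versions of R
plot(hc, main = "Cluster Dendrogram")
rect.hclust(hc, k = 4)                     # inspect, say, a 4-cluster cut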
DZ Ý Data Science MMW 2018 October 10, 2018 113 / 127
114. Extracting Patterns of Voting in America
Example 2: Percentages of Votes given to the U. S. Republican
Presidential Candidate - 1856-1976.
X1856 X1860 X1864 X1868 X1900 X1904 X1908
Alabama NA NA NA 51.44 34.67 20.65 24.38
Arkansas NA NA NA 53.73 35.04 40.25 37.31
California 18.77 32.96 58.63 50.24 54.48 61.90 55.46
Colorado NA NA NA NA 42.04 55.27 46.88
Connecticut 53.18 53.86 51.38 51.54 56.94 58.13 59.43
Delaware 2.11 23.71 48.20 40.98 53.65 54.04 52.09
Florida NA NA NA NA 19.03 21.15 21.58
1 Can the states be grouped into clusters of republican-ness?
2 How do missing values influence the clustering? (see the sketch below)
Analogy: Again, clustering in such an example can be thought of as
unsupervised classification (pattern recognition)
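One simple way to probe the second question (a sketch: the votes.repub data shipped with the cluster package is assumed to be the same 1856–1976 table, and column-mean imputation is just one naive choice among many):

# How do missing values influence the clustering of the Republican votes data?
library(cluster)                              # provides votes.repub (with NAs)
data(votes.repub)
imp <- votes.repub
for (j in seq_along(imp))                     # naive column-mean imputation
  imp[is.na(imp[, j]), j] <- mean(imp[, j], na.rm = TRUE)
hc_imp  <- hclust(dist(imp), method = "ward.D")
keep    <- colSums(is.na(votes.repub)) == 0   # alternative: drop years with any NA
hc_comp <- hclust(dist(votes.repub[, keep]), method = "ward.D")
table(cutree(hc_imp, 5), cutree(hc_comp, 5))  # compare the two 5-cluster assignments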
DZ Ý Data Science MMW 2018 October 10, 2018 114 / 127
115. Example: Image Denoising
For an observed image of size r × c, posit the model
y = Wx + z. (20)
The original image is represented by a p × 1 vector x, which makes W a
matrix of dimension q × p, where q = rc. We therefore have
z⊤ = (z1, · · · , zq) ∈ IR^q, x⊤ = (x1, · · · , xp) ∈ IR^p, y⊤ = (y1, · · · , yq) ∈ IR^q.
DZ Ý Data Science MMW 2018 October 10, 2018 115 / 127
116. Example: Image Denoising
Expression of the solution: If E(x) = ‖y − Wx‖² + λ‖x‖₁ is our
objective function to be minimized, and ˆx is a point at which the
minimum is achieved, then we will write
ˆx = arg min_{x ∈ IR^p} { ‖y − Wx‖² + λ‖x‖₁ }.   (21)
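In practice ˆx can be computed with an off-the-shelf l1 solver; the sketch below uses glmnet, whose objective matches (21) up to its own 1/(2q) scaling of the quadratic term (so its lambda is a rescaled version of λ), on a small simulated W and y:

# Computing an l1-penalized solution xhat = argmin ||y - W x||^2 + lambda ||x||_1
library(glmnet)
set.seed(7)
q <- 200; p <- 400
W <- matrix(rnorm(q * p), q, p)
x_true <- c(rep(2, 10), rep(0, p - 10))         # a sparse signal standing in for the image
y <- drop(W %*% x_true + rnorm(q, sd = 0.5))
fit   <- glmnet(W, y, alpha = 1, intercept = FALSE, standardize = FALSE)
x_hat <- as.numeric(coef(fit, s = 0.1))[-1]     # drop the intercept entry
sum(x_hat != 0)                                 # the l1 penalty drives most entries to exactly 0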
DZ Ý Data Science MMW 2018 October 10, 2018 116 / 127
117. Example: Recommender System
Consider a system in which n customers have access to p different
products, like movies, clothing, rental cars, etc ...
A1 A2 · · · Aj · · · Ap
C1
C2
...
Ci w(i, j)
...
Cn
Table: Typical Representation of a Recommender System
The value of w(i, j) is the rating assigned to article Aj by customer Ci.
DZ Ý Data Science MMW 2018 October 10, 2018 117 / 127
118. Example: Recommender System
The main ingredient in Recommender Systems is the matrix
W =
w11  w12  · · ·  w1j  · · ·  w1p
w21  w22  · · ·  w2j  · · ·  w2p
 ⋮    ⋮          ⋮          ⋮
wi1  wi2  · · ·  wij  · · ·  wip
 ⋮    ⋮          ⋮          ⋮
wn1  wn2  · · ·  wnj  · · ·  wnp
The matrix W is typically very (and I mean very) sparse, which makes
sense because people can only consume so many articles, and there
are articles some people will never consume even if they are suggested.
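Because of this, W is usually stored in a sparse-matrix format that keeps only the observed ratings; a small sketch with the Matrix package (all indices and ratings below are made-up toy values):

# Storing the ratings matrix W sparsely: only the observed w(i, j) are kept
library(Matrix)
i <- c(1, 1, 2, 5, 7)                          # customers C_i (toy values)
j <- c(2, 9, 4, 9, 1)                          # articles  A_j
w <- c(5, 3, 4, 2, 5)                          # observed ratings w(i, j)
W <- sparseMatrix(i = i, j = j, x = w, dims = c(1000, 2000))
summary(W)                                     # just the five stored (i, j, w) triplets
print(object.size(W), units = "Kb")            # a few Kb, versus roughly 16 Mb if stored densely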
DZ Ý Data Science MMW 2018 October 10, 2018 118 / 127
119. Time Series and State Space Models
DZ Ý Data Science MMW 2018 October 10, 2018 119 / 127
120. IID Process and White Noise
(Left) White noise process (Right) IID Process.
What is the statistical model (if any) underlying the data?
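Panels like these can be generated in a couple of lines (a sketch; Gaussian iid draws are used for both series here, which is one common way such illustrations are produced):

# Simulating an iid sequence and a white-noise sequence of length 200
set.seed(8)
X <- rnorm(200)                  # iid N(0, 1) draws
W <- rnorm(200)                  # a second, independent sequence
par(mfrow = c(1, 2))
plot(ts(X)); plot(ts(W))         # y-axes labelled ts(X) and ts(W), as on the slide
par(mfrow = c(1, 1))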
DZ Ý Data Science MMW 2018 October 10, 2018 120 / 127
121. Random Walk in 1d and 2d
(Left) Random walk in 1 dimension (Right) Random Walk in 2
dimensions (plane).
What is the statistical model (if any) underlying the data?
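The two walks themselves are just cumulative sums of iid steps (a sketch):

# Random walk in 1d and in 2d via cumulative sums of iid N(0, 1) steps
set.seed(9)
X  <- cumsum(rnorm(200))                 # 1d random walk
Xt <- cumsum(rnorm(200))
Yt <- cumsum(rnorm(200))                 # an independent walk for the second coordinate
par(mfrow = c(1, 2))
plot(ts(X))                              # left: the walk against time
plot(Xt, Yt, type = "l", xlab = "Xt", ylab = "Yt")   # right: the path in the plane
par(mfrow = c(1, 1))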
DZ Ý Data Science MMW 2018 October 10, 2018 121 / 127
122. Real life Time Series: Air Passengers and Sunspots
(Left) Number of airline passengers (Right) Longstanding Sunspots
data.
What is the statistical model (if any) underlying the data?
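Both series ship with base R, so the two panels can be reproduced directly (a sketch; the exact time window shown on the slide may differ from the full built-in series):

# Two classic real-life time series available in base R
par(mfrow = c(1, 2))
plot(AirPassengers)                      # monthly international airline passengers, 1949-1960
plot(sunspots)                           # monthly sunspot numbers, 1749-1983
par(mfrow = c(1, 1))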
DZ Ý Data Science MMW 2018 October 10, 2018 122 / 127
123. Existing Computing Tools
Do the following
install.packages(’ctv’)
library(ctv)
install.views(’MachineLearning’)
install.views(’HighPerformanceComputing’)
install.views(’TimeSeries’)
install.views(’Bayesian’)
R packages for big data
library(biglm)
library(foreach)
library(glmnet)
library(kernlab)
library(randomForest)
library(ada)
library(audio)
library(rpart)
DZ Ý Data Science MMW 2018 October 10, 2018 123 / 127
124. Some Remarks and Recommendations
Applications: Sharpen your intuition and your common sense by
questioning things, reading about interesting open applied problems,
and attempting to solve as many problems as possible.
Methodology: Read and learn about the fundamentals of statistical
estimation and inference, get acquainted with the most commonly
used methods and techniques, and consistently ask yourself and
others what the natural extensions of the techniques could be.
Computation: Learn and master at least two programming languages.
I strongly recommend getting acquainted with R
http://www.r-project.org
Theory: "Nothing is more practical than a good theory" (Vladimir N.
Vapnik). When it comes to data mining, machine learning, and
predictive analytics, those who truly understand the inner workings of
algorithms and methods always solve problems better.
DZ Ý Data Science MMW 2018 October 10, 2018 124 / 127
125. Machine Learning CRAN Task View in R
Let’s visit the website where most of the R community goes
http://www.r-project.org
Let’s install some packages and get started
install.packages(’ctv’)
library(ctv)
install.views(’MachineLearning’)
install.views(’HighPerformanceComputing’)
install.views(’Bayesian’)
install.views(’Robust’)
Let’s load a couple of packages and explore
library(e1071)
library(MASS)
library(kernlab)
DZ Ý Data Science MMW 2018 October 10, 2018 125 / 127
126. Clarke, B., Fokoué, E. and Zhang, H. H. (2009). Principles and
Theory for Data Mining and Machine Learning. Springer Verlag,
New York. ISBN: 978-0-387-98134-5.
DZ Ý Data Science MMW 2018 October 10, 2018 126 / 127
127. References
Clarke, B., Fokoué, E. and Zhang, H. H. (2009). Principles and
Theory for Data Mining and Machine Learning. Springer Verlag,
New York. ISBN: 978-0-387-98134-5.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An
Introduction to Statistical Learning with Applications in R.
Springer, New York. e-ISBN: 978-1-4614-7138-7.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley. ISBN:
978-0-471-03003-4.
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory.
Springer. ISBN: 978-1-4757-3264-1.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining, Inference, and
Prediction. 2nd Edition. Springer. ISBN: 978-0-387-84858-7.
DZ Ý Data Science MMW 2018 October 10, 2018 127 / 127