This document provides an introduction to statistical machine learning and statistical learning theory. It begins by acknowledging the invitation to present and then outlines topics to be covered, including input/output spaces, loss functions, risk functionals, generalization error, and regularization. Examples of applications like handwritten digit recognition and accent recognition are presented. It discusses challenges in classification problems like imbalanced data and complex decision boundaries. The goal of statistical learning theory is to minimize theoretical risk by finding the best predictive function, while accounting for limitations like an unknown data distribution.
1. Foundations of Statistical Learning Theory
Quintessential Pillar of Modern Data Science
Ernest Fokoué
School of Mathematical Sciences
Rochester Institute of Technology
Rochester, New York, USA
Delivered by invitation of the
Statistical and Mathematical Sciences Institute (SAMSI)
Modern Mathematics Workshop (MMW 2018)
San Antonio, Texas, USA
October 10, 2018
DZ Ý Data Science MMW 2018 October 10, 2018 1 / 127
2. Acknowledgments
I wish to express my grateful thanks
and sincere gratitude to the Director
of SAMSI, Prof. Dr. David Banks,
for kindly inviting me and granting
me the golden opportunity to present
at the 2018 Modern Mathematics
Workshop in San Antonio.
I hope and pray that my modest
contribution will inspire and empower
all the attendees of my mini course.
DZ Ý Data Science MMW 2018 October 10, 2018 2 / 127
3. Basic Introduction to Statistical Machine Learning
Roadmap: This lecture will provide you with the basic elements of an
introduction to the foundational concepts of statistical machine learning.
Among other things, we’ll touch on foundational concepts such as:
Input space, output space, function space, hypothesis space, loss
function, risk functional, theoretical risk, empirical risk, Bayes risk,
training set, test set, model complexity, generalization error,
approximation error, estimation error, bounds on the generalization
error, regularization, etc.
Relevant websites
http://www.econ.upf.edu/~lugosi/mlss_slt.pdf
https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
Kernel Machines http://www.kernel-machines.org/
R Software project website: http://www.r-project.org
DZ Ý Data Science MMW 2018 October 10, 2018 3 / 127
4. Traditional Pattern Recognition Applications
Statistical Machine Learning methods and techniques have been
successfully applied to a wide variety of important fields. Amongst others:
1 The famous and somewhat ubiquitous handwritten digit recognition.
This data set is also known as MNIST, and is usually the first task in
some data analytics competitions. This data set is from USPS and
was first made popular by Yann LeCun, one of the pioneers of deep learning.
2 More recently, text mining, and specifically the topic of text
categorization/classification, has made successful use of statistical
machine learning.
3 Credit scoring is another application that has been connected with
statistical machine learning.
4 Disease diagnostics has also been tackled using statistical machine
learning.
Other applications include: audio processing, speaker recognition and
speaker identification.
DZ Ý Data Science MMW 2018 October 10, 2018 4 / 127
5. Handwritten Digit Recognition
Handwritten digit recognition is a fascinating problem that captured the
attention of the machine learning and neural network community for many
years, and has remained a benchmark problem in the field.
[Figure: sample images of the handwritten digits 0 through 9, each on a 28 × 28 pixel grid.]
DZ Ý Data Science MMW 2018 October 10, 2018 5 / 127
6. Handwritten Digit Recognition
Below is a portion of the benchmark training set
Note: The challenge here is building classification techniques that
accurately classify handwritten digits taken from the test set.
DZ Ý Data Science MMW 2018 October 10, 2018 6 / 127
11. Pattern Recognition (Classification) data set
Class X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
+ g c c t t c t c c a a a a c
+ a t g c a a t t t t t t a g
+ c c g t t t a t t t t t t c
+ t c t c a a c g t a a c a c
+ t a g g c a c c c c a g g c
+ a t a t a a a a a a g t t c
+ c a a g g t a g a a t g c t
+ t t a g c g g a t c c t a c
+ c t g c a a t t t t t c t a
+ t g t a a a c t a a t g c c
+ c a c t a a t t t a t t c c
+ a g g g g c a a g g a g g a
+ c c a t c a a a a a a a t a
+ a t g c a t t t t t c c g c
+ t c a g a a a t a t t a t g
What are the indicators that control promoter genes in the DNA?
library(kernlab); data(promotergene)
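A minimal sketch (not from the slides): one way to attack this question in R is to fit a kernel classifier to the promotergene data with kernlab; the train/test split size and kernel choice below are assumptions made purely for illustration.
## Sketch: an SVM classifier for the promoter gene data with kernlab
library(kernlab)
data(promotergene)
set.seed(1)
idx <- sample(nrow(promotergene), 80)                 # assumed training indices
fit <- ksvm(Class ~ ., data = promotergene[idx, ], kernel = "rbfdot")
mean(predict(fit, promotergene[-idx, ]) == promotergene[-idx, "Class"])  # test accuracy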
DZ Ý Data Science MMW 2018 October 10, 2018 11 / 127
12. Statistical Speaker Accent Recognition
Consider $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{-1, +1\}$, and the set
$$\mathcal{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$$
where
$$Y_i = \begin{cases} +1 & \text{if person } i \text{ is a Native US speaker} \\ -1 & \text{if person } i \text{ is a Non-Native US speaker} \end{cases}$$
and $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ is the time-domain representation of
his/her reading of an English sentence. The design matrix is
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix}$$
DZ Ý Data Science MMW 2018 October 10, 2018 12 / 127
13. Statistical Speaker Accent Recognition
Consider this design matrix
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix}$$
At RIT, we recently collected voices from n = 117 people.
Each sentence required about 11 seconds to be read.
At a sampling rate of 44,100 Hz, each sentence requires a vector of
dimension roughly p = 540,000 in the time domain.
We therefore have a gravely underdetermined system with $X \in \mathbb{R}^{n\times p}$
where $n \ll p$. Here, n = 117 and p = 540,000.
DZ Ý Data Science MMW 2018 October 10, 2018 13 / 127
14. Binary Classification in the Plane, X ⊂ R2
Given {(x1, y1), · · · , (xn, yn)}, with xi ∈ X ⊂ R2 and yi ∈ {−1, +1}
[Figure: scatter plot of the two classes (red and green points) in the (x1, x2) plane.]
What is the ”best” classifier f∗ that separates the red from the green?
DZ Ý Data Science MMW 2018 October 10, 2018 14 / 127
15. Motivating Binary Classification in the Plane
For the binary classification problem introduced earlier:
– A collection $\{(x_1, y_1), \cdots, (x_n, y_n)\}$ of i.i.d. observations is given, with
$x_i \in \mathcal{X} \subset \mathbb{R}^p$, $i = 1, \cdots, n$. $\mathcal{X}$ is the input space.
$y_i \in \{-1, +1\}$. $\mathcal{Y} = \{-1, +1\}$ is the output space.
– What is the probability law that governs the (xi, yi)’s?
– What is the functional relationship between x and y? Namely, one
considers mappings
$$f : \mathcal{X} \to \mathcal{Y}, \quad x \mapsto f(x).$$
– What is the ”best” approach to determining from the available
observations, the relationship f between x and y in such a way that,
given a new (unseen) observation xnew, its class ynew can be
predicted by f(xnew) as accurately and precisely as possible, that is,
with the smallest possible discrepancy.
DZ Ý Data Science MMW 2018 October 10, 2018 15 / 127
16. Basic Remarks on Classification
While some points clearly belong to one of the classes, there are other
points that are either strangers in a foreign land, or are positioned in
such a way that no automatic classification rule can clearly determine
their class membership.
One can construct a classification rule that puts all the points in their
corresponding classes. Such a rule would prove disastrous in
classifying new observations not present in the current collection of
observations.
Indeed, we have a collection of pairs (xi, yi) of observations coming
from some unknown distribution P(x, y).
DZ Ý Data Science MMW 2018 October 10, 2018 16 / 127
17. Basic Remarks on Classification
Finding an automatic classification rule that achieves the absolute
very best on the present data is not enough since infinitely many more
observations can be generated by P(x, y) for which good classification
will be required.
Even the universally best classifier will make mistakes.
Of all the functions in $\mathcal{Y}^{\mathcal{X}}$, it is reasonable to assume that there is a
function f* that maps any $x \in \mathcal{X}$ to its corresponding $y \in \mathcal{Y}$, i.e.,
$$f^* : \mathcal{X} \to \mathcal{Y}, \quad x \mapsto f^*(x),$$
with the minimum number of mistakes.
DZ Ý Data Science MMW 2018 October 10, 2018 17 / 127
18. Theoretical Risk Minimization
Let f denote any generic function mapping an element x of X to its
corresponding image f(x) in Y.
Each time x is drawn from P(x), the disagreement between the image
f(x) and the true image y is called the loss, denoted by ℓ(y, f(x)).
The expected value of this loss function with respect to the
distribution P(x, y) is called the risk functional of f. Generically, we
shall denote the risk functional of f by R(f), so that
$$R(f) = E[\ell(Y, f(X))] = \int \ell(y, f(x))\, dP(x, y).$$
The best function f* over the space $\mathcal{Y}^{\mathcal{X}}$ of all measurable functions
from $\mathcal{X}$ to $\mathcal{Y}$ is therefore
$$f^* = \arg\inf_{f} R(f),$$
so that
$$R(f^*) = R^* = \inf_{f} R(f).$$
DZ Ý Data Science MMW 2018 October 10, 2018 18 / 127
19. On the need to reduce the search space
Unfortunately, f∗ can only be found if P(x, y) is known. Therefore,
since we do not know P(x, y) in practice, it is hopeless to determine
f∗.
Besides, trying to find f∗ without the knowledge of P(x, y) implies
having to search the infinite dimensional function space YX of all
mappings from X to Y, which is an ill-posed and computationally
nasty problem.
Throughout this lecture, we will seek to solve the more reasonable
problem of choosing, from a function space $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$, the one function
$f \in \mathcal{F}$ that best estimates the dependencies between x and y.
It is therefore important to define what is meant by best estimates.
For that, the concepts of loss function and risk functional need to be
defined.
DZ Ý Data Science MMW 2018 October 10, 2018 19 / 127
20. Loss and Risk in Pattern Recognition
For classification/pattern recognition, the so-called 0-1 loss function
defined below is used. More specifically,
$$\ell(y, f(x)) = 1_{\{Y \neq f(X)\}} = \begin{cases} 0 & \text{if } y = f(x), \\ 1 & \text{if } y \neq f(x). \end{cases} \qquad (1)$$
The corresponding risk functional is
$$R(f) = \int \ell(y, f(x))\, dP(x, y) = E\big[1_{\{Y \neq f(X)\}}\big] = \Pr_{(X,Y)\sim P}[Y \neq f(X)].$$
The minimizer of the 0-1 risk functional over all possible classifiers is the
so-called Bayes classifier, which we shall denote here by f*, given by
$$f^* = \arg\inf_{f}\ \Pr_{(X,Y)\sim P}[Y \neq f(X)].$$
Specifically, the Bayes classifier f* is given by the maximizer of the posterior probability of
class membership, namely
$$f^*(x) = \arg\max_{y \in \mathcal{Y}}\ \Pr[Y = y \mid x].$$
DZ Ý Data Science MMW 2018 October 10, 2018 20 / 127
21. Bayes Learner for known situations
If $p(x\mid y=+1) = \mathrm{MVN}(x; \mu_{+1}, \Sigma)$ and $p(x\mid y=-1) = \mathrm{MVN}(x; \mu_{-1}, \Sigma)$, the
Bayes classifier f*, the classifier that achieves the Bayes risk, coincides
with the population Linear Discriminant Analysis (LDA) classifier $f_{\mathrm{LDA}}$, which, for
any new point x, yields the predicted class
$$f^*(x) = f_{\mathrm{LDA}}(x) = \mathrm{sign}\big(\beta_0 + \beta^\top x\big),$$
where
$$\beta = \Sigma^{-1}(\mu_{+1} - \mu_{-1}),$$
and
$$\beta_0 = -\frac{1}{2}(\mu_{+1} + \mu_{-1})^\top \Sigma^{-1}(\mu_{+1} - \mu_{-1}) + \log\frac{\pi_{+1}}{\pi_{-1}},$$
with $\pi_{+1} = \Pr[Y = +1]$ and $\pi_{-1} = 1 - \pi_{+1}$ representing the prior
probabilities of class membership.
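A minimal sketch (not from the slides) of this population LDA rule in R, with illustrative (assumed) class means, common covariance, and priors:
## Sketch: population LDA / Bayes classifier under Gaussian class-conditionals
mu_pos <- c(2, 1); mu_neg <- c(0, 0)           # class means (assumed)
Sigma  <- matrix(c(1, 0.3, 0.3, 1), 2, 2)      # common covariance (assumed)
pi_pos <- 0.5; pi_neg <- 1 - pi_pos            # class priors (assumed)
beta  <- solve(Sigma, mu_pos - mu_neg)
beta0 <- -0.5 * sum((mu_pos + mu_neg) * beta) + log(pi_pos / pi_neg)
f_lda <- function(x) sign(beta0 + sum(beta * x))   # predicted class in {-1, +1}
f_lda(c(1.5, 0.5))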
DZ Ý Data Science MMW 2018 October 10, 2018 21 / 127
22. Bayes Risk for known situations
Bayes Risk in Binary Classification under Gaussian Class-Conditional
Densities with common covariance matrix: Let $x = (x_1, x_2, \cdots, x_p)^\top$ be a
p-dimensional vector coming from either class +1 or class −1. Let f be a
function (classifier) that seeks to map x to $y \in \{-1, +1\}$ as accurately as
possible. Let $R^* = \min_f \{\Pr[f(X) \neq Y]\}$ be the Bayes risk, i.e. the
smallest error rate among all possible f. If $p(x\mid y=+1) = \mathrm{MVN}(x; \mu_{+1}, \Sigma)$
and $p(x\mid y=-1) = \mathrm{MVN}(x; \mu_{-1}, \Sigma)$, then
$$R^* = R(f^*) = \Phi\!\left(-\frac{\sqrt{\Delta}}{2}\right) = \int_{-\infty}^{-\sqrt{\Delta}/2} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz,$$
with
$$\Delta = (\mu_{+1} - \mu_{-1})^\top \Sigma^{-1}(\mu_{+1} - \mu_{-1}).$$
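A minimal sketch (not from the slides) evaluating this Bayes risk in R, continuing the illustrative parameters assumed above:
## Sketch: Bayes risk R* = Phi(-sqrt(Delta)/2) under the Gaussian LDA setting
mu_pos <- c(2, 1); mu_neg <- c(0, 0)
Sigma  <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
Delta  <- drop(t(mu_pos - mu_neg) %*% solve(Sigma) %*% (mu_pos - mu_neg))  # squared Mahalanobis distance
pnorm(-sqrt(Delta) / 2)    # the Bayes risk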
DZ Ý Data Science MMW 2018 October 10, 2018 22 / 127
23. Loss Functions for Classification
With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x))
Zero-one (0/1) loss: $\ell(y, f(x)) = 1(y \neq f(x)) = 1(yh(x) < 0)$
Hinge loss: $\ell(y, f(x)) = \max(1 - yh(x), 0) = (1 - yh(x))_+$
Logistic loss: $\ell(y, f(x)) = \log(1 + \exp(-yh(x)))$
Exponential loss: $\ell(y, f(x)) = \exp(-yh(x))$
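A minimal sketch (not from the slides) plotting these four losses as functions of the margin yh(x):
## Sketch: classification losses as functions of the margin m = y*h(x)
m <- seq(-3, 3, length.out = 200)
zero_one    <- as.numeric(m < 0)
hinge       <- pmax(1 - m, 0)
logistic    <- log(1 + exp(-m))
exponential <- exp(-m)
plot(m, exponential, type = "l", ylim = c(0, 4), xlab = "y h(x)", ylab = "loss", col = "orange")
lines(m, hinge, col = "blue"); lines(m, logistic, col = "red")
lines(m, zero_one, col = "black", lty = 2)
legend("topright", c("exponential", "hinge", "logistic", "zero-one"),
       col = c("orange", "blue", "red", "black"), lty = c(1, 1, 1, 2))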
DZ Ý Data Science MMW 2018 October 10, 2018 23 / 127
24. Loss Functions for Classification
With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x))
Zero-one (0/1) loss: $\ell(y, f(x)) = 1(yh(x) < 0)$
Hinge loss: $\ell(y, f(x)) = \max(1 - yh(x), 0)$
Logistic loss: $\ell(y, f(x)) = \log(1 + \exp(-yh(x)))$
Exponential loss: $\ell(y, f(x)) = \exp(-yh(x))$
[Figure: the hinge, squared, logistic, exponential, and zero-one losses plotted as functions of the margin yh(x).]
DZ Ý Data Science MMW 2018 October 10, 2018 24 / 127
25. Loss Functions for Regression
With f : X −→ IR, and f ∈ H.
ℓ1 loss: $\ell(y, f(x)) = |y - f(x)|$
ℓ2 loss: $\ell(y, f(x)) = |y - f(x)|^2$
ε-insensitive ℓ1 loss: $\ell(y, f(x)) = \big(|y - f(x)| - \varepsilon\big)_+$
ε-insensitive ℓ2 loss: $\ell(y, f(x)) = \big(|y - f(x)|^2 - \varepsilon\big)_+$
[Figure: the ε-insensitive ℓ1, ε-insensitive ℓ2, squared, and absolute losses plotted as functions of y − f(x).]
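A minimal sketch (not from the slides) plotting the regression losses as functions of the residual, with an assumed ε = 0.5:
## Sketch: regression losses as functions of r = y - f(x)
r   <- seq(-3, 3, length.out = 200)
eps <- 0.5
l1     <- abs(r)
l2     <- r^2
eps_l1 <- pmax(abs(r) - eps, 0)
eps_l2 <- pmax(r^2 - eps, 0)
plot(r, l2, type = "l", ylim = c(0, 2), xlab = "y - f(x)", ylab = "loss")
lines(r, l1, col = "blue"); lines(r, eps_l1, col = "red"); lines(r, eps_l2, col = "darkgreen")
legend("top", c("squared", "absolute", "eps-l1", "eps-l2"),
       col = c("black", "blue", "red", "darkgreen"), lty = 1)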
DZ Ý Data Science MMW 2018 October 10, 2018 25 / 127
26. Function Class in Pattern Recognition
As stated earlier, trying to find f* is hopeless. One needs to select a
function space $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$, and then choose the best estimator $f^+$ from $\mathcal{F}$,
i.e.,
$$f^+ = \arg\inf_{f \in \mathcal{F}} R(f),$$
so that
$$R(f^+) = R^+ = \inf_{f \in \mathcal{F}} R(f).$$
For the binary pattern recognition problem, one may consider finding the
best linear separating hyperplane, i.e.
$$\mathcal{F} = \Big\{ f : \mathcal{X} \to \{-1, +1\} \;\Big|\; \exists\, \alpha_0 \in \mathbb{R},\ \alpha = (\alpha_1, \cdots, \alpha_p)^\top \in \mathbb{R}^p \text{ such that } f(x) = \mathrm{sign}\big(\alpha^\top x + \alpha_0\big),\ \forall x \in \mathcal{X} \Big\}$$
DZ Ý Data Science MMW 2018 October 10, 2018 26 / 127
27. Empirical Risk Minimization
Let $\mathcal{D} = \{(X_1, Y_1), \cdots, (X_n, Y_n)\}$ be an iid sample from P(x, y).
The empirical version of the risk functional is
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{Y_i \neq f(X_i)\}}$$
We therefore seek the best function by the empirical standard,
$$\widehat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} 1_{\{Y_i \neq f(X_i)\}}$$
Since it is impossible to search all possible functions, it is usually
crucial to choose the ”right” function space F.
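A minimal sketch (not from the slides) of the empirical 0/1 risk of a candidate linear classifier on simulated toy data (all values assumed):
## Sketch: empirical risk (misclassification rate) of f(x) = sign(a0 + a'x)
set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)
y <- ifelse(x[, 1] + x[, 2] + rnorm(n, 0, 0.5) > 0, +1, -1)
empirical_risk <- function(a0, a, x, y) {
  pred <- sign(a0 + x %*% a)
  mean(pred != y)                 # proportion of misclassified points
}
empirical_risk(0, c(1, 1), x, y)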
DZ Ý Data Science MMW 2018 October 10, 2018 27 / 127
28. Bias-Variance Trade-Off
In traditional statistical estimation, one needs to address at the very least
issues like: (a) the bias of the estimator; (b) the variance of the
estimator; (c) the consistency of the estimator. Recall from elementary
point estimation that, if θ is the true value of the parameter to be
estimated, and $\hat\theta$ is a point estimator of θ, then one can decompose the
total error as follows:
$$\hat\theta - \theta = \underbrace{\hat\theta - E[\hat\theta]}_{\text{Estimation error}} + \underbrace{E[\hat\theta] - \theta}_{\text{Bias}} \qquad (2)$$
Under the squared error loss, one seeks $\hat\theta$ that minimizes the mean squared
error,
$$\hat\theta = \arg\min_{\theta \in \Theta} E[(\hat\theta - \theta)^2] = \arg\min_{\theta \in \Theta} \mathrm{MSE}(\hat\theta),$$
rather than trying to find the minimum variance unbiased estimator
(MVUE).
DZ Ý Data Science MMW 2018 October 10, 2018 28 / 127
29. Bias-Variance Trade-off
Clearly, the traditional so-called bias-variance decomposition of the MSE
reveals the need for bias-variance trade-off. Indeed,
$$\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = E[(\hat\theta - E[\hat\theta])^2] + (E[\hat\theta] - \theta)^2 = \text{variance} + \text{bias}^2$$
If the estimator $\hat\theta$ were to be sought from among all possible estimators of θ, then it
might make sense to hope for the MVUE. Unfortunately, and especially in
function estimation as we argued earlier, there will be some bias,
so that the error one gets has a bias component along with the variance
component in the squared error loss case. If the bias is made too small, then an
estimator with a larger variance is obtained. Similarly, a small variance will
tend to come from estimators with a relatively large bias. The best
compromise is then to trade off bias and variance, which in functional
terms translates into a trade-off between approximation error and estimation
error.
DZ Ý Data Science MMW 2018 October 10, 2018 29 / 127
30. Bias-Variance Trade-off
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h. For small values (less smoothing) the variability is too high; for large values (more smoothing) the squared bias gets large; the true risk is minimized at an optimal amount of smoothing.
DZ Ý Data Science MMW 2018 October 10, 2018 30 / 127
31. Structural risk minimization principle
Since making the estimator of the function arbitrarily complex causes the
problems mentioned earlier, the intuition for a trade-off reveals that instead
of minimizing the empirical risk Rn(f) one should do the following:
Choose a collection of function spaces {Fk : k = 1, 2, · · · }, maybe a
collection of nested spaces (increasing in size)
Minimize the empirical risk in each class
Minimize the penalized empirical risk
$$\min_{k}\ \min_{f \in \mathcal{F}_k} \Big\{ R_n(f) + \mathrm{penalty}(k, n) \Big\}$$
where penalty(k, n) gives preference to models with small estimation error.
It is important to note that penalty(k, n) measures the capacity of the
function class Fk. The widely used technique of regularization for solving
ill-posed problems is a particular instance of structural risk minimization.
DZ Ý Data Science MMW 2018 October 10, 2018 31 / 127
32. Regularization for Complexity Control
Tikhonov's Variational Approach to Regularization [Tikhonov, 1963]
Find f that minimizes the functional
$$R_n^{(\mathrm{reg})}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda\,\Omega(f)$$
where λ > 0 is some predefined constant.
Ivanov's Quasi-solution Approach to Regularization [Ivanov, 1962]
Find f that minimizes the functional
$$R_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i))$$
subject to the constraint
$$\Omega(f) \leq C$$
where C > 0 is some predefined constant.
DZ Ý Data Science MMW 2018 October 10, 2018 32 / 127
33. Regularization for Complexity Control
Phillips' Residual Approach to Regularization [Phillips, 1962]
Find f that minimizes the functional
$$\Omega(f)$$
subject to the constraint
$$\frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(x_i)) \leq \mu$$
where μ > 0 is some predefined constant.
In all the above, the functional Ω(f) is called the regularization functional.
Ω(f) is defined in such a way that it controls the complexity of the
function f. For instance,
$$\Omega(f) = \|f''\|^2 = \int_a^b \big(f''(t)\big)^2\, dt$$
is a regularization functional used in spline smoothing.
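A minimal sketch (not from the slides): the smoothing spline fitted by R's stats::smooth.spline minimizes exactly such a Tikhonov-type criterion, a squared-error data-fit term plus λ∫(f''(t))² dt, with λ chosen by cross-validation; the data below are simulated purely for illustration.
## Sketch: Tikhonov-style regularization via a smoothing spline, which
## minimizes sum_i (y_i - f(x_i))^2 + lambda * integral (f''(t))^2 dt
set.seed(1)
n <- 100
x <- seq(0, 2*pi, length.out = n)
y <- sin(x) + rnorm(n, sd = 0.3)             # noisy data, assumed for illustration
fit <- smooth.spline(x, y)                    # lambda chosen by (generalized) CV
plot(x, y); lines(x, predict(fit, x)$y, col = "red", lwd = 2)
fit$lambda                                    # the selected regularization constant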
DZ Ý Data Science MMW 2018 October 10, 2018 33 / 127
34. Support Vector Machines and the Hinge Loss
Let's consider $h(x) = w^\top x + b$, with $w \in \mathbb{R}^p$, $b \in \mathbb{R}$, and the classifier
$$f(x) = \mathrm{sign}(h(x)) = \mathrm{sign}(w^\top x + b).$$
Recall the hinge loss defined as
$$\ell(y, f(x)) = (1 - yh(x))_+ = \begin{cases} 0 & \text{if } yh(x) \geq 1 \ \ \text{(confident correct prediction)} \\ 1 - yh(x) & \text{if } yh(x) < 1 \ \ \text{(margin violation or wrong prediction)} \end{cases}$$
[Figure: the hinge loss plotted as a function of yh(x).]
DZ Ý Data Science MMW 2018 October 10, 2018 34 / 127
35. Support Vector Machines and the Hinge Loss
The Support Vector Machine classifier can be formulated as
$$\text{Minimize } \ E(w, b) = \frac{1}{n}\sum_{i=1}^{n} \big(1 - y_i(w^\top x_i + b)\big)_+ \quad \text{subject to} \quad \|w\|_2^2 < \tau,$$
which is equivalent in regularized (Lagrangian) form to
$$(\widehat{w}, \widehat{b}) = \arg\min_{w \in \mathbb{R}^p,\ b \in \mathbb{R}} \left\{ \frac{1}{n}\sum_{i=1}^{n} \big(1 - y_i(w^\top x_i + b)\big)_+ + \lambda \|w\|_2^2 \right\}$$
The SVM linear binary classification estimator is given by
$$\widehat{f}_n(x) = \mathrm{sign}(\widehat{h}(x)) = \mathrm{sign}(\widehat{w}^\top x + \widehat{b})$$
where $\widehat{w}$ and $\widehat{b}$ are estimators of w and b respectively.
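A minimal sketch (not from the slides): fitting a linear SVM with kernlab::ksvm on simulated two-class data; the data and the cost parameter C are assumptions for illustration.
## Sketch: linear SVM via the hinge-loss formulation implemented in ksvm
library(kernlab)
set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)
y <- factor(ifelse(x[, 1] + 2 * x[, 2] + rnorm(n, 0, 0.3) > 0, +1, -1))
fit <- ksvm(x, y, kernel = "vanilladot", C = 1)   # vanilladot = linear kernel
table(predict(fit, x), y)                          # training confusion matrix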
DZ Ý Data Science MMW 2018 October 10, 2018 35 / 127
36. Classification realized with Linear Boundary
SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1
Figure: Linear SVM classifier with a relatively small margin
DZ Ý Data Science MMW 2018 October 10, 2018 36 / 127
37. Classification realized with Linear Boundary
SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1
Figure: Linear SVM classifier with a relatively large margin
DZ Ý Data Science MMW 2018 October 10, 2018 37 / 127
38. SVM Learning via Quadratic Programming
When the decision boundary is nonlinear, the αi’s in the expression of
the support vector machine classifier ˆf are determined by solving the
following quadratic programming problem
$$\text{Maximize } \ E(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to
$$0 \leq \alpha_i \leq C \ \ (i = 1, \cdots, n) \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
The above formulation is an instance of the general QP
$$\text{Maximize } \ -\frac{1}{2}\alpha^\top Q\alpha + \mathbf{1}^\top \alpha$$
subject to
$$\alpha^\top y = 0 \quad \text{and} \quad \alpha_i \in [0, C], \ \forall i \in [n],$$
where $Q \in \mathbb{R}^{n \times n}$ has entries $Q_{ij} = y_i y_j K(x_i, x_j)$.
Data Science MMW 2018 October 10, 2018 38 / 127
39. SVM Learning via Quadratic Programming in R
The quadratic programming problem
$$\text{Maximize } \ -\frac{1}{2}\alpha^\top Q\alpha + \mathbf{1}^\top \alpha \quad \text{subject to} \quad \alpha^\top y = 0 \ \text{ and } \ \alpha_i \in [0, C], \ \forall i \in [n],$$
is equivalent to
$$\text{Minimize } \ \frac{1}{2}\alpha^\top Q\alpha - \mathbf{1}^\top \alpha \quad \text{subject to} \quad \alpha^\top y = 0 \ \text{ and } \ \alpha_i \in [0, C], \ \forall i \in [n],$$
which is solved with the R package kernlab via the function ipop(), whose general form is
$$\text{Minimize } \ c^\top \alpha + \frac{1}{2}\alpha^\top H\alpha \quad \text{subject to} \quad b \leq A\alpha \leq b + r \ \text{ and } \ l \leq \alpha \leq u.$$
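A minimal sketch (not from the slides) of the SVM dual solved directly with kernlab::ipop, mapping the dual into ipop's form (H = Q, c = −1, A = yᵀ, b = 0, r = 0, l = 0, u = C); the toy data and the values of C and σ are assumptions.
## Sketch: solving the SVM dual QP with ipop()
library(kernlab)
set.seed(1)
n <- 40
x <- matrix(rnorm(2 * n), n, 2)
y <- ifelse(x[, 1] + x[, 2] > 0, +1, -1)
K <- as.matrix(kernelMatrix(rbfdot(sigma = 1), x))   # kernel matrix K(x_i, x_j)
Q <- (y %*% t(y)) * K                                # Q_ij = y_i y_j K(x_i, x_j)
C <- 1
sol <- ipop(c = rep(-1, n), H = Q, A = t(y), b = 0,
            l = rep(0, n), u = rep(C, n), r = 0)
alpha <- primal(sol)                                 # the alpha_i's
sum(alpha > 1e-6)                                    # number of support vectors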
DZ Ý Data Science MMW 2018 October 10, 2018 39 / 127
40. Support Vector Machines and Kernels
As a result of the kernelization, the SVM classifier delivers, for each x,
the estimated response
$$\widehat{f}_n(x) = \mathrm{sign}\left( \sum_{j=1}^{|s|} \widehat{\alpha}_{s_j}\, y_{s_j}\, K(x_{s_j}, x) + \widehat{b} \right)$$
where $s_j \in \{1, 2, \cdots, n\}$, $s = \{s_1, s_2, \cdots, s_{|s|}\}$ and $|s| \ll n$.
The kernel $K(\cdot, \cdot)$ is a bivariate function $K : \mathcal{X}\times\mathcal{X} \longrightarrow \mathbb{R}_+$ such
that, given $x_l, x_m \in \mathcal{X}$, the value of
$$K(x_l, x_m) = \langle \Phi(x_l), \Phi(x_m)\rangle = \Phi(x_l)^\top \Phi(x_m)$$
represents the similarity between $x_l$ and $x_m$, and corresponds to an
implicit inner product in some feature space $\mathcal{Z}$ of dimension higher
than $\dim(\mathcal{X})$, where the decision boundary is conveniently a large-margin
separating hyperplane.
Trick: There is never any need in practice to explicitly manipulate
the higher-dimensional feature map $\Phi : \mathcal{X} \longrightarrow \mathcal{Z}$.
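A minimal sketch (not from the slides) of the kernel trick in kernlab: the similarity K(x_l, x_m) is computed directly, without ever forming the feature map Φ explicitly.
## Sketch: kernel matrices without explicit feature maps
library(kernlab)
set.seed(1)
x <- matrix(rnorm(10 * 3), 10, 3)          # 10 toy points in R^3 (assumed)
rbf  <- rbfdot(sigma = 0.5)                # Gaussian (RBF) kernel
poly <- polydot(degree = 2)                # polynomial kernel
K_rbf  <- kernelMatrix(rbf, x)             # 10 x 10 similarity matrix
K_poly <- kernelMatrix(poly, x)
K_rbf[1, 2]                                # similarity between points 1 and 2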
DZ Ý Data Science MMW 2018 October 10, 2018 40 / 127
41. Classification realized with Nonlinear Boundary
SVM Optimal Separating and Margin Hyperplanes
Figure: Nonlinear SVM classifier with a relatively small margin
DZ Ý Data Science MMW 2018 October 10, 2018 41 / 127
42. Interplay between the aspects of statistical learning
DZ Ý Data Science MMW 2018 October 10, 2018 42 / 127
43. Statistical Consistency
Definition: Let $\hat\theta_n$ be an estimator of some scalar quantity θ based
on an i.i.d. sample $X_1, X_2, \cdots, X_n$ from the distribution with
parameter θ. Then $\hat\theta_n$ is said to be a consistent estimator of θ if $\hat\theta_n$
converges in probability to θ, i.e.,
$$\hat\theta_n \xrightarrow[n\to\infty]{P} \theta.$$
In other words, $\hat\theta_n$ is a consistent estimator of θ if, $\forall\epsilon > 0$,
$$\lim_{n\to\infty} \Pr\big[|\hat\theta_n - \theta| > \epsilon\big] = 0.$$
It turns out that for unbiased estimators $\hat\theta_n$, consistency follows
directly from a basic probabilistic inequality like Chebyshev's
inequality. However, for biased estimators, one has to be more careful.
DZ Ý Data Science MMW 2018 October 10, 2018 43 / 127
44. A Basic Important Inequality
(Bienaymé–Chebyshev inequality) Let X be a random variable with finite
mean $\mu_X = E[X]$, i.e. $|E[X]| < +\infty$, and finite variance $\sigma_X^2 = V(X)$, i.e.,
$|V(X)| < +\infty$. Then, $\forall\epsilon > 0$,
$$\Pr\big[|X - E[X]| > \epsilon\big] \leq \frac{V(X)}{\epsilon^2}.$$
It is therefore easy to see here that, with unbiased $\hat\theta_n$, one has $E[\hat\theta_n] = \theta$,
and the consistency result is immediate. For the sake of clarity, let's recall here the
elementary weak law of large numbers.
DZ Ý Data Science MMW 2018 October 10, 2018 44 / 127
45. Weak Law of Large Numbers
Let X be a random variable with finite mean $\mu_X = E[X]$, i.e.
$|E[X]| < +\infty$, and finite variance $\sigma_X^2 = V(X)$, i.e., $|V(X)| < +\infty$. Let
$X_1, X_2, \cdots, X_n$ be a random sample of n observations drawn
independently from the distribution of X, so that for $i = 1, \cdots, n$, we
have $E[X_i] = \mu$ and $V[X_i] = \sigma^2$. Let $\bar{X}_n$ be the sample mean, i.e.,
$$\bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Then, clearly, $E[\bar{X}_n] = \mu$, and, $\forall\epsilon > 0$,
$$\lim_{n\to\infty} \Pr\big[|\bar{X}_n - \mu| > \epsilon\big] = 0. \qquad (3)$$
This essentially expresses the fact that the empirical mean $\bar{X}_n$ converges
in probability to the theoretical mean μ in the limit of very large samples.
DZ Ý Data Science MMW 2018 October 10, 2018 45 / 127
46. Weak Law of Large Numbers
We therefore have
$$\bar{X}_n \xrightarrow[n\to\infty]{P} \mu.$$
With $\mu_{\bar{X}} = E[\bar{X}_n] = \mu$ and $\sigma^2_{\bar{X}} = \sigma^2/n$, one applies the
Bienaymé–Chebyshev inequality and gets: $\forall\epsilon > 0$,
$$\Pr\big[|\bar{X}_n - \mu| > \epsilon\big] \leq \frac{\sigma^2}{n\epsilon^2}, \qquad (4)$$
which, by inversion, is the same as
$$|\bar{X}_n - \mu| < \sqrt{\frac{1}{\delta}\cdot\frac{\sigma^2}{n}} \qquad (5)$$
with probability at least 1 − δ.
Why is all the above of any interest to statistical learning theory?
DZ Ý Data Science MMW 2018 October 10, 2018 46 / 127
47. Weak Law of Large Numbers
Why is all the above of any interest to statistical learning theory?
Equation (3) states the much-needed consistency of $\bar{X}_n$ as an
estimator of μ.
Equation (4), by showing the dependence of the bound on n and ε, helps assess
the rate at which $\bar{X}_n$ converges to μ.
Equation (5), by providing a confidence interval, helps compute bounds
on the unknown true mean μ as a function of the empirical mean $\bar{X}_n$
and the confidence level 1 − δ.
Finally, how does one go about constructing estimators with all the above
properties?
DZ Ý Data Science MMW 2018 October 10, 2018 47 / 127
48. Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
DZ Ý Data Science MMW 2018 October 10, 2018 48 / 127
49. Theoretical Aspects of Statistical Learning
For binary classification using the so-called 0/1 loss function, the
Vapnik-Chervonenkis inequality takes the form
$$\Pr\left[\sup_{f\in\mathcal{F}} \big|\widehat{R}_n(f) - R(f)\big| > \varepsilon\right] \leq 8\, S(\mathcal{F}, n)\, e^{-n\varepsilon^2/32} \qquad (6)$$
which is also expressed in terms of an expectation as
$$E\left[\sup_{f\in\mathcal{F}} \big|\widehat{R}_n(f) - R(f)\big|\right] \leq 2\sqrt{\frac{\log S(\mathcal{F}, n) + \log 2}{n}} \qquad (7)$$
The quantity $S(\mathcal{F}, n)$ plays an important role in VC theory and
will be explored in greater detail later.
Note that these bounds, including the one presented earlier in the VC
fundamental machine learning theorem, are not asymptotic bounds.
They hold for any n.
The bounds are nice and easy to use if h or $S(\mathcal{F}, n)$ is known.
Unfortunately the bound may exceed 1, making it useless.
DZ Ý Data Science MMW 2018 October 10, 2018 49 / 127
50. Components of Statistical Machine Learning
Interestingly, all those 4 components of classical estimation theory, will be
encountered again in statistical learning theory. Essentially, the 4
components of statistical learning theory consist of finding the answers to
the following questions:
(a) What are the necessary and sufficient conditions for the
consistency of a learning process based on the ERM principle? This
leads to the Theory of consistency of learning processes.
(b) How fast is the rate of convergence of the learning process? This
leads to the Nonasymptotic theory of the rate of convergence of
learning processes;
(c) How can one control the rate of convergence (the generalization
ability) of the learning process? This leads to the Theory of
controlling the generalization ability of learning processes;
(d) How can one construct algorithms that can control the
generalization ability of the learning process? This leads to the Theory of
constructing learning algorithms.
DZ Ý Data Science MMW 2018 October 10, 2018 50 / 127
51. Error Decomposition revisited
A reasoning on error decomposition and consistency of estimators, along
with rates, bounds and algorithms, applies to function spaces: indeed, the
difference between the true risk R(f_n) associated with f_n and the overall
minimum risk R* can be decomposed to explore in greater detail the
sources of error in the function estimation process:
$$R(f_n) - R^* = \underbrace{R(f_n) - R(f^+)}_{\text{Estimation error}} + \underbrace{R(f^+) - R^*}_{\text{Approximation error}} \qquad (8)$$
A reasoning similar to the bias-variance trade-off and consistency can be
made, with the added complication brought by the need to distinguish
between the true risk functional and the empirical risk functional, and also
by the need to assess both pointwise behaviors and uniform behaviors. In
a sense, one needs to generalize the decomposition and the law of large
numbers to function spaces.
DZ Ý Data Science MMW 2018 October 10, 2018 51 / 127
52. Approximation-Estimation Trade-Off
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h. For small values (less smoothing) the variability is too high; for large values (more smoothing) the squared bias gets large; the true risk is minimized at an optimal amount of smoothing.
DZ Ý Data Science MMW 2018 October 10, 2018 52 / 127
53. Consistency of the Empirical Risk Minimization principle
The ERM principle is consistent if it provides a sequence of functions
$\widehat{f}_n$, $n = 1, 2, \cdots$, for which both the expected risk $R(\widehat{f}_n)$ and the
empirical risk $R_n(\widehat{f}_n)$ converge to the minimal possible value of the
risk $R(f^+)$ in the function class under consideration, i.e.,
$$R(\widehat{f}_n) \xrightarrow[n\to\infty]{P} \inf_{f\in\mathcal{F}} R(f) = R(f^+)$$
and
$$R_n(\widehat{f}_n) \xrightarrow[n\to\infty]{P} \inf_{f\in\mathcal{F}} R(f) = R(f^+).$$
Vapnik discusses the details of this theorem at length, and extends
the exploration to include the difference between what he calls trivial
consistency and non-trivial consistency.
DZ Ý Data Science MMW 2018 October 10, 2018 53 / 127
54. Consistency of the Empirical Risk Minimization principle
To better understand consistency in function spaces, consider the
sequence of random variables
$$\xi^n = \sup_{f\in\mathcal{F}} \big( R(f) - R_n(f) \big), \qquad (9)$$
and consider studying
$$\lim_{n\to\infty} \Pr\left[\sup_{f\in\mathcal{F}} \big( R(f) - R_n(f) \big) > \varepsilon\right] = 0, \quad \forall\varepsilon > 0.$$
Vapnik shows that the sequence of the means of the random variable
ξn converges to zero as the number n of observations increases.
He also remarks that the sequence of random variables ξn converges
in probability to zero if the set of functions F, contains a finite
number m of elements. We will show that later in the case of pattern
recognition.
DZ Ý Data Science MMW 2018 October 10, 2018 54 / 127
55. Consistency of the Empirical Risk Minimization principle
It remains then to describe the properties of the set of functions F,
and probability measure P(x, y) under which the sequence of random
variables ξn converges in probability to zero.
$$\lim_{n\to\infty} \Pr\left[\ \sup_{f\in\mathcal{F}} [R(f) - R_n(f)] > \varepsilon \ \text{ or } \ \sup_{f\in\mathcal{F}} [R_n(f) - R(f)] > \varepsilon\ \right] = 0.$$
Recall that Rn(f) is the realized disagreement between classifier f
and the truth about the label y of x based on information contained
in the sample D.
It is easy to see that, for a given (fixed) function (classifier) f,
$$E[R_n(f)] = R(f). \qquad (10)$$
Note that while this pointwise unbiasedness of the empirical risk is a
good bottom-line property to have, it is not enough. More is needed,
as the comparison is against $R(f^+)$ or, even better, $R(f^*)$.
DZ Ý Data Science MMW 2018 October 10, 2018 55 / 127
56. Consistency of the Empirical Risk
Remember that the goal of statistical function estimation is to devise
a technique (strategy) that chooses from the function class F, the
one function whose true risk is as close as possible to the lowest risk
in class F.
The question arises: since one cannot calculate the true error, how
can one devise a learning strategy for choosing classifiers based on it?
Tentative answer: At least devise strategies that yield functions for
which the upper bound on the theoretical risk is as tight as possible,
so that one can make confidence statements of the form:
With probability 1 − δ over an i.i.d. draw of some sample according
to the distribution P, the expected future error rate of some classifier
is bounded by a function g(δ, error rate on sample) of δ and the error
rate on sample.
$$\Pr\Big[\text{TestError} \leq \text{TrainError} + \phi(n, \delta, \kappa(\mathcal{F}))\Big] \geq 1 - \delta$$
DZ Ý Data Science MMW 2018 October 10, 2018 56 / 127
57. Foundation Result in Statistical Learning Theory
Theorem (Vapnik and Chervonenkis, 1971): Let $\mathcal{F}$ be a class of
functions implementing some learning machines, and let $\zeta = \mathrm{VCdim}(\mathcal{F})$ be
the VC dimension of $\mathcal{F}$. Let the theoretical and the empirical risks be
defined as earlier, and consider any data distribution in the population of
interest. Then $\forall f \in \mathcal{F}$, the prediction error (theoretical risk) is bounded by
$$R(f) \leq \widehat{R}_n(f) + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}} \qquad (11)$$
with probability at least 1 − η, or
$$\Pr\left[\text{TestError} \leq \text{TrainError} + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}}\ \right] \geq 1 - \eta$$
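A minimal sketch (not from the slides): evaluating the VC bound (11) numerically, using the VC dimension of linear classifiers in R^p, ζ = p + 1, and an assumed empirical risk value.
## Sketch: the VC bound (11) as a function of n, zeta and eta
vc_bound <- function(emp_risk, n, zeta, eta = 0.05) {
  emp_risk + sqrt((zeta * (log(2 * n / zeta) + 1) - log(eta / 4)) / n)
}
vc_bound(emp_risk = 0.10, n = 10000, zeta = 11)  # p = 10 features: informative bound
vc_bound(emp_risk = 0.10, n = 30,    zeta = 11)  # exceeds 1, hence uninformative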
DZ Ý Data Science MMW 2018 October 10, 2018 57 / 127
58. Optimism of the Training Error
[Figure: E[Training Error] and E[Test Error] plotted against model complexity, illustrating the optimism of the training error.]
DZ Ý Data Science MMW 2018 October 10, 2018 58 / 127
59. Bounds on the Generalization Error
For instance, using Chebyshev's inequality and the fact that
$E[R_n(f)] = R(f)$, it is easy to see that, for a given classifier f and a sample
$\mathcal{D} = \{(x_1, y_1), \cdots, (x_n, y_n)\}$,
$$\Pr\big[|R_n(f) - R(f)| > \epsilon\big] \leq \frac{R(f)(1 - R(f))}{n\epsilon^2}.$$
To estimate the true but unknown error R(f) with a probability of at least
1 − δ, it makes sense to use inversion, i.e., set
$$\delta = \frac{R(f)(1 - R(f))}{n\epsilon^2}, \quad \text{so that} \quad \epsilon = \sqrt{\frac{R(f)(1 - R(f))}{n\delta}}.$$
Owing to the fact that $\max_{R(f)\in[0,1]} R(f)(1 - R(f)) = \frac{1}{4}$, we have
$$\sqrt{\frac{R(f)(1 - R(f))}{n\delta}} \leq \sqrt{\frac{1}{4n\delta}} = \left(\frac{1}{4n\delta}\right)^{1/2}.$$
DZ Ý Data Science MMW 2018 October 10, 2018 59 / 127
60. Bounds on the Generalization Error
Based on Chebyshev's inequality, for a given classifier f, with a
probability of at least 1 − δ, the bound on the difference between the
true risk R(f) and the empirical risk $R_n(f)$ is given by
$$|R_n(f) - R(f)| < \left(\frac{1}{4n\delta}\right)^{1/2}.$$
Recall that one of the goals of statistical learning theory is to assess
the rate of convergence of the empirical risk to the true risk, which
translates into assessing how tight the corresponding bounds on the
true risk are.
In fact, it turns out many bounds can be so loose as to become
useless. It turns out that the above Chebyshev-based bound is not a
good one, at least compared to bounds obtained using the so-called
Hoeffding inequality.
DZ Ý Data Science MMW 2018 October 10, 2018 60 / 127
61. Bounds on the Generalization Error
Theorem (Hoeffding's inequality): Let $Z_1, Z_2, \cdots, Z_n$ be a collection
of i.i.d. random variables with $Z_i \in [a, b]$. Then, $\forall\epsilon > 0$,
$$\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - E[Z]\right| > \epsilon\right] \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)$$
Corollary (Hoeffding's inequality for sample proportions): Let
$Z_1, Z_2, \cdots, Z_n$ be a collection of i.i.d. random variables from a
Bernoulli distribution with "success" probability p. Let
$p_n = \frac{1}{n}\sum_{i=1}^{n} Z_i$. Clearly, $p_n \in [0, 1]$ and $E[p_n] = p$.
Therefore, as a direct consequence of the above theorem, we have,
$\forall\epsilon > 0$,
$$\Pr\big[|p_n - p| > \epsilon\big] \leq 2\exp(-2n\epsilon^2)$$
DZ Ý Data Science MMW 2018 October 10, 2018 61 / 127
62. Bounds on the Generalization Error
So we have, $\forall\epsilon > 0$,
$$\Pr\big[|p_n - p| > \epsilon\big] \leq 2\exp(-2n\epsilon^2).$$
Now, setting $\delta = 2\exp(-2n\epsilon^2)$, it is straightforward to see that the
Hoeffding-based 1 − δ level confidence bound on the difference
between R(f) and $R_n(f)$ for a fixed classifier f is given by
$$|R_n(f) - R(f)| < \left(\frac{\ln\frac{2}{\delta}}{2n}\right)^{1/2}.$$
Which of the two bounds is tighter? Clearly, we need to find out
which of $\ln\frac{2}{\delta}$ or $\frac{1}{2\delta}$ is larger. Since $\ln\frac{2}{\delta}$ grows only logarithmically in $1/\delta$
while $\frac{1}{2\delta}$ grows linearly, Hoeffding's bound is tighter for small δ. The graphs also confirm this.
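A minimal sketch (not from the slides) comparing the two bounds numerically; the value δ = 0.05 is assumed, matching the right panel of the figure below.
## Sketch: Chebyshev-based vs Hoeffding-based bounds on |Rn(f) - R(f)|
delta <- 0.05
n     <- seq(100, 12000, by = 100)
chebyshev <- sqrt(1 / (4 * n * delta))
hoeffding <- sqrt(log(2 / delta) / (2 * n))
plot(n, chebyshev, type = "l", col = "blue", xlab = "n = sample size",
     ylab = "bound on |Rn(f) - R(f)|")
lines(n, hoeffding, col = "red")
legend("topright", c("Chebyshev", "Hoeffding"), col = c("blue", "red"), lty = 1)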
DZ Ý Data Science MMW 2018 October 10, 2018 62 / 127
63. Bounds on the Generalization Error
[Figure: Chernoff/Hoeffding versus Chebyshev bounds for proportions as functions of the sample size n, for delta = 0.01 and delta = 0.05.]
DZ Ý Data Science MMW 2018 October 10, 2018 63 / 127
64. Beyond Chernoff and Hoeffding
In all the above, we only addressed pointwise convergence of
$R_n(f)$ to R(f), i.e., for a fixed machine $f \in \mathcal{F}$, we studied the
convergence of $R_n(f)$ to R(f).
Needless to mention that pointwise convergence is of very little
use here.
A more interesting issue to address is uniform convergence. That is,
over all machines $f \in \mathcal{F}$, determine the necessary and sufficient
conditions for the convergence of
$$\sup_{f\in\mathcal{F}} |R_n(f) - R(f)|$$
to 0 in probability.
Clearly, such a study extends the Law of Large Numbers to function
spaces, thereby providing tools for the construction of bounds on the
theoretical errors of learning machines.
DZ Ý Data Science MMW 2018 October 10, 2018 64 / 127
65. Beyond Chernoff and Hoeffding
Since uniform convergence requires the consideration of the entirety
of the function space of interest, care needs to be taken regarding the
dimensionality of the function space.
Uniform convergence will prove substantially easier to handle for finite
function classes than for infinite-dimensional function spaces.
Indeed, for infinite-dimensional spaces, one will need to introduce such
concepts as the capacity of the function space, measured through
devices such as the VC dimension and covering numbers.
DZ Ý Data Science MMW 2018 October 10, 2018 65 / 127
66. Beyond Chernoff and Hoeffding
Theorem: If $R_n(f)$ and R(f) are close for all $f \in \mathcal{F}$, i.e., for some $\epsilon > 0$,
$$\sup_{f\in\mathcal{F}} |R_n(f) - R(f)| \leq \epsilon,$$
then
$$R(f_n) - R(f^+) \leq 2\epsilon.$$
Proof: Recall that we defined $f_n$ as the best function yielded by
the empirical risk $R_n(f)$ in the function class $\mathcal{F}$. Recall also that $R_n(f_n)$
can be made as small as possible, as we saw earlier. Therefore, with $f^+$
being the minimizer of the true risk in class $\mathcal{F}$, we always have
$$R_n(f^+) - R_n(f_n) \geq 0.$$
As a result,
$$R(f_n) = R(f_n) - R(f^+) + R(f^+) \leq R_n(f^+) - R_n(f_n) + R(f_n) - R(f^+) + R(f^+) \leq 2\sup_{f\in\mathcal{F}} |R(f) - R_n(f)| + R(f^+).$$
DZ Ý Data Science MMW 2018 October 10, 2018 66 / 127
67. Beyond Chernoff and Hoeffding
Proof (continued): Since $f_n$ is the empirical risk minimizer in $\mathcal{F}$ and $f^+$
the true risk minimizer in $\mathcal{F}$, we always have $R_n(f^+) - R_n(f_n) \geq 0$, and therefore
$$R(f_n) \leq R_n(f^+) - R_n(f_n) + R(f_n) - R(f^+) + R(f^+) \leq 2\sup_{f\in\mathcal{F}} |R(f) - R_n(f)| + R(f^+).$$
Consequently,
$$R(f_n) - R(f^+) \leq 2\sup_{f\in\mathcal{F}} |R(f) - R_n(f)|,$$
as required.
DZ Ý Data Science MMW 2018 October 10, 2018 67 / 127
68. Beyond Chernoff and Hoeffding
Corollary: A direct consequence of the above theorem is the following:
For a given machine $f \in \mathcal{F}$,
$$R(f) \leq R_n(f) + \left(\frac{\ln\frac{2}{\delta}}{2n}\right)^{1/2}$$
with probability at least 1 − δ, $\forall\delta > 0$.
If the function class $\mathcal{F}$ is finite, i.e.
$$\mathcal{F} = \{f_1, f_2, \cdots, f_m\}$$
where $m = |\mathcal{F}| = \#\mathcal{F}$ = the number of functions in the class $\mathcal{F}$, then it
can be shown that, for all $f \in \mathcal{F}$,
$$R(f) \leq R_n(f) + \left(\frac{\ln m + \ln\frac{2}{\delta}}{2n}\right)^{1/2}$$
with probability at least 1 − δ, $\forall\delta > 0$.
DZ Ý Data Science MMW 2018 October 10, 2018 68 / 127
69. Beyond Chernoff and Hoeffding
It can also be shown that
$$R(\widehat{f}_n) \leq R_n(f^+) + 2\left(\frac{\ln m + \ln\frac{2}{\delta}}{2n}\right)^{1/2} \qquad (12)$$
with probability at least 1 − δ, $\forall\delta > 0$, where, as before,
$$f^+ = \arg\inf_{f\in\mathcal{F}} R(f) \quad \text{and} \quad \widehat{f}_n = \arg\min_{f\in\mathcal{F}} R_n(f).$$
Equation (12) is of foundational importance, because it reveals clearly
that the size of the function class controls the uniform bound on the
crucial generalization error: indeed, as the size m of the function class
$\mathcal{F}$ increases, the best-in-class risk $R(f^+)$ decreases while the complexity
term involving ln m increases, so that the trade-off between the two is
controlled by the size m of the function class.
DZ Ý Data Science MMW 2018 October 10, 2018 69 / 127
70. Vapnik-Chervonenkis Dimension
Definition (Shattering): Let $\mathcal{X} \neq \emptyset$ be any non-empty domain. Let
$\mathcal{F} \subseteq 2^{\mathcal{X}}$ be any non-empty class of concepts (subsets of $\mathcal{X}$). Let $S \subseteq \mathcal{X}$ be
any finite subset of the domain $\mathcal{X}$. Then S is said to be shattered by $\mathcal{F}$ iff
$$\{S \cap f \mid f \in \mathcal{F}\} = 2^S.$$
In other words, $\mathcal{F}$ shatters S if any subset of S can be obtained by
intersecting S with some set from $\mathcal{F}$.
Example: A class $\mathcal{F}$ of classifiers is said to shatter a set
$\{x_1, x_2, \cdots, x_n\}$ of n points if, for any possible configuration of labels
$y_1, y_2, \cdots, y_n$, we can find a classifier $f \in \mathcal{F}$ that reproduces those
labels.
DZ Ý Data Science MMW 2018 October 10, 2018 70 / 127
71. Vapnik-Chervonenkis Dimension
Definition (VC dimension): Let $\mathcal{X} \neq \emptyset$ be any non-empty learning
domain. Let $\mathcal{F} \subseteq 2^{\mathcal{X}}$ be any non-empty class of concepts (subsets of $\mathcal{X}$).
The VC dimension of $\mathcal{F}$ is the cardinality of the largest finite set
$S \subseteq \mathcal{X}$ that is shattered by $\mathcal{F}$, i.e.
$$\mathrm{VCdim}(\mathcal{F}) := \max\big\{|S| : S \subseteq \mathcal{X} \text{ is shattered by } \mathcal{F}\big\}.$$
Note: If arbitrarily large finite sets are shattered by $\mathcal{F}$, i.e. if no finite
bound can be placed on the cardinality of shattered sets, then
$\mathrm{VCdim}(\mathcal{F}) = \infty$.
Example: The VC dimension of a class $\mathcal{F}$ of classifiers is the
largest number of points that $\mathcal{F}$ can shatter.
DZ Ý Data Science MMW 2018 October 10, 2018 71 / 127
72. Vapnik-Chervonenkis Dimension
Remarks: If $\mathrm{VCdim}(\mathcal{F}) = d$, then there exists a finite set $S \subseteq \mathcal{X}$
such that $|S| = d$ and S is shattered by $\mathcal{F}$. Importantly, every set
$S \subseteq \mathcal{X}$ such that $|S| > d$ is not shattered by $\mathcal{F}$. Clearly, we do not
expect to learn anything until we have at least d training points.
Intuitively, this means that an infinite VC dimension is not desirable,
as it could imply the impossibility of learning the concept underlying any
data from the population under consideration. However, a finite VC
dimension does not guarantee the learnability of the concept
underlying the data either.
Fact: Let $\mathcal{F}$ be any finite function (concept) class. Since it
requires $2^d$ distinct concepts to shatter a set of cardinality d, no set of
cardinality greater than $\log_2|\mathcal{F}|$ can be shattered. Therefore, $\log_2|\mathcal{F}|$ is
always an upper bound on the VC dimension of a finite concept class.
DZ Ý Data Science MMW 2018 October 10, 2018 72 / 127
73. Vapnik-Chervonenkis Dimension
To gain insights into the central concept of VC dimension, we herein
consider a few examples of practical interest for which the VC
dimension can be found.
VC dimension of the space of separating hyperplanes: Let
X = Rp be the domain for the binary Y ∈ {−1, +1} classification
task, and consider using hyperplanes to separate the points of X. Let
F denote the class of all such separating hyperplanes. Then,
V Cdim(F) = p + 1
Intuitively, the following pictures for the case of X = R2 help see why
the VC dimension is p + 1.
DZ Ý Data Science MMW 2018 October 10, 2018 73 / 127
74. Foundation Result in Statistical Learning Theory
Theorem (Vapnik and Chervonenkis, 1971): Let $\mathcal{F}$ be a class of
functions implementing some learning machines, and let $\zeta = \mathrm{VCdim}(\mathcal{F})$ be
the VC dimension of $\mathcal{F}$. Let the theoretical and the empirical risks be
defined as earlier, and consider any data distribution in the population of
interest. Then $\forall f \in \mathcal{F}$, the prediction error (theoretical risk) is bounded by
$$R(f) \leq \widehat{R}_n(f) + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}} \qquad (13)$$
with probability at least 1 − η, or
$$\Pr\left[\text{TestError} \leq \text{TrainError} + \sqrt{\frac{\zeta\left(\log\frac{2n}{\zeta} + 1\right) - \log\frac{\eta}{4}}{n}}\ \right] \geq 1 - \eta$$
DZ Ý Data Science MMW 2018 October 10, 2018 74 / 127
75. Confidence Interval for a proportion
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 95% confidence intervals for p built from 100 simulated samples; here 98 of the 100 intervals contain p.]
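A minimal sketch (not from the slides) reproducing this kind of coverage experiment; p, n, the confidence level, and the number of replications are all assumed values.
## Sketch: empirical coverage of the normal-approximation CI for a proportion
set.seed(1)
p <- 0.4; n <- 200; alpha <- 0.05; B <- 100
covered <- logical(B)
for (b in 1:B) {
  phat <- mean(rbinom(n, 1, p))
  half <- qnorm(1 - alpha / 2) * sqrt(phat * (1 - phat) / n)
  covered[b] <- (p >= phat - half) && (p <= phat + half)
}
mean(covered)   # empirical coverage, close to the nominal 95%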
DZ Ý Data Science MMW 2018 October 10, 2018 75 / 127
76. Confidence Interval for a proportion
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 95% confidence intervals for p built from 100 simulated samples; here 94 of the 100 intervals contain p.]
DZ Ý Data Science MMW 2018 October 10, 2018 76 / 127
77. Confidence Interval for a proportion
$$p \in \left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 90% confidence intervals for p built from 100 simulated samples; here 92 of the 100 intervals contain p.]
DZ Ý Data Science MMW 2018 October 10, 2018 77 / 127
78. Confidence Interval for a population mean
$$\mu \in \left(\bar{x} - z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \ \bar{x} + z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 95% confidence intervals for μ built from 100 simulated samples; here 98 of the 100 intervals contain μ.]
DZ Ý Data Science MMW 2018 October 10, 2018 78 / 127
79. Confidence Interval for a population mean
$$\mu \in \left(\bar{x} - z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}},\ \ \bar{x} + z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}\right) \quad \text{with } 100(1-\alpha)\% \text{ confidence}$$
[Figure: 85% confidence intervals for μ built from 100 simulated samples; here 90 of the 100 intervals contain μ.]
DZ Ý Data Science MMW 2018 October 10, 2018 79 / 127
80. Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
DZ Ý Data Science MMW 2018 October 10, 2018 80 / 127
81. VC Bound for Separating Hyperplanes
Let $\mathcal{L}$ represent the function class of linear binary classifiers in q dimensions, i.e.
$$\mathcal{L} = \left\{ f : \exists\, w \in \mathbb{R}^q,\ w_0 \in \mathbb{R},\ f(x) = \mathrm{sign}(w^\top x + w_0),\ \forall x \in \mathcal{X} \right\},$$
then $\mathrm{VCdim}(\mathcal{L}) = h = q + 1$.
With labels taken from {−1, +1}, and using the 0/1 loss function, we
have the fundamental theorem of Vapnik and Chervonenkis, namely:
for every $f \in \mathcal{L}$ and n > h, with probability at least 1 − η, we have
$$R(f) \leq R_n(f) + \sqrt{\frac{h\left(\log\frac{2n}{h} + 1\right) + \log\frac{4}{\eta}}{n}}$$
The above result holds true for LDA.
Data Science MMW 2018 October 10, 2018 81 / 127
82. Appeal of the VC Bound
Note: One of the greatest appeals of the VC bound is that, though
applicable to function classes of infinite dimension, it preserves the
same intuitive form as the bound derived for finite dimensional F.
Essentially, using the VC dimension concept, the number L of
possible labeling configurations obtainable from $\mathcal{F}$ with
$\mathrm{VCdim}(\mathcal{F}) = \zeta$ over n points satisfies
$$L \leq \left(\frac{en}{\zeta}\right)^{\zeta}. \qquad (14)$$
The VC bound is essentially obtained by replacing $\log|\mathcal{F}|$ with $\log L$ in the
expression of the risk bound for finite $\mathcal{F}$.
The most important part of the above theorem is the fact that the
generalization ability of a learning machine depends on both the
empirical risk and the complexity of the class of functions used, which
is measured here by the VC dimension (Vapnik and Chervonenkis, 1971).
DZ Ý Data Science MMW 2018 October 10, 2018 82 / 127
83. Appeal of the VC Bound
Also, the bounds offered here are distribution-free, since no
assumption is made about the distribution of the population.
The details of this important result will be discussed again in chapters
6 and 7, where we will present other measures of the capacity of a
class of functions.
Remark: From the expression of the VC bound, it is clear that an
intuitively appealing way to improve the predictive performance
(reduce prediction error) of a class of machines is to achieve a
trade-off (compromise) between small VC dimension and
minimization of the empirical risk.
At first, it may seem as if the VC dimension acts in a way similar to
the number of parameters, since it serves as a measure of the
complexity of $\mathcal{F}$. In this spirit, the following is a possible guiding
principle.
DZ Ý Data Science MMW 2018 October 10, 2018 83 / 127
84. Appeal of the VC Bound
At first, it may seem as if the VC dimension acts in a way similar
to the number of parameters, since it serves as a measure of the
complexity of $\mathcal{F}$. In this spirit, the following is a possible guiding
principle.
Intuition: One should seek to construct a classifier that
achieves the best trade-off (balance, compromise) between the
complexity of the function class, measured by the VC dimension, and the fit
to the training data, measured by the empirical risk.
Now equipped with this sound theoretical foundation, one can
then go on to the implementation of various learning machines.
We shall use R to discover some of the most commonly used learning
machines.
DZ Ý Data Science MMW 2018 October 10, 2018 84 / 127
88. Motivating Example Regression Analysis
Consider the univariate function $f \in C([0, 2\pi])$ given by
$$f(x) = \frac{\pi}{2}\,x + \frac{3\pi}{4}\cos\!\left(\frac{\pi}{2}(1 + x)\right) \qquad (15)$$
Simulate an artificial iid data set $\mathcal{D} = \{(x_i, y_i),\ i = 1, \cdots, n\}$, with
n = 99 and σ = π/3:
$x_i \in [0, 2\pi]$ drawn deterministically and equally spaced
$Y_i = f(x_i) + \varepsilon_i$ where $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
The R code is
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
n <- 99
x <- seq(0, 2*pi, length=n)
y <- f(x) + rnorm(n, 0, pi/3)
DZ Ý Data Science MMW 2018 October 10, 2018 88 / 127
89. Motivating Example Regression Analysis
Noisy data generated with function (15):
[Figure: scatter plot of the n = 99 noisy (x, y) pairs for x ∈ [0, 2π].]
Question: What is the best hypothesis space to learn the underlying
function?
DZ Ý Data Science MMW 2018 October 10, 2018 89 / 127
90. Bias-Variance Tradeoff in Action
[Figure: fits to the simulated data illustrating the bias-variance trade-off: (a) an underfit and (b) an optimal fit.]
DZ Ý Data Science MMW 2018 October 10, 2018 90 / 127
91. Introduction to Regression Analysis
We have $x_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$, and the data set
$$\mathcal{D} = \{(x_1, Y_1), (x_2, Y_2), \cdots, (x_n, Y_n)\}$$
We assume that the response variable $Y_i$ is related to the explanatory
vector $x_i$ through a function f via the model
$$Y_i = f(x_i) + \xi_i, \quad i = 1, \cdots, n \qquad (16)$$
The explanatory vectors $x_i$ are fixed (non-random)
The regression function $f : \mathbb{R}^p \to \mathbb{R}$ is unknown
The error terms $\xi_i$ are iid Gaussian, i.e. $\xi_i \overset{iid}{\sim} N(0, \sigma^2)$
Goal: We seek to estimate the function f using the data in D.
DZ Ý Data Science MMW 2018 October 10, 2018 91 / 127
92. Formulation of the regression problem
Let X and Y be two random variables such that
$$E[Y] = \mu \quad \text{and} \quad E[Y^2] < \infty.$$
Goal: Find the best predictor f(X) of Y given X.
Important Questions
How does one define ”best”?
Is the very best attainable in practice?
What does the function f look like? (Function class)
How do we select a candidate from the chosen class of functions?
How hard is it computationally to find the desired function?
DZ Ý Data Science MMW 2018 October 10, 2018 92 / 127
93. Loss functions
1 When f(X) is used to predict Y , a loss is incurred.
Question: How is such a loss quantified?
Answer: Define a suitable loss function.
2 Common loss functions in regression
Squared error loss or (ℓ2) loss
ℓ(Y, f(X)) = (Y − f(X))2
ℓ2 is by far the most used (prevalent) because of its differentiability.
Unfortunately, not very robust to outliers.
Absolute error loss or (ℓ1) loss
ℓ(Y, f(X)) = |Y − f(X)|
ℓ1 is more robust to outliers, but not differentiable at zero.
3 Note that ℓ(Y, f(X)) is a random variable.
DZ Ý Data Science MMW 2018 October 10, 2018 93 / 127
94. Risk Functionals and Cost Functions
1 Definition of a risk functional:
$$R(f) = E[\ell(Y, f(X))] = \int_{\mathcal{X}\times\mathcal{Y}} \ell(y, f(x))\, p_{XY}(x, y)\, dx\, dy$$
R(f) is the expected loss over all pairs of the cross space $\mathcal{X}\times\mathcal{Y}$.
2 Ideally, one seeks the best out of all possible functions, i.e.,
$$f^*(X) = \arg\min_{f} R(f) = \arg\min_{f} E[\ell(Y, f(X))]$$
$f^*(\cdot)$ is such that
$$R^* = R(f^*) = \min_{f} R(f)$$
3 This ideal function cannot be found in practice, because the fact that
the distributions are unknown makes it impossible to form an
expression for R(f).
DZ Ý Data Science MMW 2018 October 10, 2018 94 / 127
95. Cost Functions and Risk Functionals
Theorem: Under regularity conditions,
$$f^*(X) = E[Y \mid X] = \arg\min_{f} E[(Y - f(X))^2]$$
Under the squared error loss, the optimal function f* that yields the
best prediction of Y given X is none other than the conditional expectation of
Y given X.
Since we know neither $p_{XY}(x, y)$ nor $p_X(x)$, the conditional
expectation
$$E[Y \mid X = x] = \int_{\mathcal{Y}} y\, p_{Y|X}(y \mid x)\, dy = \int_{\mathcal{Y}} y\, \frac{p_{XY}(x, y)}{p_X(x)}\, dy$$
cannot be directly computed.
DZ Ý Data Science MMW 2018 October 10, 2018 95 / 127
96. Empirical Risk Minimization
Let $\mathcal{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$ represent an iid sample.
The empirical version of the risk functional is
$$\widehat{R}(f) = \widehat{\mathrm{MSE}}(f) = \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2$$
It turns out that $\widehat{R}(f)$ provides an unbiased estimator of R(f).
We therefore seek the best by the empirical standard,
$$\widehat{f}^*(X) = \arg\min_{f} \widehat{\mathrm{MSE}}(f) = \arg\min_{f} \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2$$
Since it is impossible to search all possible functions, it is usually
crucial to choose the "right" function space.
DZ Ý Data Science MMW 2018 October 10, 2018 96 / 127
97. Function spaces
For the function estimation task for instance, one could assume that the
input space X is a closed and bounded interval of IR, i.e. X = [a, b], and
then consider estimating the dependencies between x and y from within
the space $\mathcal{F}$ of all bounded functions on $\mathcal{X} = [a, b]$, i.e.,
$$\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R} \mid \exists B \geq 0 \text{ such that } |f(x)| \leq B \text{ for all } x \in \mathcal{X}\}.$$
One could even be more specific and make the functions of the above F
continuous, so that the space to search becomes
F = {f : [a, b] → IR| f is continuous} = C([a, b]),
which is the well-known space of all continuous functions on a closed and
bounded interval [a, b]. This is indeed a very important function space.
DZ Ý Data Science MMW 2018 October 10, 2018 97 / 127
98. Space of Univariate Polynomials
In fact, polynomial regression consists of searching a function space
that is a subspace of C([a, b]). In other words, when we are doing the very
common polynomial regression, we are searching the space
$$\mathcal{P}([a, b]) = \{f \in C([a, b]) \mid f \text{ is a polynomial with real coefficients}\}.$$
It is interesting to note that Weierstrass proved that $\mathcal{P}([a, b])$ is dense in
C([a, b]). One typically considers the space of all polynomials of some degree p, i.e.,
$$\mathcal{F} = \mathcal{P}^p([a, b]) = \left\{ f \in C([a, b]) \ \middle|\ \exists\, \beta_0, \beta_1, \cdots, \beta_p \in \mathbb{R} \text{ such that } f(x) = \sum_{j=0}^{p} \beta_j x^j, \ \forall x \in [a, b] \right\}$$
DZ Ý Data Science MMW 2018 October 10, 2018 98 / 127
99. Empirical Risk Minimization in F
Having chosen a class $\mathcal{F}$ of functions, we can now seek
$$\widehat{f}(X) = \arg\min_{f\in\mathcal{F}} \widehat{\mathrm{MSE}}(f) = \arg\min_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2$$
We are seeking the best function in the function space chosen.
For instance, if the function space is the space of all polynomials of
degree p on some interval [a, b], finding $\widehat{f}$ boils down to estimating the
coefficients of the polynomial using the data, namely
$$\widehat{f}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x + \widehat{\beta}_2 x^2 + \cdots + \widehat{\beta}_p x^p$$
where, using $\beta = (\beta_0, \beta_1, \cdots, \beta_p)^\top$, we have
$$\widehat{\beta} = \arg\min_{\beta\in\mathbb{R}^{p+1}} \frac{1}{n}\sum_{i=1}^{n}\left( Y_i - \sum_{j=0}^{p} \beta_j x_i^j \right)^2$$
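A minimal sketch (not from the slides) of this polynomial ERM step in R, reusing the simulated data from slide 88 and an assumed degree p = 6; least squares via lm() is exactly empirical risk minimization under the squared error loss.
## Sketch: ERM over degree-p polynomials using lm()
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
set.seed(1)
n <- 99
x <- seq(0, 2*pi, length = n)
y <- f(x) + rnorm(n, 0, pi/3)
p   <- 6                                         # polynomial degree (assumed)
fit <- lm(y ~ poly(x, degree = p, raw = TRUE))    # least squares = ERM under l2 loss
mean(residuals(fit)^2)                            # empirical risk (training MSE)
plot(x, y); lines(x, fitted(fit), col = "red", lwd = 2)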
DZ Ý Data Science MMW 2018 October 10, 2018 99 / 127
100. Important Aspects of Statistical Learning
It is very tempting at first to use the data at hand to find/build the $\widehat{f}$
that makes $\widehat{\mathrm{MSE}}(\widehat{f})$ the smallest. For instance, the higher the value
of p, the smaller $\widehat{\mathrm{MSE}}(\widehat{f}(\cdot))$ will get.
The estimate $\widehat{\beta} = (\widehat{\beta}_0, \widehat{\beta}_1, \cdots, \widehat{\beta}_p)^\top$ of $\beta = (\beta_0, \beta_1, \cdots, \beta_p)^\top$ is a
random variable, and as a result the estimate
$\widehat{f}(x) = \widehat{\beta}_0 + \widehat{\beta}_1 x + \widehat{\beta}_2 x^2 + \cdots + \widehat{\beta}_p x^p$ of f(x) is also a random
variable.
Since $\widehat{f}(x)$ is a random variable, we must compute important aspects
like its bias $B[\widehat{f}(x)] = E[\widehat{f}(x)] - f(x)$ and its variance $V[\widehat{f}(x)]$.
We have a dilemma: If we make $\widehat{f}$ complex (large p), we make the
bias small but the variance is increased. If we make $\widehat{f}$ simple (small
p), we make the bias large but the variance is decreased.
Most of Modern Statistical Learning is rich with model selection
techniques that seek to achieve a trade-off between bias and variance
to get the optimal model. Principle of parsimony (sparsity),
Ockham’s razor principle.
DZ Ý Data Science MMW 2018 October 10, 2018 100 / 127
101. Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
DZ Ý Data Science MMW 2018 October 10, 2018 101 / 127
102. Theoretical Aspects of Statistical Regression Learning
Just as we have a VC bound for classification, there is one for
regression, i.e. when $\mathcal{Y} = \mathbb{R}$ and
$$\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} |y_i - f(x_i)|^2 \ = \text{ empirical squared error loss}.$$
Indeed, for every $f \in \mathcal{F}$, with probability at least 1 − η, we have
$$R(f) \leq \frac{\widehat{R}_n(f)}{\big(1 - c\sqrt{\delta}\big)_+}$$
where
$$\delta = \frac{a}{n}\left( v + v\log\frac{bn}{v} - \log\frac{\eta}{4} \right).$$
Note once again, as before, that these bounds are not asymptotic.
Unfortunately these bounds are known to be very loose in practice.
DZ Ý Data Science MMW 2018 October 10, 2018 102 / 127
103. The pitfalls of memorization and overfitting
The trouble, or limitation, with naively using a criterion on the whole
sample lies in the fact that, given a sample $(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)$, the
function $\widehat{f}_{\mathrm{memory}}$ defined by
$$\widehat{f}_{\mathrm{memory}}(x_i) = y_i, \quad i = 1, \cdots, n$$
always achieves the best apparent performance, since $\widehat{\mathrm{MSE}}(\widehat{f}_{\mathrm{memory}}) = 0$, which is the
minimum achievable.
Where does the limitation of $\widehat{f}_{\mathrm{memory}}$ come from? Well, $\widehat{f}_{\mathrm{memory}}$
does not really learn the dependency between X and Y. While it may
capture some of it, it also grabs a lot of the noise in the data, and ends
up overfitting the data. As a result of not really learning the structure of the
relationship between X and Y and merely memorizing the present
sample values, $\widehat{f}_{\mathrm{memory}}$ will predict very poorly when presented
with observations that were not in the sample.
DZ Ý Data Science MMW 2018 October 10, 2018 103 / 127
104. Training Set Test Set Split
Splitting the data into training set and test set: It makes
sense to judge models (functions) not on how they perform on in-sample
observations, but instead on how they perform on out-of-sample
cases. Given a collection $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$ of pairs,
Randomly split $\mathcal{D}$ into a training set of size $n_{tr}$ and a test set of size
$n_{te}$, such that $n_{tr} + n_{te} = n$
Training set:
$$\mathcal{T}r = \Big\{ \big(x_1^{(tr)}, y_1^{(tr)}\big), \big(x_2^{(tr)}, y_2^{(tr)}\big), \cdots, \big(x_{n_{tr}}^{(tr)}, y_{n_{tr}}^{(tr)}\big) \Big\}$$
Test set:
$$\mathcal{T}e = \Big\{ \big(x_1^{(te)}, y_1^{(te)}\big), \big(x_2^{(te)}, y_2^{(te)}\big), \cdots, \big(x_{n_{te}}^{(te)}, y_{n_{te}}^{(te)}\big) \Big\}$$
DZ Ý Data Science MMW 2018 October 10, 2018 104 / 127
105. Training Set Test Set Split
For each function class F (linear models, nonparametrics, etc ...)
Find the best in its class based on the training set Tr
For all the estimated functions ˆf1, ˆf2, · · · , ˆfm, compute the training
error
MSETr( ˆfj) = (1/ntr) Σ_{i=1}^{ntr} ( y_i^(tr) − ˆfj(x_i^(tr)) )²
For all the estimated functions ˆf1, ˆf2, · · · , ˆfm, compute the test error
MSETe( ˆfj) = (1/nte) Σ_{i=1}^{nte} ( y_i^(te) − ˆfj(x_i^(te)) )²
Compute the averages of both MSETr and MSETe over many random
splits of the data, and tabulate (if necessary) those averages.
Select ˆfj∗ such that
mean[MSETe( ˆfj∗ )] < mean[MSETe( ˆfj)], j = 1, 2, · · · , m, j ≠ j∗
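A hedged sketch of the whole procedure, with two polynomial degrees standing in for the function classes (any other estimators could be slotted into the inner loop in the same way; the data are the simulated example from before):

# Average MSE_Tr and MSE_Te over many random splits for two candidate classes
set.seed(4)
dat     <- data.frame(x = x, y = y)
degrees <- c(3, 10)                            # two illustrative function classes
nsplits <- 50
res <- array(NA, dim = c(nsplits, length(degrees), 2),
             dimnames = list(NULL, paste0("poly", degrees), c("train", "test")))
for (r in 1:nsplits) {
  idx <- sample(nrow(dat), floor(2 * nrow(dat) / 3))
  for (j in seq_along(degrees)) {
    fit <- lm(y ~ poly(x, degrees[j]), data = dat[idx, ])
    res[r, j, "train"] <- mean((dat$y[idx]  - fitted(fit))^2)
    res[r, j, "test"]  <- mean((dat$y[-idx] - predict(fit, newdata = dat[-idx, ]))^2)
  }
}
apply(res, c(2, 3), mean)    # pick the class with the smallest mean test MSE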
DZ Ý Data Science MMW 2018 October 10, 2018 105 / 127
106. Computational Comparisons
Ideally, we would like to compare the true theoretical performances
measured by the risk functional
R(f) = E[ℓ(Y, f(X))] = ∫_{X×Y} ℓ(y, f(x)) dP(x, y).   (17)
Instead, we build the estimators using other optimality criteria, and
then compare their predictive performances using the average test
error AVTE(·), namely
AVTE(f) = (1/R) Σ_{r=1}^{R} [ (1/m) Σ_{t=1}^{m} ℓ( y_{it}^(r), f_r(x_{it}^(r)) ) ],   (18)
where f_r(·) is the r-th realization of the estimator f(·) built using the
training portion of the split of D into training set and test set, and
(x_{it}^(r), y_{it}^(r)) is the t-th observation from the test set at the r-th
random replication of the split of D.
DZ Ý Data Science MMW 2018 October 10, 2018 106 / 127
107. Learning Machines when n ≪ p
Machines inherently designed to handle p larger than n problems
Classification and Regression Trees
Support Vector Machines
Relevance Vector Machines (n < 500)
Gaussian Process Learning Machines (n < 500)
k-Nearest Neighbors Learning Machines (Watch for the curse of
dimensionality)
Kernel Machines in general
Machines that cannot inherently handle p larger than n problems, but
can do so if regularized with suitable constraints (see the sketch below)
Multiple Linear Regression Models
Generalized Linear Models
Discriminant Analysis
Ensemble Learning Machines
Random Subspace Learning Ensembles (Random Forest)
Boosting and its extensions
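For instance (a sketch on simulated data; glmnet, loaded later in these slides, fits a regularized linear model even though p is ten times larger than n, where ordinary least squares would not be identifiable):

# A regularized linear model when p >> n (lasso via glmnet)
library(glmnet)
set.seed(5)
n <- 50; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))            # only three active predictors
y <- drop(X %*% beta + rnorm(n))
cvfit <- cv.glmnet(X, y, alpha = 1)             # alpha = 1: lasso (l1) constraint
coef(cvfit, s = "lambda.min")[1:6, ]            # intercept + first five coefficients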
DZ Ý Data Science MMW 2018 October 10, 2018 107 / 127
108. Motivating Example: Regression Analysis
Consider the univariate function f ∈ C([−1, +1]) given by
f(x) = −x + √2 sin(π^{3/2} x²).   (19)
Simulate an artificial iid data set D = {(xi, yi), i = 1, · · · , n}, with
n = 99 and σ = 3/10:
xi ∈ [−1, +1] drawn deterministically and equally spaced
Yi = f(xi) + εi, with εi iid∼ N(0, σ²)
The R code is
n <- 99
f <- function(x){-x + sqrt(2)*sin(pi^(3/2)*x^2)}
x <- seq(-1, +1, length=n)
y <- f(x) + rnorm(n, 0, 3/10)
DZ Ý Data Science MMW 2018 October 10, 2018 108 / 127
109. Estimation Error and Prediction Error
Figure: Predictive regression with confidence and prediction bands. Simple
orthogonal polynomial regression with both confidence bands and prediction
bands on the test set; the plot shows the data points (drawn as "1"), the fitted
curve, and the lower/upper confidence and prediction bands of f(xnew) against
xnew. The true function is f(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1].
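A figure of this kind can be produced along the following lines (a sketch; the polynomial degree 10 and the plotting choices are assumptions made for illustration, not the exact settings behind the slide):

# Orthogonal polynomial regression with confidence and prediction bands
set.seed(6)
n <- 99
f <- function(x) -x + sqrt(2)*sin(pi^(3/2)*x^2)
x <- seq(-1, 1, length = n)
y <- f(x) + rnorm(n, 0, 0.3)
fit  <- lm(y ~ poly(x, 10))                          # degree chosen for illustration
grid <- data.frame(x = seq(-1, 1, length = 200))
conf <- predict(fit, grid, interval = "confidence")
pred <- predict(fit, grid, interval = "prediction")
plot(x, y, pch = "1", col = "grey40", xlab = "xnew", ylab = "f(xnew)",
     main = "Predictive Regression with confidence and prediction bands")
lines(grid$x, conf[, "fit"], lwd = 2)                               # the fit
matlines(grid$x, conf[, c("lwr", "upr")], lty = 2, col = "blue")    # confidence bands
matlines(grid$x, pred[, c("lwr", "upr")], lty = 3, col = "red")     # prediction bands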
DZ Ý Data Science MMW 2018 October 10, 2018 109 / 127
110. Training Error and Test Error
Table: Average training error and average test error over m = 10 random splits
of n = 300 observations generated from a population with true function
f(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1]. The noise variance in this case is
σ² = 0.3². Each split has ntr = 2n/3.

                          Approximating Function Class
                          Poly     SVM      RVM      GPR
Average Training Error    0.0998   0.0335   0.0295   0.1861
Average Test Error        0.3866   0.1465   0.1481   0.1556
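The exact numbers above depend on tuning details not shown here, but the three kernel-based classes can be fit in a few lines with kernlab (loaded later in these slides); this is only a sketch with default kernels and settings, not the configuration behind the table:

# Fitting SVM, RVM and GPR regressions with kernlab (default settings)
library(kernlab)
set.seed(10)
n <- 300
x <- matrix(seq(-1, 1, length = n), ncol = 1)
y <- drop(-x + sqrt(2)*sin(pi^(3/2)*x^2)) + rnorm(n, 0, 0.3)
svm_fit <- ksvm(x, y)          # support vector regression (eps-svr by default)
rvm_fit <- rvm(x, y)           # relevance vector machine
gpr_fit <- gausspr(x, y)       # Gaussian process regression
sapply(list(SVM = svm_fit, RVM = rvm_fit, GPR = gpr_fit),
       function(m) mean((y - predict(m, x))^2))   # in-sample MSE; test MSE needs a held-out split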
DZ Ý Data Science MMW 2018 October 10, 2018 110 / 127
112. Finding Patterns in Job Sector Allocations in Europe
Example 1: Consider the following portion of observations on job sectors
distribution in Europe in the 1990s.
Agr Min Man PS Con SI Fin SPS TC
Italy 15.9 0.6 27.6 0.5 10.0 18.1 1.6 20.1 5.7
Poland 31.1 2.5 25.7 0.9 8.4 7.5 0.9 16.1 6.9
Rumania 34.7 2.1 30.1 0.6 8.7 5.9 1.3 11.7 5.0
USSR 23.7 1.4 25.8 0.6 9.2 6.1 0.5 23.6 9.3
Denmark 9.2 0.1 21.8 0.6 8.3 14.6 6.5 32.2 7.1
France 10.8 0.8 27.5 0.9 8.9 16.8 6.0 22.6 5.7
1 Can European countries be divided into meaningful groups (clusters)?
2 How many concepts? How many clusters (groups) of countries?
Analogy: Clustering in such an example can be thought of as unsupervised
classification (pattern recognition)
DZ Ý Data Science MMW 2018 October 10, 2018 112 / 127
113. Hierarchical Clustering for European Job Sector Data
One solution: Mining Job Sectors in Europe in the 1990s via Hierarchical
Clustering with Manhattan distance and ward linkage.
Figure: Cluster dendrogram of the European countries, obtained from
hclust (*, "ward") applied to dist(europe, method = "manhattan"); the
vertical axis is the dendrogram height.
How does the distance affect the
clustering?
How does the linkage affect the
clustering?
What makes a clustering
satisfactory? How does one compare
two clusterings?
Some interesting tasks:
1 Investigate different distances with same linkage
2 Investigate different linkages with same distance
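A sketch of how such a dendrogram is obtained in R, assuming europe is a data frame of the sector percentages with the countries as row names (note that newer versions of R spell this linkage "ward.D"):

# Hierarchical clustering of the European job-sector data
d  <- dist(europe, method = "manhattan")   # try "euclidean", "maximum", ... as variations
hc <- hclust(d, method = "ward.D")         # called simply "ward" in older versions of R
plot(hc, main = "Cluster Dendrogram")
rect.hclust(hc, k = 4)                     # inspect, say, a 4-cluster cut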
DZ Ý Data Science MMW 2018 October 10, 2018 113 / 127
114. Extracting Patterns of Voting in America
Example 2: Percentages of Votes given to the U. S. Republican
Presidential Candidate - 1856-1976.
X1856 X1860 X1864 X1868 X1900 X1904 X1908
Alabama NA NA NA 51.44 34.67 20.65 24.38
Arkansas NA NA NA 53.73 35.04 40.25 37.31
California 18.77 32.96 58.63 50.24 54.48 61.90 55.46
Colorado NA NA NA NA 42.04 55.27 46.88
Connecticut 53.18 53.86 51.38 51.54 56.94 58.13 59.43
Delaware 2.11 23.71 48.20 40.98 53.65 54.04 52.09
Florida NA NA NA NA 19.03 21.15 21.58
1 Can the states be grouped into clusters of republican-ness?
2 How do missing values influence the clustering? (see the sketch below)
Analogy: Again, clustering in such an example can be thought of as
unsupervised classification (pattern recognition)
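One simple way to probe the second question (a sketch: the votes.repub data shipped with the cluster package is assumed to be the same 1856–1976 table, and column-mean imputation is just one naive choice among many):

# How do missing values influence the clustering of the Republican votes data?
library(cluster)                              # provides votes.repub (with NAs)
data(votes.repub)
imp <- votes.repub
for (j in seq_along(imp))                     # naive column-mean imputation
  imp[is.na(imp[, j]), j] <- mean(imp[, j], na.rm = TRUE)
hc_imp  <- hclust(dist(imp), method = "ward.D")
keep    <- colSums(is.na(votes.repub)) == 0   # alternative: drop years with any NA
hc_comp <- hclust(dist(votes.repub[, keep]), method = "ward.D")
table(cutree(hc_imp, 5), cutree(hc_comp, 5))  # compare the two 5-cluster assignments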
DZ Ý Data Science MMW 2018 October 10, 2018 114 / 127
115. Example: Image Denoising
For an observed image of size r × c, posit the model
y = Wx + z. (20)
The original image is represented by a p × 1 vector x, which makes W a
matrix of dimension q × p, where q = rc. We therefore have
z⊤ = (z1, · · · , zq) ∈ IR^q, x⊤ = (x1, · · · , xp) ∈ IR^p, y⊤ = (y1, · · · , yq) ∈ IR^q.
DZ Ý Data Science MMW 2018 October 10, 2018 115 / 127
116. Example: Image Denoising
Expression of the solution: If E(x) = ‖y − Wx‖² + λ‖x‖₁ is our
objective function to be minimized, and ˆx is a point at which the
minimum is achieved, then we will write
ˆx = arg min_{x ∈ IR^p} { ‖y − Wx‖² + λ‖x‖₁ }.   (21)
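In practice ˆx can be computed with an off-the-shelf l1 solver; the sketch below uses glmnet, whose objective matches (21) up to its own 1/(2q) scaling of the quadratic term (so its lambda is a rescaled version of λ), on a small simulated W and y:

# Computing an l1-penalized solution xhat = argmin ||y - W x||^2 + lambda ||x||_1
library(glmnet)
set.seed(7)
q <- 200; p <- 400
W <- matrix(rnorm(q * p), q, p)
x_true <- c(rep(2, 10), rep(0, p - 10))         # a sparse signal standing in for the image
y <- drop(W %*% x_true + rnorm(q, sd = 0.5))
fit   <- glmnet(W, y, alpha = 1, intercept = FALSE, standardize = FALSE)
x_hat <- as.numeric(coef(fit, s = 0.1))[-1]     # drop the intercept entry
sum(x_hat != 0)                                 # the l1 penalty drives most entries to exactly 0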
DZ Ý Data Science MMW 2018 October 10, 2018 116 / 127
117. Example: Recommender System
Consider a system in which n customers have access to p different
products, like movies, clothing, rental cars, etc ...
A1 A2 · · · Aj · · · Ap
C1
C2
...
Ci w(i, j)
...
Cn
Table: Typical Representation of a Recommender System
The value of w(i, j) is the rating assigned to article Aj by customer Ci.
DZ Ý Data Science MMW 2018 October 10, 2018 117 / 127
118. Example: Recommender System
The main ingredient in Recommender Systems is the matrix
W =
w11  w12  · · ·  w1j  · · ·  w1p
w21  w22  · · ·  w2j  · · ·  w2p
 ⋮    ⋮          ⋮          ⋮
wi1  wi2  · · ·  wij  · · ·  wip
 ⋮    ⋮          ⋮          ⋮
wn1  wn2  · · ·  wnj  · · ·  wnp
The matrix W is typically very (and I mean very) sparse, which makes
sense because people can only consume so many articles, and there
are articles some people will never consume even if they are suggested.
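Because of this, W is usually stored in a sparse-matrix format that keeps only the observed ratings; a small sketch with the Matrix package (all indices and ratings below are made-up toy values):

# Storing the ratings matrix W sparsely: only the observed w(i, j) are kept
library(Matrix)
i <- c(1, 1, 2, 5, 7)                          # customers C_i (toy values)
j <- c(2, 9, 4, 9, 1)                          # articles  A_j
w <- c(5, 3, 4, 2, 5)                          # observed ratings w(i, j)
W <- sparseMatrix(i = i, j = j, x = w, dims = c(1000, 2000))
summary(W)                                     # just the five stored (i, j, w) triplets
print(object.size(W), units = "Kb")            # a few Kb, versus roughly 16 Mb if stored densely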
DZ Ý Data Science MMW 2018 October 10, 2018 118 / 127
119. Time Series and State Space Models
DZ Ý Data Science MMW 2018 October 10, 2018 119 / 127
120. IID Process and White Noise
(Left) White noise process (Right) IID Process.
What is the statistical model (if any) underlying the data?
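Panels like these can be generated in a couple of lines (a sketch; Gaussian iid draws are used for both series here, which is one common way such illustrations are produced):

# Simulating an iid sequence and a white-noise sequence of length 200
set.seed(8)
X <- rnorm(200)                  # iid N(0, 1) draws
W <- rnorm(200)                  # a second, independent sequence
par(mfrow = c(1, 2))
plot(ts(X)); plot(ts(W))         # y-axes labelled ts(X) and ts(W), as on the slide
par(mfrow = c(1, 1))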
DZ Ý Data Science MMW 2018 October 10, 2018 120 / 127
121. Random Walk in 1d and 2d
(Left) Random walk in 1 dimension (Right) Random Walk in 2
dimensions (plane).
What is the statistical model (if any) underlying the data?
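The two walks themselves are just cumulative sums of iid steps (a sketch):

# Random walk in 1d and in 2d via cumulative sums of iid N(0, 1) steps
set.seed(9)
X  <- cumsum(rnorm(200))                 # 1d random walk
Xt <- cumsum(rnorm(200))
Yt <- cumsum(rnorm(200))                 # an independent walk for the second coordinate
par(mfrow = c(1, 2))
plot(ts(X))                              # left: the walk against time
plot(Xt, Yt, type = "l", xlab = "Xt", ylab = "Yt")   # right: the path in the plane
par(mfrow = c(1, 1))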
DZ Ý Data Science MMW 2018 October 10, 2018 121 / 127
122. Real life Time Series: Air Passengers and Sunspots
(Left) Number of airline passengers (Right) Longstanding Sunspots
data.
What is the statistical model (if any) underlying the data?
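Both series ship with base R, so the two panels can be reproduced directly (a sketch; the exact time window shown on the slide may differ from the full built-in series):

# Two classic real-life time series available in base R
par(mfrow = c(1, 2))
plot(AirPassengers)                      # monthly international airline passengers, 1949-1960
plot(sunspots)                           # monthly sunspot numbers, 1749-1983
par(mfrow = c(1, 1))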
DZ Ý Data Science MMW 2018 October 10, 2018 122 / 127
123. Existing Computing Tools
Do the following
install.packages(’ctv’)
library(ctv)
install.views(’MachineLearning’)
install.views(’HighPerformanceComputing’)
install.views(’TimeSeries’)
install.views(’Bayesian’)
R packages for big data
library(biglm)
library(foreach)
library(glmnet)
library(kernlab)
library(randomForest)
library(ada)
library(audio)
library(rpart)
DZ Ý Data Science MMW 2018 October 10, 2018 123 / 127
124. Some Remarks and Recommendations
Applications: Sharpen your intuition and your common sense by
questioning things, reading about interesting open applied problems,
and attempting to solve as many problems as possible.
Methodology: Read and learn about the fundamentals of statistical
estimation and inference, get acquainted with the most commonly
used methods and techniques, and consistently ask yourself and
others what the natural extensions of the techniques could be.
Computation: Learn and master at least two programming languages.
I strongly recommend getting acquainted with R
http://www.r-project.org
Theory: "Nothing is more practical than a good theory" (Vladimir N.
Vapnik). When it comes to data mining, machine learning, and
predictive analytics, those who truly understand the inner workings of
algorithms and methods always solve problems better.
DZ Ý Data Science MMW 2018 October 10, 2018 124 / 127
125. Machine Learning CRAN Task View in R
Let’s visit the website where most of the R community goes
http://www.r-project.org
Let’s install some packages and get started
install.packages(’ctv’)
library(ctv)
install.views(’MachineLearning’)
install.views(’HighPerformanceComputing’)
install.views(’Bayesian’)
install.views(’Robust’)
Let’s load a couple of packages and explore
library(e1071)
library(MASS)
library(kernlab)
DZ Ý Data Science MMW 2018 October 10, 2018 125 / 127
126. Clarke, B., Fokoué, E. and Zhang, H. H. (2009). Principles and
Theory for Data Mining and Machine Learning. Springer Verlag,
New York. ISBN: 978-0-387-98134-5.
DZ Ý Data Science MMW 2018 October 10, 2018 126 / 127
127. References
Clarke, B., Fokoué, E. and Zhang, H. H. (2009). Principles and
Theory for Data Mining and Machine Learning. Springer Verlag,
New York. ISBN: 978-0-387-98134-5.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An
Introduction to Statistical Learning with Applications in R.
Springer, New York. e-ISBN: 978-1-4614-7138-7.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley. ISBN:
978-0-471-03003-4.
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory.
Springer. ISBN: 978-1-4757-3264-1.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining, Inference, and
Prediction. 2nd Edition. Springer. ISBN: 978-0-387-84858-7.
DZ Ý Data Science MMW 2018 October 10, 2018 127 / 127