Foundations of Statistical Learning Theory
Quintessential Pillar of Modern Data Science
Ernest Fokoué
School of Mathematical Sciences
Rochester Institute of Technology
Rochester, New York, USA
Delivered by invitation of the
Statistical and Mathematical Sciences Institute (SAMSI)
Modern Mathematics Workshop (MMW 2018)
San Antonio, Texas, USA
October 10, 2018
Acknowledgments
I wish to express my sincere gratitude to the Director
of SAMSI, Prof. Dr. David Banks,
for kindly inviting me and granting
me the golden opportunity to present
at the 2018 Modern Mathematics
Workshop in San Antonio.
I hope and pray that my modest
contribution will inspire and empower
all the attendees of my mini course.
Basic Introduction to Statistical Machine Learning
Roadmap: This lecture will provide you with the basic elements of an
introduction to the foundational concepts of statistical machine learning.
Among other things, we’ll touch on foundational concepts such as:
Input space, output space, function space, hypothesis space, loss
function, risk functional, theoretical risk, empirical risk, Bayes risk,
training set, test set, model complexity, generalization error,
approximation error, estimation error, bounds on the generalization
error, regularization, etc.
Relevant websites
http://www.econ.upf.edu/~lugosi/mlss_slt.pdf
https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space
Kernel Machines http://www.kernel-machines.org/
R Software project website: http://www.r-project.org
Traditional Pattern Recognition Applications
Statistical Machine Learning Methods and Techniques have been
successfully applied to a wide variety of important fields. Amongst others:
1 The famous and somewhat ubiquitous handwritten digit recognition.
This data set is also known as MNIST, and is usually the first task in
some Data Analytics competitions. This data set is from USPS and
was first made popular by Yann LeCun, the co-inventor of Deep
Learning.
2 More recently, text mining, and specifically the topic of text
categorization/classification, has made successful use of statistical
machine learning.
3 Credit Scoring is another application that has been connected with
statistical machine learning.
4 Disease diagnostics has also been tackled using statistical machine
learning.
Other applications include: audio processing, speaker recognition and
speaker identification.
Handwritten Digit Recognition
Handwritten digit recognition is a fascinating problem that captured the
attention of the machine learning and neural network community for many
years, and has remained a benchmark problem in the field.
Figure: sample 28 × 28 images of the handwritten digits 0 through 9.
Handwritten Digit Recognition
Below is a portion of the benchmark training set
Note: The challenge here is building classification techniques that
accurately classify handwritten digits taken from the test set.
Pattern Recognition (Classification) data set
pregnant glucose pressure triceps insulin mass pedigree age diabetes
6 148 72 35 0 33.60 0.63 50 pos
1 85 66 29 0 26.60 0.35 31 neg
8 183 64 0 0 23.30 0.67 32 pos
1 89 66 23 94 28.10 0.17 21 neg
0 137 40 35 168 43.10 2.29 33 pos
5 116 74 0 0 25.60 0.20 30 neg
3 78 50 32 88 31.00 0.25 26 pos
10 115 0 0 0 35.30 0.13 29 neg
2 197 70 45 543 30.50 0.16 53 pos
8 125 96 0 0 0.00 0.23 54 pos
4 110 92 0 0 37.60 0.19 30 neg
10 168 74 0 0 38.00 0.54 34 pos
10 139 80 0 0 27.10 1.44 57 neg
1 189 60 23 846 30.10 0.40 59 pos
What are the factors responsible for diabetes?
library(mlbench); data(PimaIndiansDiabetes)
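As a quick illustration (a minimal sketch added here, not part of the original slides), one might fit a plain logistic regression classifier to this data set; the variable names are those of the mlbench data frame, and the 0.5 cutoff is an arbitrary illustrative choice.

library(mlbench)
data(PimaIndiansDiabetes)
# logistic regression of diabetes status on all the predictors
fit <- glm(diabetes ~ ., data = PimaIndiansDiabetes, family = binomial)
summary(fit)                                  # which factors appear associated with diabetes?
pred <- ifelse(predict(fit, type = "response") > 0.5, "pos", "neg")
mean(pred != PimaIndiansDiabetes$diabetes)    # empirical (training) error rate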
Pattern Recognition (Classification) data set
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 Class
0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 n
0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 n
0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 n
0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 ei
0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 ie
0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 ie
0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 ei
1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 n
0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 n
0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 n
0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 ie
1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 n
1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 ie
What are the indicators that control promoter genes in the DNA?
library(mlbench); data(DNA)
Pattern Recognition (Classification) data set
Class X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 x11 X12 X13 X14
+ g c c t t c t c c a a a a c
+ a t g c a a t t t t t t a g
+ c c g t t t a t t t t t t c
+ t c t c a a c g t a a c a c
+ t a g g c a c c c c a g g c
+ a t a t a a a a a a g t t c
+ c a a g g t a g a a t g c t
+ t t a g c g g a t c c t a c
+ c t g c a a t t t t t c t a
+ t g t a a a c t a a t g c c
+ c a c t a a t t t a t t c c
+ a g g g g c a a g g a g g a
+ c c a t c a a a a a a a t a
+ a t g c a t t t t t c c g c
+ t c a g a a a t a t t a t g
What are the indicators that control promoter genes in the DNA?
library(kernlab); data(promotergene)
Statistical Speaker Accent Recognition
Consider Xi = (xi1, · · · , xip)⊤ ∈ Rp and Yi ∈ {−1, +1}, and the set
D = {(X1, Y1), (X2, Y2), · · · , (Xn, Yn)}
where
Yi = +1 if person i is a Native US speaker, and Yi = −1 if person i is a Non-Native US speaker,
and Xi = (xi1, · · · , xip)⊤ ∈ Rp is the time domain representation of
his/her reading of an English sentence. The design matrix is

X =
  [ x11  x12  · · ·  x1j  · · ·  x1p ]
  [  ⋮     ⋮          ⋮           ⋮  ]
  [ xi1  xi2  · · ·  xij  · · ·  xip ]
  [  ⋮     ⋮          ⋮           ⋮  ]
  [ xn1  xn2  · · ·  xnj  · · ·  xnp ]
Statistical Speaker Accent Recognition
Consider this design matrix
X =
  [ x11  x12  · · ·  x1j  · · ·  x1p ]
  [  ⋮     ⋮          ⋮           ⋮  ]
  [ xi1  xi2  · · ·  xij  · · ·  xip ]
  [  ⋮     ⋮          ⋮           ⋮  ]
  [ xn1  xn2  · · ·  xnj  · · ·  xnp ]
At RIT, we recently collected voices from n = 117 people.
Each sentence required about 11 seconds to be read.
At a sampling rate of 44,100 Hz, each sentence requires a vector of
dimension roughly p = 540000 in the time domain.
We therefore have a severely underdetermined system with X ∈ IRn×p
where n ≪ p. Here, n = 117 and p = 540000.
Binary Classification in the Plane, X ⊂ R2
Given {(x1, y1), · · · , (xn, yn)}, with xi ∈ X ⊂ R2 and yi ∈ {−1, +1}
Figure: scatter plot of the two groups (red and green points) in the (x1, x2) plane.
What is the "best" classifier f∗ that separates the red from the green?
Motivating Binary Classification in the Plane
For the binary classification problem introduced earlier:
– A collection {(x1, y1), · · · , (xn, yn)} of i.i.d. observations is given
xi ∈ X ⊂ Rp, i = 1, · · · , n. X is the input space.
yi ∈ {−1, +1}. Y = {−1, +1} is the output space.
– What is the probability law that governs the (xi, yi)’s?
– What is the functional relationship between x and y? Namely one
considers mappings
f : X → Y
x → f(x),
– What is the ”best” approach to determining from the available
observations, the relationship f between x and y in such a way that,
given a new (unseen) observation xnew, its class ynew can be
predicted by f(xnew) as accurately and precisely as possible, that is,
with the smallest possible discrepancy.
Basic Remarks on Classification
While some points clearly belong to one of the classes, there are other
points that are either strangers in a foreign land, or are positioned in
such a way that no automatic classification rule can clearly determine
their class membership.
One can construct a classification rule that puts all the points in their
corresponding classes. Such a rule would prove disastrous in
classifying new observations not present in the current collection of
observations.
Indeed, we have a collection of pairs (xi, yi) of observations coming
from some unknown distribution P(x, y).
Basic Remarks on Classification
Finding an automatic classification rule that achieves the absolute
very best on the present data is not enough since infinitely many more
observations can be generated by P(x, y) for which good classification
will be required.
Even the universally best classifier will make mistakes.
Of all the functions in YX , it is reasonable to assume that there is a
function f∗ that maps any x ∈ X to its corresponding y ∈ Y, i.e.,
f∗ : X → Y
x → f∗(x),
with the minimum number of mistakes.
Theoretical Risk Minimization
Let f denote any generic function mapping an element x of X to its
corresponding image f(x) in Y.
Each time x is drawn from P(x), the disagreement between the image
f(x) and the true image y is called the loss, denoted by ℓ(y, f(x)).
The expected value of this loss function with respect to the
distribution P(x, y) is called the risk functional of f. Generically, we
shall denote the risk functional of f by R(f), so that
R(f) = E[ℓ(Y, f(X))] = ∫ ℓ(y, f(x)) dP(x, y).
The best function f∗ over the space YX of all measurable functions
from X to Y is therefore
f∗ = arg inf_f R(f),
so that
R(f∗) = R∗ = inf_f R(f).
On the need to reduce the search space
Unfortunately, f∗ can only be found if P(x, y) is known. Therefore,
since we do not know P(x, y) in practice, it is hopeless to determine
f∗.
Besides, trying to find f∗ without the knowledge of P(x, y) implies
having to search the infinite dimensional function space YX of all
mappings from X to Y, which is an ill-posed and computationally
nasty problem.
Throughout this lecture, we will seek to solve the more reasonable
problem of choosing, from a function space F ⊂ YX, the one function
f ∈ F that best estimates the dependencies between x and y.
It is therefore important to define what is meant by best estimates.
For that, the concepts of loss function and risk functional need to be
defined.
Loss and Risk in Pattern Recognition
For this classification/pattern recognition task, the so-called 0-1 loss function
defined below is used. More specifically,
ℓ(y, f(x)) = 1{y ≠ f(x)} = 0 if y = f(x), and 1 if y ≠ f(x).   (1)
The corresponding risk functional is
R(f) = ∫ ℓ(y, f(x)) dP(x, y) = E[1{Y ≠ f(X)}] = Pr_{(X,Y)∼P}[Y ≠ f(X)].
The minimizer of the 0-1 risk functional over all possible classifiers is the
so-called Bayes classifier, which we shall denote here by f∗, given by
f∗ = arg inf_f Pr_{(X,Y)∼P}[Y ≠ f(X)].
Specifically, the Bayes classifier f∗ is given by the posterior probability of
class membership, namely
f∗(x) = arg max_{y∈Y} Pr[Y = y|x].
Bayes Learner for known situations
If p(x|y = +1) = MVN(x, µ+1, Σ) and p(x|y = −1) = MVN(x, µ−1, Σ), the
Bayes classifier f∗, the classifier that achieves the Bayes risk, coincides
with the population Linear Discriminant Analysis (LDA), fLDA, which, for
any new point x, yields the predicted class
f∗(x) = fLDA(x) = sign(β0 + β⊤x),
where
β = Σ−1(µ+1 − µ−1),
and
β0 = −(1/2)(µ+1 + µ−1)⊤Σ−1(µ+1 − µ−1) + log(π+1/π−1),
with π+1 = Pr[Y = +1] and π−1 = 1 − π+1 representing the prior
probabilities of class membership.
Bayes Risk for known situations
Bayes Risk in Binary Classification under Gaussian Class Conditional
Densities with common covariance matrix: Let x = (x1, x2, · · · , xp)⊤ be a
p-dimensional vector coming from either class +1 or class −1. Let f be a
function (classifier) that seeks to map x to y ∈ {−1, +1} as accurately as
possible. Let R∗ = min_f {Pr[f(X) ≠ Y]} be the Bayes Risk, i.e. the
smallest error rate among all possible f. If p(x|y = +1) = MVN(x, µ+1, Σ)
and p(x|y = −1) = MVN(x, µ−1, Σ), then
R∗ = R(f∗) = Φ(−√∆/2) = ∫_{−∞}^{−√∆/2} (1/√(2π)) e^{−z²/2} dz,
with
∆ = (µ+1 − µ−1)⊤Σ−1(µ+1 − µ−1).
Loss Functions for Classification
With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x))
Zero-one (0/1) loss
ℓ(y, f(x)) = 1(y ≠ f(x)) = 1(yh(x) < 0)
Hinge loss
ℓ(y, f(x)) = max(1 − yh(x), 0) = (1 − yh(x))+
Logistic loss
ℓ(y, f(x)) = log(1 + exp(−yh(x)))
Exponential loss
ℓ(y, f(x)) = exp(−yh(x))
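As a small aside (a sketch added here, not in the original slides), these four losses are easy to code directly as functions of the margin yh(x); the plotting range is an arbitrary choice.

# classification losses as functions of the margin m = y*h(x)
zero.one    <- function(m) as.numeric(m < 0)
hinge       <- function(m) pmax(1 - m, 0)
logistic    <- function(m) log(1 + exp(-m))
exponential <- function(m) exp(-m)

m <- seq(-3, 3, length.out = 200)
plot(m, exponential(m), type = "l", xlab = "yh(x)", ylab = "loss")
lines(m, hinge(m)); lines(m, logistic(m)); lines(m, zero.one(m))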
Loss Functions for Classification
With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x))
Zero-one (0/1) loss
ℓ(y, f(x)) = 1(yh(x) < 0)
Hinge loss
ℓ(y, f(x)) = max(1 − yh(x), 0)
Logistic loss
ℓ(y, f(x)) = log(1 + exp(−yh(x)))
Exponential loss
ℓ(y, f(x)) = exp(−yh(x))
Figure: the zero-one, hinge, logistic, exponential and squared losses plotted as functions of the margin yh(x).
Loss Functions for Regression
With f : X −→ IR, and f ∈ H.
ℓ1 loss
ℓ(y, f(x)) = |y − f(x)|
ℓ2 loss
ℓ(y, f(x)) = |y − f(x)|2
ε-insensitive ℓ1 loss
ℓ(y, f(x)) = (|y − f(x)| − ε)+
ε-insensitive ℓ2 loss
ℓ(y, f(x)) = (|y − f(x)|² − ε)+
Figure: the absolute (ℓ1), squared (ℓ2), ε-insensitive ℓ1 and ε-insensitive ℓ2 losses plotted as functions of the residual y − f(x).
Function Class in Pattern Recognition
As stated earlier, trying to find f∗ is hopeless. One needs to select a
function space F ⊂ YX , and then choose the best estimator f+ from F,
i.e.,
f+ = arg inf_{f∈F} R(f),
so that
R(f+) = R+ = inf_{f∈F} R(f).
For the binary pattern recognition problem, one may consider finding the
best linear separating hyperplane, i.e.
F = { f : X → {−1, +1} | ∃ α0 ∈ R, α = (α1, · · · , αp)⊤ ∈ Rp such that
f(x) = sign(α⊤x + α0), ∀x ∈ X }.
Empirical Risk Minimization
Let D = (X1, Y1), · · · , (Xn, Yn) be an iid sample from P(x, y).
The empirical version of the risk functional is
Rn(f) = (1/n) Σ_{i=1}^n 1{Yi ≠ f(Xi)}
We therefore seek the best by empirical standard,
f̂ = arg min_{f∈F} (1/n) Σ_{i=1}^n 1{Yi ≠ f(Xi)}
Since it is impossible to search all possible functions, it is usually
crucial to choose the ”right” function space F.
Bias-Variance Trade-Off
In traditional statistical estimation, one needs to address at the very least
issues like: (a) the bias of the estimator; (b) the variance of the
estimator; (c) the consistency of the estimator. Recall from elementary
point estimation that, if θ is the true value of the parameter to be
estimated, and θ̂ is a point estimator of θ, then one can decompose the
total error as follows:
θ̂ − θ = (θ̂ − E[θ̂]) + (E[θ̂] − θ)   (2)
where θ̂ − E[θ̂] is the estimation error and E[θ̂] − θ is the bias.
Under the squared error loss, one seeks the θ̂ that minimizes the mean squared
error,
θ̂ = arg min_{θ̂∈Θ} E[(θ̂ − θ)²] = arg min_{θ̂∈Θ} MSE(θ̂),
rather than trying to find the minimum variance unbiased estimator
(MVUE).
Bias-Variance Trade-off
Clearly, the traditional so-called bias-variance decomposition of the MSE
reveals the need for a bias-variance trade-off. Indeed,
MSE(θ̂) = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)² = variance + bias².
If the estimator θ̂ were to be sought from all possible values of θ, then it
might make sense to hope for the MVUE. Unfortunately - and especially in
function estimation, as we argued earlier - there will be some bias,
so that the error one gets has a bias component along with the variance
component in the squared error loss case. If the bias is made too small, then an
estimator with a larger variance is obtained. Similarly, a small variance will
tend to come from estimators with a relatively large bias. The best
compromise is then to trade off bias and variance, which in functional
terms translates into a trade-off between approximation error and estimation
error.
Bias-Variance Trade-off
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a trade-off parameter such as λ or h; the curves show the squared bias, the variance and the true risk as the amount of smoothing varies. For small values the variability is too high; for large values the bias gets large.
Structural risk minimization principle
Since making the estimator of the function arbitrarily complex causes the
problems mentioned earlier, the intuition for a trade-off reveals that instead
of minimizing the empirical risk Rn(f) one should do the following:
Choose a collection of function spaces {Fk : k = 1, 2, · · · }, maybe a
collection of nested spaces (increasing in size)
Minimize the empirical risk in each class
Minimize the penalized empirical risk
min_k { min_{f∈Fk} Rn(f) + penalty(k, n) }
where penalty(k, n) gives preference to models with small estimation error.
It is important to note that penalty(k, n) measures the capacity of the
function class Fk. The widely used technique of regularization for solving
ill-posed problems is a particular instance of structural risk minimization.
Regularization for Complexity Control
Tikhonov's Variational Approach to Regularization [Tikhonov, 1963]
Find f that minimizes the functional
Rn^(reg)(f) = (1/n) Σ_{i=1}^n ℓ(yi, f(xi)) + λΩ(f)
where λ > 0 is some predefined constant.
Ivanov's Quasi-solution Approach to Regularization [Ivanov, 1962]
Find f that minimizes the functional
Rn(f) = (1/n) Σ_{i=1}^n ℓ(yi, f(xi))
subject to the constraint
Ω(f) ≤ C
where C > 0 is some predefined constant.
Regularization for Complexity Control
Phillips' Residual Approach to Regularization [Phillips, 1962]
Find f that minimizes the functional
Ω(f)
subject to the constraint
(1/n) Σ_{i=1}^n ℓ(yi, f(xi)) ≤ µ
where µ > 0 is some predefined constant.
In all the above, the functional Ω(f) is called the regularization functional.
Ω(f) is defined in such a way that it controls the complexity of the
function f. For instance,
Ω(f) = ‖f‖² = ∫_a^b (f″(t))² dt
is a regularization functional used in spline smoothing.
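As a concrete, hedged illustration (not from the slides), ridge regression is the familiar Tikhonov instance in which Ω reduces to the squared norm of the coefficient vector for linear f; a minimal base-R sketch on made-up data:

# Tikhonov-style regularization: ridge regression with penalty lambda * ||beta||^2
set.seed(42)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)                  # toy design matrix (assumed data)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)              # toy response
lambda <- 0.5                                    # illustrative choice of the trade-off constant
beta.ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
# lambda -> 0 recovers ordinary least squares; large lambda shrinks beta toward 0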
Support Vector Machines and the Hinge Loss
Let's consider h(x) = w⊤x + b, with w ∈ IRp and b ∈ IR, and the classifier
f(x) = sign(h(x)) = sign(w⊤x + b).
Recall the hinge loss defined as
ℓ(y, f(x)) = (1 − yh(x))+ = 0 if yh(x) ≥ 1 (confident correct prediction), and 1 − yh(x) if yh(x) < 1.
Figure: the hinge loss (1 − yh(x))+ plotted as a function of the margin yh(x).
Support Vector Machines and the Hinge Loss
The Support Vector Machine classifier can be formulated as
Minimize E(w, b) = (1/n) Σ_{i=1}^n (1 − yi(w⊤xi + b))+
subject to
‖w‖²₂ < τ.
This is equivalent, in regularized (Lagrangian) form, to
(ŵ, b̂) = arg min_{w, b} { (1/n) Σ_{i=1}^n (1 − yi(w⊤xi + b))+ + λ‖w‖²₂ }
The SVM linear binary classification estimator is given by
f̂n(x) = sign(ĥ(x)) = sign(ŵ⊤x + b̂)
where ŵ and b̂ are estimators of w and b respectively.
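In practice one rarely solves this optimization by hand; a hedged sketch using the kernlab package (cited on a later slide) on made-up two-class data might look as follows. The simulated data and the choice C = 1 are assumptions for illustration only.

library(kernlab)
set.seed(1)
# toy, roughly separable two-class data (assumed, for illustration only)
x <- rbind(matrix(rnorm(60, mean = -1), ncol = 2),
           matrix(rnorm(60, mean = +1), ncol = 2))
y <- factor(c(rep(-1, 30), rep(+1, 30)))
# linear SVM: C plays the role of the regularization trade-off
svm.fit <- ksvm(x, y, type = "C-svc", kernel = "vanilladot", C = 1)
table(predict(svm.fit, x), y)     # training confusion matrix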
Classification realized with Linear Boundary
SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1
Figure: Linear SVM classifier with a relatively small margin
Classification realized with Linear Boundary
SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1
Figure: Linear SVM classifier with a relatively large margin
SVM Learning via Quadratic Programming
When the decision boundary is nonlinear, the αi’s in the expression of
the support vector machine classifier ˆf are determined by solving the
following quadratic programming problem
Maximize E(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj K(xi, xj)
subject to
0 ≤ αi ≤ C (i = 1, · · · , n) and Σ_{i=1}^n αi yi = 0.
The above formulation is an instance of the general QP
Maximize −(1/2) α⊤Qα + 1⊤α
subject to
α⊤y = 0 and αi ∈ [0, C], ∀i ∈ [n],
where Q is the n × n matrix with entries Qij = yi yj K(xi, xj).
SVM Learning via Quadratic Programming in R
The quadratic programming problem
Maximize −(1/2) α⊤Qα + 1⊤α
subject to α⊤y = 0 and αi ∈ [0, C], ∀i ∈ [n]. This is equivalent to
Minimize (1/2) α⊤Qα − 1⊤α
subject to α⊤y = 0 and αi ∈ [0, C], ∀i ∈ [n].
This is solved with the R package kernlab via the function ipop(), which handles problems of the form
Minimize c⊤α + (1/2) α⊤Hα
subject to b ≤ Aα ≤ b + r and l ≤ α ≤ u.
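A minimal sketch of how the SVM dual could be passed to ipop() (the toy data, kernel and parameter choices below are assumptions, not from the slides): here H = Q, c = −1, the single equality constraint y⊤α = 0 is encoded with b = r = 0, and the box is [0, C].

library(kernlab)
set.seed(2)
n <- 40
x <- rbind(matrix(rnorm(n, -1), ncol = 2), matrix(rnorm(n, +1), ncol = 2))
y <- c(rep(-1, n / 2), rep(+1, n / 2))
C <- 1
K <- as.matrix(kernelMatrix(rbfdot(sigma = 1), x))   # Gram matrix K(xi, xj)
Q <- (y %*% t(y)) * K                                # Q_ij = y_i y_j K(x_i, x_j)
sol <- ipop(c = rep(-1, n), H = Q,                   # minimize -1'a + (1/2) a'Qa
            A = t(y), b = 0, r = 0,                  # equality constraint y'a = 0
            l = rep(0, n), u = rep(C, n))            # box constraints 0 <= a_i <= C
alpha <- primal(sol)                                 # estimated dual coefficients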
Support Vector Machines and Kernels
As a result of the kernelization, the SVM classifier delivers for each x,
the estimated response
f̂n(x) = sign( Σ_{j=1}^{|s|} α̂_{sj} y_{sj} K(x_{sj}, x) + b̂ )
where sj ∈ {1, 2, · · · , n}, s = {s1, s2, · · · , s|s|} and |s| ≪ n.
The kernel K(·, ·) is a bivariate function K : X × X −→ IR+ such
that, given xl, xm ∈ X, the value of
K(xl, xm) = ⟨Φ(xl), Φ(xm)⟩ = Φ(xl)⊤Φ(xm)
represents the similarity between xl and xm, and corresponds to an
implicit inner product in some feature space Z of dimension higher
than dim(X), where the decision boundary is conveniently a large
margin separating hyperplane.
Trick: There is never any need in practice to explicitly manipulate
the higher dimensional feature mapping Φ : X −→ Z.
Classification realized with Nonlinear Boundary
SVM Optimal Separating and Margin Hyperplanes
Figure: Nonlinear SVM classifier with a relatively small margin
Interplay between the aspects of statistical learning
Statistical Consistency
Definition: Let θ̂n be an estimator of some scalar quantity θ based
on an i.i.d. sample X1, X2, · · · , Xn from the distribution with
parameter θ. Then, θ̂n is said to be a consistent estimator of θ if θ̂n
converges in probability to θ, i.e.,
θ̂n →P θ as n → ∞.
In other words, θ̂n is a consistent estimator of θ if, ∀ǫ > 0,
lim_{n→∞} Pr[|θ̂n − θ| > ǫ] = 0.
It turns out that for unbiased estimators θ̂n, consistency is
straightforward, as a direct consequence of a basic probabilistic
inequality like Chebyshev's inequality. However, for biased
estimators, one has to be more careful.
A Basic Important Inequality
(Bienaymé-Chebyshev's inequality) Let X be a random variable with finite
mean µX = E[X], i.e. |E[X]| < +∞, and finite variance σ²X = V(X), i.e.,
V(X) < +∞. Then, ∀ǫ > 0,
Pr[|X − E[X]| > ǫ] ≤ V(X)/ǫ².
It is therefore easy to see here that, with unbiased θ̂n, one has E[θ̂n] = θ,
and the result is immediate. For the sake of clarity, let's recall here the
elementary weak law of large numbers.
Weak Law of Large Numbers
Let X be a random variable with finite mean µX = E[X], i.e.
|E[X]| < +∞, and finite variance σ²X = V(X), i.e., V(X) < +∞. Let
X1, X2, · · · , Xn be a random sample of n observations drawn
independently from the distribution of X, so that for i = 1, · · · , n, we
have E[Xi] = µ and V[Xi] = σ². Let X̄n be the sample mean, i.e.,
X̄n = (1/n)(X1 + X2 + · · · + Xn) = (1/n) Σ_{i=1}^n Xi.
Then, clearly, E[X̄n] = µ, and, ∀ǫ > 0,
lim_{n→∞} Pr[|X̄n − µ| > ǫ] = 0.   (3)
This essentially expresses the fact that the empirical mean X̄n converges
in probability to the theoretical mean µ in the limit of very large samples.
Weak Law of Large Numbers
We therefore have
X̄n →P µ as n → ∞.
With µX̄ = E[X̄n] = µ and σ²X̄ = σ²/n, one applies the
Bienaymé-Chebyshev inequality and gets: ∀ǫ > 0,
Pr[|X̄n − µ| > ǫ] ≤ σ²/(nǫ²),   (4)
which, by inversion, is the same as
|X̄n − µ| < √(σ²/(nδ))   (5)
with probability at least 1 − δ.
Why is all the above of any interest to statistical learning theory?
Weak Law of Large Numbers
Why is all the above of any interest to statistical learning theory?
Equation (3) states the much needed consistency of X̄n as an
estimator of µ.
Equation (4), by showing the dependence of the bound on n and ǫ, helps assess
the rate at which X̄n converges to µ.
Equation (5), by providing a confidence interval, helps compute bounds
on the unknown true mean µ as a function of the empirical mean X̄n
and the confidence level 1 − δ.
Finally, how does one go about constructing estimators with all the above
properties?
Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
Theoretical Aspects of Statistical Learning
For binary classification using the so-called 0/1 loss function, the
Vapnik-Chervonenkis inequality takes the form
P( sup_{f∈F} |R̂n(f) − R(f)| > ε ) ≤ 8 S(F, n) e^{−nε²/32}   (6)
which can also be expressed in terms of expectation as
E[ sup_{f∈F} |R̂n(f) − R(f)| ] ≤ 2 √( (log S(F, n) + log 2)/n )   (7)
The quantity S(F, n) plays an important role in the VC theory and
will be explored in greater detail later.
Note that these bounds, including the one presented earlier in the VC
fundamental machine learning theorem, are not asymptotic bounds.
They hold for any n.
The bounds are nice and easy if h or S(F, n) is known.
Unfortunately the bound may exceed 1, making it useless.
Components of Statistical Machine Learning
Interestingly, all those 4 components of classical estimation theory, will be
encountered again in statistical learning theory. Essentially, the 4
components of statistical learning theory consist of finding the answers to
the following questions:
(a) What are the necessary and sufficient conditions for the
consistency of a learning process based on the ERM principle? This
leads to the Theory of consistency of learning processes.
(b) How fast is the rate of convergence of the learning process? This
leads to the Nonasymptotic theory of the rate of convergence of
learning processes;
(c) How can one control the rate of convergence (the generalization
ability) of the learning process? This leads to the Theory of
controlling the generalization ability of learning processes;
(d) How can one construct algorithms that can control the
generalization ability of the learning process? This leads to the Theory of
constructing learning algorithms.
Error Decomposition revisited
A reasoning on error decomposition and consistency of estimators along
with rates, bounds and algorithms applies to function spaces: indeed, the
difference between the true risk R(fn) associated with fn and the overall
minimum risk R∗ can be decomposed to explore in greater details the
source of error in the function estimation process:
R(fn) − R∗ = [R(fn) − R(f+)] + [R(f+) − R∗]   (8)
where R(fn) − R(f+) is the estimation error and R(f+) − R∗ is the approximation error.
A reasoning similar to bias-variance trade-off and consistency can be
made, with the added complication brought by the need to distinguish
between the true risk functional and the empirical risk functional, and also
the need to assess both pointwise and uniform behaviors. In
a sense, one needs to generalize the decomposition and the law of large
numbers to function spaces.
Approximation-Estimation Trade-Off
Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h; the curves show the squared bias, the variance and the true risk as the amount of smoothing varies. For small values the variability is too high; for large values the bias gets large.
Consistency of the Empirical Risk Minimization principle
The ERM principle is consistent if it provides a sequence of functions
ˆfn, n = 1, 2, · · · for which both the expected risk R(fn) and the
empirical risk Rn(fn) converge to the minimal possible value of the
risk R(f+) in the function class under consideration, i.e.,
R(f̂n) →P inf_{f∈F} R(f) = R(f+) as n → ∞,
and
Rn(f̂n) →P inf_{f∈F} R(f) = R(f+) as n → ∞.
Vapnik discusses the details of this theorem at length, and extends
the exploration to include the difference between what he calls trivial
consistency and non-trivial consistency.
Consistency of the Empirical Risk Minimization principle
To better understand consistency in function spaces, consider the
sequence of random variables
ξn = sup_{f∈F} (R(f) − Rn(f)),   (9)
and consider studying
lim_{n→∞} P( sup_{f∈F} (R(f) − Rn(f)) > ε ) = 0, ∀ε > 0.
Vapnik shows that the sequence of the means of the random variable
ξn converges to zero as the number n of observations increases.
He also remarks that the sequence of random variables ξn converges
in probability to zero if the set of functions F, contains a finite
number m of elements. We will show that later in the case of pattern
recognition.
Consistency of the Empirical Risk Minimization principle
It remains then to describe the properties of the set of functions F,
and probability measure P(x, y) under which the sequence of random
variables ξn converges in probability to zero.
lim_{n→∞} P( sup_{f∈F} [R(f) − Rn(f)] > ε or sup_{f∈F} [Rn(f) − R(f)] > ε ) = 0.
Recall that Rn(f) is the realized disagreement between classifier f
and the truth about the label y of x based on information contained
in the sample D.
It is easy to see that, for a given (fixed) function (classifier) f,
E[Rn(f)] = R(f).   (10)
Note that while this pointwise unbiasedness of the empirical risk is a
good bottomline property to have, it is not enough. More is needed,
as the comparison is against R(f+) or, even better yet, R(f∗).
Consistency of the Empirical Risk
Remember that the goal of statistical function estimation is to devise
a technique (strategy) that chooses from the function class F, the
one function whose true risk is as close as possible to the lowest risk
in class F.
The question arises: since one cannot calculate the true error, how
can one devise a learning strategy for choosing classifiers based on it?
Tentative answer: At least devise strategies that yield functions for
which the upper bound on the theoretical risk is as tight as possible,
so that one can make confidence statements of the form:
With probability 1 − δ over an i.i.d. draw of some sample according
to the distribution P, the expected future error rate of some classifier
is bounded by a function g(δ, error rate on sample) of δ and the error
rate on sample.
Pr[ TestError ≤ TrainError + φ(n, δ, κ(F)) ] ≥ 1 − δ
Foundation Result in Statistical Learning Theory
Theorem: (Vapnik and Chervonenkis, 1971) Let F be a class of
functions implementing some learning machines, and let ζ = VCdim(F) be
the VC dimension of F. Let the theoretical and the empirical risks be
defined as earlier and consider any data distribution in the population of
interest. Then ∀f ∈ F, the prediction error (theoretical risk) is bounded by
R(f) ≤ R̂n(f) + √( [ζ(log(2n/ζ) + 1) − log(η/4)] / n )   (11)
with probability at least 1 − η, or
Pr[ TestError ≤ TrainError + √( [ζ(log(2n/ζ) + 1) − log(η/4)] / n ) ] ≥ 1 − η
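To get a feel for how this bound behaves, here is a small R sketch (an illustration added here, not from the slides) that evaluates the VC confidence term for chosen values of the VC dimension ζ and the confidence level η; the values ζ = 10, ζ = 50 and η = 0.05 are arbitrary choices.

# VC confidence term: sqrt((zeta*(log(2n/zeta)+1) - log(eta/4)) / n)
vc.term <- function(n, zeta, eta = 0.05) {
  sqrt((zeta * (log(2 * n / zeta) + 1) - log(eta / 4)) / n)
}
n <- seq(100, 10000, by = 100)
plot(n, vc.term(n, zeta = 10), type = "l",
     xlab = "sample size n", ylab = "VC confidence term")
lines(n, vc.term(n, zeta = 50), lty = 2)   # larger VC dimension => looser bound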
Optimism of the Training Error
Figure: expected training error and expected test error as functions of model complexity.
Bounds on the Generalization Error
For instance, using Chebyshev’s inequality and the fact that
E[Rn(f)] = R(f), it is easy to see that, for given classifier f and a sample
D = {(x1, y1), · · · , (xn, yn)},
Pr[|Rn(f) − R(f)| > ǫ] ≤ R(f)(1 − R(f)) / (nǫ²).
To estimate the true but unknown error R(f) with a probability of at least
1 − δ, it makes sense to use inversion, i.e., set
δ = R(f)(1 − R(f)) / (nǫ²), so that ǫ = √( R(f)(1 − R(f)) / (nδ) ).
Owing to the fact that max_{R(f)∈[0,1]} R(f)(1 − R(f)) = 1/4, we have
√( R(f)(1 − R(f)) / (nδ) ) < √( 1/(4nδ) ) = (1/(4nδ))^{1/2}.
Bounds on the Generalization Error
Based on Chebyshev’s inequality, for a given classifier f, with a
probability of at least 1 − δ, the bound on the difference between the
true risk R(f) and the empirical risk Rn(f) is given by
|Rn(f) − R(f)| < (1/(4nδ))^{1/2}.
Recall that one of the goals of statistical learning theory is to assess
the rate of convergence of the empirical risk to the true risk, which
translates into assessing how tight the corresponding bounds on the
true risk are.
In fact, many bounds turn out to be so loose as to become useless. The
above Chebyshev-based bound is not a good one, at least compared to
bounds obtained using the so-called Hoeffding's inequality.
Bounds on the Generalization Error
Theorem: (Hoeffding's inequality) Let Z1, Z2, · · · , Zn be a collection
of i.i.d random variables with Zi ∈ [a, b]. Then, ∀ǫ > 0,
Pr[ |(1/n) Σ_{i=1}^n Zi − E[Z]| > ǫ ] ≤ 2 exp( −2nǫ² / (b − a)² )
Corollary: (Hoeffding's inequality for sample proportions) Let
Z1, Z2, · · · , Zn be a collection of i.i.d random variables from a
Bernoulli distribution with "success" probability p. Let
pn = (1/n) Σ_{i=1}^n Zi. Clearly, pn ∈ [0, 1] and E[pn] = p.
Therefore, as a direct consequence of the above theorem, we have,
∀ǫ > 0,
Pr[|pn − p| > ǫ] ≤ 2 exp(−2nǫ²)
Bounds on the Generalization Error
So we have, ∀ǫ > 0,
Pr[|pn − p| > ǫ] ≤ 2 exp(−2nǫ²)
Now, setting δ = 2 exp(−2ǫ²n), it is straightforward to see that the
Hoeffding-based 1 − δ level confidence bound on the difference
between R(f) and Rn(f) for a fixed classifier f is given by
|Rn(f) − R(f)| < ( ln(2/δ) / (2n) )^{1/2}.
Which of the two bounds is tighter? Clearly, we need to find out
which of ln(2/δ) or 1/(2δ) is larger. This is the same as comparing
exp(1/(2δ)) and 2/δ, which in turn means comparing a^{2/δ} and 2/δ,
where a = exp(1/4). For the small values of δ typically used (e.g.,
δ = 0.01 or 0.05), a^{2/δ} > 2/δ, so that Hoeffding's bound is tighter.
The graphs also confirm this.
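The comparison shown in the figure on the next slide can be reproduced with a few lines of R (a sketch added here, not from the slides; the grid of sample sizes is an arbitrary choice):

# Chebyshev vs Hoeffding confidence bounds for a proportion, as functions of n
cheby <- function(n, delta) sqrt(1 / (4 * n * delta))
hoeff <- function(n, delta) sqrt(log(2 / delta) / (2 * n))
n <- seq(100, 12000, by = 100)
plot(n, cheby(n, 0.01), type = "l", xlab = "n = sample size", ylab = "bound")
lines(n, hoeff(n, 0.01), lty = 2)   # the Hoeffding curve lies below for delta = 0.01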
Bounds on the Generalization Error
Figure: Chernoff/Hoeffding versus Chebyshev bounds for proportions as a function of the sample size n, for δ = 0.01 (left panel) and δ = 0.05 (right panel).
Beyond Chernoff and Hoeffding
In all the above, we only addressed pointwise convergence of
Rn(f) to R(f), i.e., for a fixed machine f ∈ F, we studied the
convergence of Rn(f) to R(f).
Needless to say, pointwise convergence is of very little use here.
A more interesting issue to address is uniform convergence. That is,
for all machines f ∈ F, determine the necessary and sufficient
conditions for the convergence of
sup_{f∈F} |Rn(f) − R(f)| to 0.
Clearly, such a study extends the Law of Large Numbers to function
spaces, thereby providing tools for the construction of bounds on the
theoretical errors of learning machines.
Beyond Chernoff and Hoeffding
Since uniform convergence requires the consideration of the entirety
of the function space of interest, care needs to be taken regarding the
dimensionality of the function space.
Uniform convergence will prove substantially easier to handle for finite
function classes than for infinite dimensional function spaces.
Indeed, for infinite dimensional spaces, one will need to introduce
concepts such as the capacity of the function space, measured through
devices such as the VC dimension and covering numbers.
Beyond Chernov and Hoeffding
Theorem: If Rn(f) and R(f) are close for all f ∈ F, i.e., ∀ǫ > 0,
sup_{f∈F} |Rn(f) − R(f)| ≤ ǫ,
then
R(fn) − R(f+) ≤ 2ǫ.
Proof: Recall that we defined fn as the best function yielded by
the empirical risk Rn(f) in the function class F. Recall also that Rn(fn)
can be made as small as possible, as we saw earlier. Therefore, with f+
being the minimizer of the true risk in class F, we always have
Rn(f+) − Rn(fn) ≥ 0.
As a result,
R(fn) = R(fn) − R(f+) + R(f+)
      ≤ Rn(f+) − Rn(fn) + R(fn) − R(f+) + R(f+)
      ≤ 2 sup_{f∈F} |R(f) − Rn(f)| + R(f+)
Beyond Chernoff and Hoeffding
Continuing from the chain of inequalities on the previous slide,
R(fn) ≤ 2 sup_{f∈F} |R(f) − Rn(f)| + R(f+).
Consequently,
R(fn) − R(f+) ≤ 2 sup_{f∈F} |R(f) − Rn(f)|,
as required.
Beyond Chernoff and Hoeffding
Corollary: A direct consequence of the above theorem is the following:
For a given machine f ∈ F,
R(f) ≤ Rn(f) + ( ln(2/δ) / (2n) )^{1/2}
with probability at least 1 − δ, ∀δ > 0.
If the function class F is finite, i.e.
F = {f1, f2, · · · , fm},
where m = |F| = #F = the number of functions in the class F, then it
can be shown that, for all f ∈ F,
R(f) ≤ Rn(f) + ( (ln m + ln(2/δ)) / (2n) )^{1/2}
with probability at least 1 − δ, ∀δ > 0.
Beyond Chernoff and Hoeffding
It can also be shown that
R(f̂n) ≤ Rn(f+) + 2 ( (ln m + ln(2/δ)) / (2n) )^{1/2}   (12)
with probability at least 1 − δ, ∀δ > 0, where as before
f+ = arg inf_{f∈F} R(f) and f̂n = arg min_{f∈F} Rn(f).
Equation (12) is of foundational importance, because it reveals clearly
that the size of the function class controls the uniform bound on the
crucial generalization error: indeed, if the size m of the function class
F increases, then the complexity term increases while R(f+) decreases,
so that the trade-off between the two is controlled by the size m of
the function class.
Vapnik-Chervonenkis Dimension
Definition: (Shattering) Let X ≠ ∅ be any non-empty domain. Let
F ⊆ 2X be any non-empty class of functions having X as their
domain. Let S ⊆ X be any finite subset of the domain X. Then S is
said to be shattered by F iff
{S ∩ f | f ∈ F} = 2S
In other words, F shatters S if any subset of S can be obtained by
intersecting S with some set from F.
Example: A class F ⊆ 2X of classifiers is said to shatter a set
x1, x2, · · · , xn of n points, if, for any possible configuration of labels
y1, y2, · · · , yn, we can find a classifier f ∈ F that reproduces those
labels.
Vapnik-Chervonenkis Dimension
Definition: (VC-dimension) Let X ≠ ∅ be any non-empty learning
domain. Let F ⊆ 2X be any non-empty class of functions having X
as their domain. Let S ⊆ X be any finite subset of the domain X.
The VC dimension of F is the cardinality of the largest finite set
S ⊆ X that is shattered by F, i.e.
VCdim(F) := max{ |S| : S is shattered by F }
Note: If arbitrarily large finite sets are shattered by F, then
VCdim(F) = ∞. In other words, if no finite bound on the cardinality
of the sets shattered by F can be found, then VCdim(F) = ∞.
Example: The VC dimension of a class F ⊆ 2X of classifiers is the
largest number of points that F can shatter.
Vapnik-Chervonenkis Dimension
Remarks: If V Cdim(F) = d, then there exists a finite set S ⊆ X
such that |S| = d and S is shattered by F. Importantly, every set
S ⊆ X such that |S| > d is not shattered by F. Clearly, we do not
expect to learn anything until we have at least d training points.
Intuitively, this means that an infinite VC dimension is not desirable
as it could imply the impossibility to learn the concept underlying any
data from the population under consideration. However, a finite VC
dimension does not guarantee the learnability of the concept
underlying any data from the population under consideration either.
Fact: Let F be any finite function (concept) class. Then, since it
requires 2^d distinct concepts to shatter a set of cardinality d, no set of
cardinality greater than log2 |F| can be shattered. Therefore, log2 |F| is
always an upper bound for the VC dimension of finite concept classes.
Vapnik-Chervonenkis Dimension
To gain insights into the central concept of VC dimension, we herein
consider a few examples of practical interest for which the VC
dimension can be found.
VC dimension of the space of separating hyperplanes: Let
X = Rp be the domain for the binary Y ∈ {−1, +1} classification
task, and consider using hyperplanes to separate the points of X. Let
F denote the class of all such separating hyperplanes. Then,
V Cdim(F) = p + 1
Intuitively, the following pictures for the case of X = R2 help see why
the VC dimension is p + 1.
Foundation Result in Statistical Learning Theory
Theorem: (Vapnik and Chervonenkis, 1971) Let F be a class of
functions implementing some learning machines, and let ζ = VCdim(F) be
the VC dimension of F. Let the theoretical and the empirical risks be
defined as earlier and consider any data distribution in the population of
interest. Then ∀f ∈ F, the prediction error (theoretical risk) is bounded by
R(f) ≤ R̂n(f) + √( [ζ(log(2n/ζ) + 1) − log(η/4)] / n )   (13)
with probability at least 1 − η, or
Pr[ TestError ≤ TrainError + √( [ζ(log(2n/ζ) + 1) − log(η/4)] / n ) ] ≥ 1 − η
Confidence Interval for a proportion
p ∈ [ p̂ − z_{α/2} √(p̂(1−p̂)/n), p̂ + z_{α/2} √(p̂(1−p̂)/n) ] with 100(1 − α)% confidence
Figure: 100 simulated 95% confidence intervals for p; here 98 intervals out of 100 contain p, that is 98%.
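The coverage experiments shown in these figures are easy to reproduce; a minimal sketch in R (added here, not from the slides; the true p, sample size and number of replications are assumed values):

# simulate coverage of the normal-approximation CI for a proportion
set.seed(3)
p.true <- 0.4; n <- 200; B <- 100; alpha <- 0.05
covered <- replicate(B, {
  p.hat <- mean(rbinom(n, 1, p.true))
  half  <- qnorm(1 - alpha / 2) * sqrt(p.hat * (1 - p.hat) / n)
  (p.true >= p.hat - half) & (p.true <= p.hat + half)
})
mean(covered)   # should be close to the nominal 95% coverage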
Confidence Interval for a proportion
p ∈ [ p̂ − z_{α/2} √(p̂(1−p̂)/n), p̂ + z_{α/2} √(p̂(1−p̂)/n) ] with 100(1 − α)% confidence
Figure: 100 simulated 95% confidence intervals for p; here 94 intervals out of 100 contain p, that is 94%.
Confidence Interval for a proportion
p ∈ [ p̂ − z_{α/2} √(p̂(1−p̂)/n), p̂ + z_{α/2} √(p̂(1−p̂)/n) ] with 100(1 − α)% confidence
Figure: 100 simulated 90% confidence intervals for p; here 92 intervals out of 100 contain p, that is 92%.
Confidence Interval for a population mean
µ ∈ [ x̄ − z_{α/2} √(σ²/n), x̄ + z_{α/2} √(σ²/n) ] with 100(1 − α)% confidence
Figure: 100 simulated 95% confidence intervals for µ; here 98 intervals out of 100 contain µ, that is 98%.
Confidence Interval for a population mean
µ ∈ [ x̄ − z_{α/2} √(σ²/n), x̄ + z_{α/2} √(σ²/n) ] with 100(1 − α)% confidence
Figure: 100 simulated 85% confidence intervals for µ; here 90 intervals out of 100 contain µ, that is 90%.
Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
VC Bound for Separating Hyperplanes
Let L represent the function class of binary classifiers in q dimensions, i.e.
L = { f : ∃w ∈ IRq, w0 ∈ IR, f(x) = sign(w⊤x + w0), ∀x ∈ X },
then VCdim(L) = h = q + 1.
With labels taken from {−1, +1}, and using the 0/1 loss function, we
have the fundamental theorem from Vapnik and Chervonenkis, namely:
For every f ∈ L, and n > h, with probability at least 1 − η, we have
R(f) ≤ Rn(f) + √( [h(log(2n/h) + 1) + log(4/η)] / n )
The above result holds true for LDA.
Appeal of the VC Bound
Note: One of the greatest appeals of the VC bound is that, though
applicable to function classes of infinite dimension, it preserves the
same intuitive form as the bound derived for finite dimensional F.
Essentially, using the VC dimension concept, the number L of
possible labeling configurations obtainable from F with
VCdim(F) = ζ over 2n points verifies
L ≤ (en/ζ)^ζ.   (14)
The VC bound is simply obtained by replacing log |F| with log L in the
expression of the risk bound for finite dimensional F.
The most important part of the above theorem is the fact that the
generalization ability of a learning machine depends on both the
empirical risk and the complexity of the class of functions used, which
is measured here by the VC dimension (Vapnik and Chervonenkis, 1971).
Appeal of the VC Bound
Also, the bounds offered here are distribution-free, since no
assumption is made about the distribution of the population.
The details of this important result will be discussed again in chapter
6 and 7, where we will present other measures of the capacity of a
class of functions.
Remark: From the expression of the VC Bound, it is clear that an
intuitively appealing way to improve the predictive performance
(reduce prediction error) of a class of machines is to achieve a
trade-off (compromise) between small VC dimension and
minimization of the empirical risk.
At first, it may seem as if the VC dimension is acting in a way similar to
the number of parameters, since it serves as a measure of the
complexity of F. In this spirit, the following is a possible guiding
principle.
Appeal of the VC Bound
At first, it may seem as if the VC dimension is acting in a way similar
to the number of parameters, since it serves as a measure of the
complexity of F. In this spirit, the following is a possible guiding
principle.
Intuition: One should seek to construct a classifier that
achieves the best trade-off (balance, compromise) between the
complexity of the function class - measured by the VC dimension - and the fit
to the training data - measured by the empirical risk.
Now equipped with this sound theoretical foundation, one can
then go on to the implementation of various learning machines.
We shall use R to discover some of the most commonly used learning
machines.
Regression Analysis
Regression Analysis Dataset
rating complaints privileges learning raises critical advance
43 51 30 39 61 92 45
63 64 51 54 63 73 47
71 70 68 69 76 86 48
61 63 45 47 54 84 35
81 78 56 66 71 83 47
43 55 49 44 54 49 34
58 67 42 56 66 68 35
71 75 50 55 70 66 41
72 82 72 67 71 83 31
67 61 45 47 62 80 41
64 53 53 58 58 67 34
67 60 47 39 59 74 41
69 62 57 42 55 63 25
What are the factors that drive the rating of companies?
head(attitude)
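A natural first step (added here as a hedged sketch, not part of the slides) is to regress the rating on all the other survey variables with a plain linear model:

# multiple linear regression of overall rating on the other survey variables
data(attitude)                         # built-in R data set shown in the table above
fit <- lm(rating ~ ., data = attitude)
summary(fit)                           # which explanatory variables appear to drive the rating?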
Regression Analysis Dataset
lcavol lweight age lbph svi lcp gleason pgg45 lpsa
-0.58 2.77 50 -1.39 0 -1.39 6 0 -0.43
-0.99 3.32 58 -1.39 0 -1.39 6 0 -0.16
-0.51 2.69 74 -1.39 0 -1.39 7 20 -0.16
-1.20 3.28 58 -1.39 0 -1.39 6 0 -0.16
0.75 3.43 62 -1.39 0 -1.39 6 0 0.37
-1.05 3.23 50 -1.39 0 -1.39 6 0 0.77
0.74 3.47 64 0.62 0 -1.39 6 0 0.77
0.69 3.54 58 1.54 0 -1.39 6 0 0.85
-0.78 3.54 47 -1.39 0 -1.39 6 0 1.05
0.22 3.24 63 -1.39 0 -1.39 6 0 1.05
0.25 3.60 65 -1.39 0 -1.39 6 0 1.27
-1.35 3.60 63 1.27 0 -1.39 6 0 1.27
What are the factors responsible for prostate cancer?
library(ElemStatLearn); data(prostate)
Motivating Example Regression Analysis
Consider the univariate function f ∈ C([0, 2π]) given by
f(x) = (π/2)x + (3π/4) cos((π/2)(1 + x))   (15)
Simulation of an artificial iid data set D = {(xi, yi), i = 1, · · · , n}, with
n = 99 and σ = π/3:
xi ∈ [0, 2π] drawn deterministically and equally spaced
Yi = f(xi) + εi, with εi iid ∼ N(0, σ²)
The R code is
n <- 99
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
x <- seq(0, 2*pi, length=n)
y <- f(x) + rnorm(n, 0, pi/3)
Motivating Example Regression Analysis
Figure: noisy data generated from the function in (15).
Question: What is the best hypothesis space to learn the underlying
function?
Bias-Variance Tradeoff in Action
Figure: fits of increasing complexity to the simulated data: (a) underfit, (b) optimal fit.
Introduction to Regression Analysis
We have xi = (xi1, · · · , xip)⊤ ∈ IRp and Yi ∈ IR, and the data set
D = {(x1, Y1), (x2, Y2), · · · , (xn, Yn)}
We assume that the response variable Yi is related to the explanatory
vector xi through a function f via the model
Yi = f(xi) + ξi, i = 1, · · · , n   (16)
The explanatory vectors xi are fixed (non-random)
The regression function f : IRp → IR is unknown
The error terms ξi are iid Gaussian, i.e. ξi iid ∼ N(0, σ²)
Goal: We seek to estimate the function f using the data in D.
Formulation of the regression problem
Let X and Y be two random variables such that
E[Y ] = µ and E[Y²] < ∞.
Goal: Find the best predictor f(X) of Y given X.
Important Questions
How does one define ”best”?
Is the very best attainable in practice?
What does the function f look like? (Function class)
How do we select a candidate from the chosen class of functions?
How hard is it computationally to find the desired function?
Loss functions
1 When f(X) is used to predict Y , a loss is incurred.
Question: How is such a loss quantified?
Answer: Define a suitable loss function.
2 Common loss functions in regression
Squared error loss or (ℓ2) loss
ℓ(Y, f(X)) = (Y − f(X))2
ℓ2 is by far the most used (prevalent) because of its differentiability.
Unfortunately, not very robust to outliers.
Absolute error loss or (ℓ1) loss
ℓ(Y, f(X)) = |Y − f(X)|
ℓ1 is more robust to outliers, but not differentiable at zero.
3 Note that ℓ(Y, f(X)) is a random variable.
Risk Functionals and Cost Functions
1 Definition of a risk functional,
R(f) = E[ℓ(Y, f(X))] = ∫_{X×Y} ℓ(y, f(x)) pXY(x, y) dx dy
R(f) is the expected loss over all pairs of the cross space X × Y.
2 Ideally, one seeks the best out of all possible functions, i.e.,
f∗(X) = arg min_f R(f) = arg min_f E[ℓ(Y, f(X))]
f∗(·) is such that
R∗ = R(f∗) = min_f R(f)
3 This ideal function cannot be found in practice because the
distributions are unknown, which makes it impossible to form an
expression for R(f).
Cost Functions and Risk Functionals
Theorem: Under regularity conditions,
f∗
(X) = E[Y |X] = arg min
f
E[(Y − f(X))2
]
Under the squared error loss, the optimal function f∗ that yields the
best prediction of Y given X is no other than the expected value of
Y given X.
Since we know neither pXY (x, y) nor pX(x), the conditional
expectation
E[Y |X] =
Y
ypY |X(y)(dy) =
Y
y
pXY (x, y)
pX(x)
dy
cannot be directly computed.
DZ Ý Data Science MMW 2018 October 10, 2018 95 / 127
Empirical Risk Minimization
Let D = {(X1, Y1), (X2, Y2), · · · , (Xn, Yn)} represent an iid sample
The empirical version of the risk functional is
R̂(f) = MSE(f) = Ê[(Y − f(X))²] = (1/n) Σ_{i=1}^n (Yi − f(Xi))²
It turns out that R̂(f) provides an unbiased estimator of R(f).
We therefore seek the best by empirical standard,
f̂∗(X) = arg min_f MSE(f) = arg min_f (1/n) Σ_{i=1}^n (Yi − f(Xi))²
Since it is impossible to search all possible functions, it is usually
crucial to choose the "right" function space.
Function spaces
For the function estimation task for instance, one could assume that the
input space X is a closed and bounded interval of IR, i.e. X = [a, b], and
then consider estimating the dependencies between x and y from within
the space F all bounded functions on X = [a, b], i.e.,
F = {f : X → IR| ∃B ≥ 0, such that |f(x)| ≤ B, for all x ∈ X}.
One could even be more specific and make the functions of the above F
continuous, so that the space to search becomes
F = {f : [a, b] → IR| f is continuous} = C([a, b]),
which is the well-known space of all continuous functions on a closed and
bounded interval [a, b]. This is indeed a very important function space.
Space of Univariate Polynomials
In fact, polynomial regression consists of searching from a function space
that is a subspace of C([a, b]). In other words, when we are doing the very
common polynomial regression, we are searching the space
P([a, b]) = {f ∈ C([a, b])| f is a polynomial with real coefficients} .
It is interesting to note that Weierstrass did prove that P([a, b]) is dense in
C([a, b]). One often considers the space of all polynomials of degree at most p, i.e.,
F = Pp([a, b]) = { f ∈ C([a, b]) | ∃β0, β1, · · · , βp ∈ IR such that
f(x) = Σ_{j=0}^p βj x^j, ∀x ∈ [a, b] }
Empirical Risk Minimization in F
Having chosen a class F of functions, we can now seek
f̂(X) = arg min_{f∈F} MSE(f) = arg min_{f∈F} (1/n) Σ_{i=1}^n (Yi − f(Xi))²
We are seeking the best function in the function space chosen.
For instance, if the function space is the space of all polynomials of
degree p on some interval [a, b], finding f̂ boils down to estimating the
coefficients of the polynomial using the data, namely
f̂(x) = β̂0 + β̂1 x + β̂2 x² + · · · + β̂p x^p
where, using β = (β0, β1, · · · , βp)⊤, we have
β̂ = arg min_{β∈IRp+1} { (1/n) Σ_{i=1}^n ( Yi − Σ_{j=0}^p βj x_i^j )² }
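In R, this empirical risk minimization over Pp([a, b]) is just a polynomial least-squares fit; a brief sketch (added here, not from the slides) on the simulated data of equation (15), where the degree 5 below is an arbitrary illustrative choice:

# ERM over the space of polynomials of degree p, via least squares
n <- 99
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
x <- seq(0, 2*pi, length = n)
y <- f(x) + rnorm(n, 0, pi/3)
p <- 5                                      # assumed degree for illustration
fit <- lm(y ~ poly(x, degree = p, raw = TRUE))
mean(residuals(fit)^2)                      # empirical risk (training MSE) of f-hat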
Important Aspects of Statistical Learning
It is very tempting at first to use the data at hand to find/build the f̂
that makes MSE(f̂) the smallest. For instance, the higher the value
of p, the smaller MSE(f̂) will get.
The estimate β̂ = (β̂0, β̂1, · · · , β̂p)⊤ of β = (β0, β1, · · · , βp)⊤ is a
random variable, and as a result the estimate
f̂(x) = β̂0 + β̂1x + β̂2x² + · · · + β̂p x^p of f(x) is also a random
variable.
Since f̂(x) is a random variable, we must compute important aspects
like its bias B[f̂(x)] = E[f̂(x)] − f(x) and its variance V[f̂(x)].
We have a dilemma: If we make f̂ complex (large p), we make the
bias small but the variance is increased. If we make f̂ simple (small
p), we make the bias large but the variance is decreased.
Most of Modern Statistical Learning is rich with model selection
techniques that seek to achieve a trade-off between bias and variance
to get the optimal model. Principle of parsimony (sparsity),
Ockham’s razor principle.
Effect of Bias-Variance Dilemma of Prediction
Optimal Prediction achieved at the point of bias-variance trade-off.
Theoretical Aspects of Statistical Regression Learning
Just like we have a VC bound for classification, there is one for
regression, i.e. when Y = IR and
R̂n(f) = (1/n) Σ_{i=1}^n |yi − f(xi)|² = squared error loss
Indeed, for every f ∈ F, with probability at least 1 − η, we have
R(f) ≤ R̂n(f) / (1 − c√δ)+
where
δ = (a/n) [ v + v log(bn/v) − log(η/4) ]
Note once again, as before, that these bounds are not asymptotic.
Unfortunately these bounds are known to be very loose in practice.
The pitfalls of memorization and overfitting
The trouble - or limitation - with naively using a criterion on the whole
sample lies in the fact that, given a sample (x1, y1), (x2, y2), · · · , (xn, yn), the
function f̂memory defined by
f̂memory(xi) = yi, i = 1, · · · , n
always achieves the best performance, since MSE(f̂memory) = 0, which is the
minimum achievable.
Where does the limitation of f̂memory come from? Well, f̂memory
does not really learn the dependency between X and Y. While it may
capture some of it, it also grabs a lot of the noise in the data, and ends
up overfitting the data. As a result of not really learning the structure of the
relationship between X and Y and merely memorizing the present
sample values, f̂memory will predict very poorly when presented
with observations that were not in the sample.
Training Set Test Set Split
Splitting the data into a training set and a test set: It makes
sense to judge models (functions), not on how they perform on in-sample
observations, but instead on how they perform on out-of-sample
cases. Given a collection D = {(x1, y1), (x2, y2), · · · , (xn, yn)} of pairs,
Randomly split D into a training set of size ntr and a test set of size
nte, such that ntr + nte = n
Training set:
Tr = {(x1^(tr), y1^(tr)), (x2^(tr), y2^(tr)), · · · , (x_{ntr}^(tr), y_{ntr}^(tr))}
Test set:
Te = {(x1^(te), y1^(te)), (x2^(te), y2^(te)), · · · , (x_{nte}^(te), y_{nte}^(te))}
DZ Ý Data Science MMW 2018 October 10, 2018 104 / 127
Training Set Test Set Split
For each function class F (linear models, nonparametrics, etc ...),
find the best function in its class based on the training set Tr.
For all the estimated functions ˆf1, ˆf2, · · · , ˆfm, compute the training
error

MSETr(ˆfj) = (1/ntr) Σ_{i=1}^{ntr} ( yi^(tr) − ˆfj(xi^(tr)) )²

For all the estimated functions ˆf1, ˆf2, · · · , ˆfm, compute the test error

MSETe(ˆfj) = (1/nte) Σ_{i=1}^{nte} ( yi^(te) − ˆfj(xi^(te)) )²

Compute the averages of both MSETr and MSETe over many random
splits of the data, and tabulate (if necessary) those averages (see the
R sketch after this slide).
Select ˆfj∗ such that

mean[MSETe(ˆfj∗)] ≤ mean[MSETe(ˆfj)], j = 1, 2, · · · , m, j ≠ j∗
DZ Ý Data Science MMW 2018 October 10, 2018 105 / 127
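A minimal sketch of this repeated train/test splitting procedure in R; the
competing function classes (polynomials of a few degrees), the number of
splits, and the simulated data are illustrative assumptions, with ntr = 2n/3
as in the table a few slides below.
## Minimal sketch: model selection by repeated random train/test splits
set.seed(3)
f <- function(x){-x + sqrt(2)*sin(pi^(3/2)*x^2)}
n <- 300
dat <- data.frame(x = runif(n, -1, 1))
dat$y <- f(dat$x) + rnorm(n, 0, 0.3)
R <- 10                                         # number of random splits
degrees <- c(2, 5, 10)                          # competing polynomial classes
mse.tr <- mse.te <- matrix(NA, R, length(degrees))
for (r in 1:R) {
  idx <- sample(n, size = round(2*n/3))         # ntr = 2n/3
  for (j in seq_along(degrees)) {
    fit <- lm(y ~ poly(x, degrees[j]), data = dat[idx, ])
    mse.tr[r, j] <- mean(residuals(fit)^2)
    mse.te[r, j] <- mean((dat$y[-idx] - predict(fit, newdata = dat[-idx, ]))^2)
  }
}
colMeans(mse.tr); colMeans(mse.te)   # select the class with the smallest average test MSE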
Computational Comparisons
Ideally, we would like to compare the true theoretical performances
measured by the risk functional

R(f) = E[ℓ(Y, f(X))] = ∫_{X×Y} ℓ(y, f(x)) dP(x, y).     (17)

Instead, we build the estimators using other optimality criteria, and
then compare their predictive performances using the average test
error AVTE(·), namely

AVTE(f) = (1/R) Σ_{r=1}^{R} [ (1/m) Σ_{t=1}^{m} ℓ( y_{it}^(r), fr(x_{it}^(r)) ) ],     (18)

where fr(·) is the r-th realization of the estimator f(·) built using the
training portion of the r-th split of D into training set and test set, and
(x_{it}^(r), y_{it}^(r)) is the t-th observation from the test set at the r-th
random replication of the split of D.
DZ Ý Data Science MMW 2018 October 10, 2018 106 / 127
Learning Machines when n ≪ p
Machines inherently designed to handle p larger than n problems
Classification and Regression Trees
Support Vector Machines
Relevance Vector Machines (n < 500)
Gaussian Process Learning Machines (n < 500)
k-Nearest Neighbors Learning Machines (Watch for the curse of
dimensionality)
Kernel Machines in general
Machines that cannot inherently handle p larger than n problems, but
can do so if regularized with suitable constraints
Multiple Linear Regression Models
Generalized Linear Models
Discriminant Analysis
Ensemble Learning Machines
Random Subspace Learning Ensembles (Random Forest)
Boosting and its extensions
DZ Ý Data Science MMW 2018 October 10, 2018 107 / 127
Motivating Example: Regression Analysis
Consider the univariate function f ∈ C([−1, +1]) given by

f(x) = −x + √2 sin(π^(3/2) x²)     (19)

Simulate an artificial iid data set D = {(xi, yi), i = 1, · · · , n}, with
n = 99 and σ = 3/10:
xi ∈ [−1, +1] drawn deterministically and equally spaced
Yi = f(xi) + εi, with εi iid ∼ N(0, σ²)
The R code is
n <- 99
f <- function(x){-x + sqrt(2)*sin(pi^(3/2)*x^2)}
x <- seq(-1, +1, length=n)
y <- f(x) + rnorm(n, 0, 3/10)
DZ Ý Data Science MMW 2018 October 10, 2018 108 / 127
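For completeness, here is a minimal sketch of how a fit with confidence and
prediction bands, like the figure on the next slide, could be produced from
these data; the polynomial degree and the grid of new points are illustrative
assumptions, not taken from the deck.
## Minimal sketch: orthogonal polynomial fit with confidence and prediction bands
fit  <- lm(y ~ poly(x, 9))                     # degree 9 is an assumed choice
xnew <- seq(-1, 1, length = 200)
cb <- predict(fit, newdata = data.frame(x = xnew), interval = "confidence")
pb <- predict(fit, newdata = data.frame(x = xnew), interval = "prediction")
plot(x, y, pch = "1", col = "red", xlab = "xnew", ylab = "f(xnew)")
lines(xnew, cb[, "fit"])                                              # fitted curve
lines(xnew, cb[, "lwr"], lty = 2); lines(xnew, cb[, "upr"], lty = 2)  # confidence band
lines(xnew, pb[, "lwr"], lty = 3); lines(xnew, pb[, "upr"], lty = 3)  # prediction band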
Estimation Error and Prediction Error
[Plot: "Predictive Regression with confidence and prediction bands" -- the data
points (plotted as "1"s), the fitted curve, and the lower/upper confidence and
prediction bands, shown against xnew over [−1, +1].]
Figure: Simple Orthogonal Polynomial Regression with both confidence bands
and prediction bands on the test set. The true function is
f(x) = −x + √2 sin(π^(3/2) x²) for x ∈ [−1, +1].
DZ Ý Data Science MMW 2018 October 10, 2018 109 / 127
Training Error and Test Error
Table: Average Training Error and Average Test Error over m = 10 random splits
of n = 300 observations generated from a population with true function
f(x) = −x + √2 sin(π^(3/2) x²) for x ∈ [−1, +1]. The noise variance in this
case is σ² = 0.3². Each split has ntr = 2n/3.

                             Approximating Function Class
                           Poly      SVM      RVM      GPR
Average Training Error     0.0998    0.0335   0.0295   0.1861
Average Test Error         0.3866    0.1465   0.1481   0.1556
DZ Ý Data Science MMW 2018 October 10, 2018 110 / 127
Unsupervised Learning
DZ Ý Data Science MMW 2018 October 10, 2018 111 / 127
Finding Patterns in Job Sector Allocations in Europe
Example 1: Consider the following portion of observations on job sector
distribution in Europe in the 1990s.
Agr Min Man PS Con SI Fin SPS TC
Italy 15.9 0.6 27.6 0.5 10.0 18.1 1.6 20.1 5.7
Poland 31.1 2.5 25.7 0.9 8.4 7.5 0.9 16.1 6.9
Rumania 34.7 2.1 30.1 0.6 8.7 5.9 1.3 11.7 5.0
USSR 23.7 1.4 25.8 0.6 9.2 6.1 0.5 23.6 9.3
Denmark 9.2 0.1 21.8 0.6 8.3 14.6 6.5 32.2 7.1
France 10.8 0.8 27.5 0.9 8.9 16.8 6.0 22.6 5.7
1 Can European countries be divided into meaningful groups (clusters)?
2 How many concepts? How many clusters (groups) of countries?
Analogy: Clustering in such an example can be thought of as unsupervised
classification (pattern recognition)
DZ Ý Data Science MMW 2018 October 10, 2018 112 / 127
Hierarchical Clustering for European Job Sector Data
One solution: Mining Job Sectors in Europe in the 1990s via Hierarchical
Clustering with Manhattan distance and ward linkage.
[Dendrogram: "Cluster Dendrogram" with the European countries as leaves, produced
by hclust (*, "ward") applied to dist(europe, method = "manhattan"); the vertical
axis is the merge height.]
How does the distance affect the
clustering?
How does the linkage affect the
clustering?
What makes a clustering
satisfactory? How does one compare
two clusterings?
Some interesting tasks:
1 Investigate different distances with same linkage
2 Investigate different linkages with same distance
DZ Ý Data Science MMW 2018 October 10, 2018 113 / 127
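A minimal sketch of the clustering behind this dendrogram, assuming a data frame
europe whose rows are the countries and whose columns are the nine job-sector
percentages (the object name is taken from the dist() call shown in the figure);
the number of clusters cut from the tree is an arbitrary choice.
## Minimal sketch: hierarchical clustering of the European job sector data
d  <- dist(europe, method = "manhattan")   # Manhattan distance between countries
hc <- hclust(d, method = "ward.D")         # Ward linkage ("ward" in older versions of R)
plot(hc, main = "Cluster Dendrogram")
groups <- cutree(hc, k = 4)                # cut the tree into k = 4 groups of countries
table(groups)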
Extracting Patterns of Voting in America
Example 2: Percentages of Votes given to the U. S. Republican
Presidential Candidate - 1856-1976.
X1856 X1860 X1864 X1868 X1900 X1904 X1908
Alabama NA NA NA 51.44 34.67 20.65 24.38
Arkansas NA NA NA 53.73 35.04 40.25 37.31
California 18.77 32.96 58.63 50.24 54.48 61.90 55.46
Colorado NA NA NA NA 42.04 55.27 46.88
Connecticut 53.18 53.86 51.38 51.54 56.94 58.13 59.43
Delaware 2.11 23.71 48.20 40.98 53.65 54.04 52.09
Florida NA NA NA NA 19.03 21.15 21.58
1 Can the states be grouped into clusters of republican-ness?
2 How do missing values influence the clustering?
Analogy: Again, clustering in such an example can be thought of as
unsupervised classification (pattern recognition)
DZ Ý Data Science MMW 2018 October 10, 2018 114 / 127
Example: Image Denoising
For an observed image of size r × c, posit the model
y = Wx + z. (20)
The original image is represented by a p × 1 vector, which makes the
matrix W a matrix of dimension q × p, where q = rc. We therefore have
z⊤ = (z1, · · · , zq) ∈ IRq,   x⊤ = (x1, · · · , xp) ∈ IRp,   y⊤ = (y1, · · · , yq) ∈ IRq.
DZ Ý Data Science MMW 2018 October 10, 2018 115 / 127
Example: Image Denoising
Expression of the solution: If E(x) = ‖y − Wx‖² + λ‖x‖₁ is our
objective function to be minimized, and ˆx is a point at which the
minimum is achieved, then we will write

ˆx = arg min_{x ∈ IRp} { ‖y − Wx‖² + λ‖x‖₁ }.     (21)
DZ Ý Data Science MMW 2018 October 10, 2018 116 / 127
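As an illustration (not from the slides), a problem of this ℓ1-penalized least
squares form can be solved in R with the glmnet package, which also appears on
the computing-tools slide later; the dimensions, the random dictionary W, and
the value of λ below are purely assumed, and glmnet scales the squared-error
term by 1/(2n) rather than using it raw.
## Minimal sketch: solving min ||y - Wx||^2 + lambda*||x||_1 with glmnet
library(glmnet)
set.seed(4)
q <- 200; p <- 500
W <- matrix(rnorm(q*p), q, p)                  # assumed dictionary / degradation matrix
x.true <- c(rnorm(10), rep(0, p - 10))         # sparse "original image" vector
y <- as.vector(W %*% x.true + rnorm(q, 0, 0.1))
fit   <- glmnet(W, y, alpha = 1, intercept = FALSE)   # lasso path over lambda
x.hat <- as.vector(coef(fit, s = 0.1))[-1]            # solution at lambda = 0.1, intercept dropped
sum(x.hat != 0)                                       # number of nonzero entries recovered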
Example: Recommender System
Consider a system in which n customers have access to p different
products, like movies, clothing, rental cars, etc ...
A1 A2 · · · Aj · · · Ap
C1
C2
...
Ci w(i, j)
...
Cn
Table: Typical Representation of a Recommender System
The value of w(i, j) is the rating assigned to article Aj by customer Ci.
DZ Ý Data Science MMW 2018 October 10, 2018 117 / 127
Example: Recommender System
The main ingredient in Recommender Systems is the matrix

W = [ w11  w12  · · ·  w1j  · · ·  w1p
      w21  w22  · · ·  w2j  · · ·  w2p
      ...
      wi1  wi2  · · ·  wij  · · ·  wip
      ...
      wn1  wn2  · · ·  wnj  · · ·  wnp ]

The matrix W is typically very (and I mean very) sparse (see the sketch
after this slide), which makes sense because people can only consume so
many articles, and there are articles some people will never consume
even if suggested.
DZ Ý Data Science MMW 2018 October 10, 2018 118 / 127
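A minimal sketch of such a sparse rating matrix in R using the Matrix package;
the numbers of customers, articles, observed ratings, and the 1-5 rating scale
are all made-up illustrative values.
## Minimal sketch: a sparse customer-by-article rating matrix W
library(Matrix)
set.seed(5)
n <- 1000; p <- 500                        # customers x articles
obs <- 5000                                # only a few thousand ratings observed
cells <- sample(n*p, obs)                  # distinct (customer, article) cells
i <- ((cells - 1) %% n) + 1
j <- ((cells - 1) %/% n) + 1
W <- sparseMatrix(i = i, j = j, x = sample(1:5, obs, replace = TRUE), dims = c(n, p))
nnzero(W) / (n*p)                          # fraction of entries actually filled in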
Time Series and State Space
Models
DZ Ý Data Science MMW 2018 October 10, 2018 119 / 127
IID Process and White Noise
[Two simulated series of length 200, each plotted against time with plot(ts(·)).]
(Left) White noise process (Right) IID Process.
What is the statistical model (if any) underlying the data?
DZ Ý Data Science MMW 2018 October 10, 2018 120 / 127
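A minimal sketch of how series like these can be simulated and plotted in R;
the length n = 200 and the Gaussian distribution are assumed choices (note that
a Gaussian white noise sequence is in particular also an iid sequence).
## Minimal sketch: simulating an iid / white noise series of length 200
set.seed(6)
W <- rnorm(200)            # iid N(0, 1) sequence
X <- rnorm(200)            # Gaussian white noise (here also iid)
par(mfrow = c(1, 2))
plot(ts(X)); plot(ts(W))
par(mfrow = c(1, 1))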
Random Walk in 1d and 2d
[Left: a 1-d random walk of length 200 plotted against time. Right: the trajectory
(Xt, Yt) of a 2-d random walk in the plane.]
(Left) Random walk in 1 dimension (Right) Random Walk in 2
dimensions (plane).
What is the statistical model (if any) underlying the data?
DZ Ý Data Science MMW 2018 October 10, 2018 121 / 127
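A minimal sketch of simulating these random walks in R by cumulatively summing
iid Gaussian steps; the number of steps is an assumed choice.
## Minimal sketch: random walks in one and two dimensions
set.seed(7)
X  <- cumsum(rnorm(200))                               # 1-d random walk
Xt <- cumsum(rnorm(200)); Yt <- cumsum(rnorm(200))     # coordinates of a 2-d random walk
par(mfrow = c(1, 2))
plot(ts(X))
plot(Xt, Yt, type = "l", xlab = "Xt", ylab = "Yt")
par(mfrow = c(1, 1))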
Real life Time Series: Air Passengers and Sunspots
[Left: the monthly AirPassengers series, 1950-1960. Right: a sunspots series of
about 100 observations.]
(Left) Number of airline passengers (Right) Longstanding Sunspots
data.
What is the statistical model (if any) underlying the data?
DZ Ý Data Science MMW 2018 October 10, 2018 122 / 127
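Both kinds of series ship with base R, so they can be inspected directly; the
exact sunspot series used on the slide is not specified, and sunspot.year is
used below merely as a stand-in.
## Minimal sketch: plotting the classic airline passengers and sunspot series
par(mfrow = c(1, 2))
plot(AirPassengers)    # monthly international airline passenger counts
plot(sunspot.year)     # yearly sunspot numbers (stand-in for the slide's series)
par(mfrow = c(1, 1))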
Existing Computing Tools
Do the following
install.packages('ctv')
library(ctv)
install.views('MachineLearning')
install.views('HighPerformanceComputing')
install.views('TimeSeries')
install.views('Bayesian')
R packages for big data
library(biglm)
library(foreach)
library(glmnet)
library(kernlab)
library(randomForest)
library(ada)
library(audio)
library(rpart)
DZ Ý Data Science MMW 2018 October 10, 2018 123 / 127
Some Remarks and Recommendations
Applications: Sharpen your intuition and your common sense by
questioning things, reading about interesting open applied problems,
and attempting to solve as many problems as possible.
Methodology: Read and learn about the fundamentals of statistical
estimation and inference, get acquainted with the most commonly
used methods and techniques, and consistently ask yourself and
others what the natural extensions of the techniques could be.
Computation: Learn and master at least two programming languages.
I strongly recommend getting acquainted with R
http://www.r-project.org
Theory: "Nothing is more practical than a good theory" (Vladimir N.
Vapnik). When it comes to data mining, machine learning, and
predictive analytics, those who truly understand the inner workings of
algorithms and methods always solve problems better.
DZ Ý Data Science MMW 2018 October 10, 2018 124 / 127
Machine Learning CRAN Task View in R
Let’s visit the website where most of the R community goes
http://www.r-project.org
Let’s install some packages and get started
install.packages('ctv')
library(ctv)
install.views('MachineLearning')
install.views('HighPerformanceComputing')
install.views('Bayesian')
install.views('Robust')
Let’s load a couple of packages and explore
library(e1071)
library(MASS)
library(kernlab)
DZ Ý Data Science MMW 2018 October 10, 2018 125 / 127
Clarke, B. and Fokoué, E. and
Zhang, H. (2009). Principles and
Theory for Data Mining and
Machine Learning. Springer
Verlag, New York, (ISBN:
978-0-387-98134-5), (2009)
DZ Ý Data Science MMW 2018 October 10, 2018 126 / 127
References
Clarke, B., Fokoué, E. and Zhang, H. H. (2009). Principles and
Theory for Data Mining and Machine Learning. Springer Verlag,
New York. ISBN: 978-0-387-98134-5.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An
Introduction to Statistical Learning with Applications in R.
Springer, New York. e-ISBN: 978-1-4614-7138-7.
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.
ISBN: 978-0-471-03003-4.
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory.
Springer. ISBN: 978-1-4757-3264-1.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The
Elements of Statistical Learning: Data Mining, Inference, and
Prediction. 2nd Edition. Springer. ISBN: 978-0-387-84858-7.
DZ Ý Data Science MMW 2018 October 10, 2018 127 / 127

2018 Modern Math Workshop - Foundations of Statistical Learning Theory: Quintessential Pillar of Modern Data Science - Ernest Fokoué, October 10, 2018

  • 1. Foundations of Statistical Learning Theory Quintessential Pillar of Modern Data Science Ernest Fokou´e DZ Ý School of Mathematical Sciences Rochester Institute of Technology Rochester, New York, USA Delivered by invitation of the Statistical and Mathematical Sciences Institute (SAMSI) Modern Mathematics Workshop (MMW 2018) San Antonio, Texas, USA October 10, 2018 DZ Ý Data Science MMW 2018 October 10, 2018 1 / 127
  • 2. Acknowledgments I wish to express my grateful thanks and sincere gratitude to the Director of SAMSI, Prof. Dr. David Banks, for kindly inviting me and granting me the golden opportunity to present at the 2018 Modern Mathematics Workshop in San Antonio. I hope and pray that my modest contribution will inspire and empower all the attendees of my mini course. DZ Ý Data Science MMW 2018 October 10, 2018 2 / 127
  • 3. Basic Introduction to Statistical Machine Learning Roadmap: This lecture will provide you with the basic elements of an introduction to the foundational concepts of statistical machine learning. Among other things, we’ll touch on foundational concepts such as: Input space, output space, function space, hypothesis space, loss function, risk functional, theoretical risk, empirical risk, Bayes Risk, training set, test set, model complexity, generalization error, approximation error, Estimation error, bounds on the generalization error, regularization, etc ... Relevant websites http://www.econ.upf.edu/∼lugosi/mlss slt.pdf https://en.wikipedia.org/wiki/Reproducing kernel Hilbert space Kernel Machines http://www.kernel-machines.org/ R Software project website: http://www.r-project.org DZ Ý Data Science MMW 2018 October 10, 2018 3 / 127
  • 4. Traditional Pattern Recognition Applications Statistical Machine Learning Methods and Techniques have been successfully applied to wide variety of important fields. Amongst others: 1 The famous and somewhat ubiquitous handwritten digit recognition. This data set is also known as MNIST, and is usually the first task in some Data Analytics competitions. This data set is from USPS and was first made popular by Yann LeCun, the co-inventor of Deep Learning. 2 More recently, text mining and specific topic of text categorization/classification has made successful use of statistical machine learning. 3 Credit Scoring is another application that has been connected with statistical machine learning 4 Disease diagnostics has also been tackled using statistical machine learning Other applications include: audio processing, speaker recognition and speaker identification. DZ Ý Data Science MMW 2018 October 10, 2018 4 / 127
  • 5. Handwritten Digit Recognition Handwritten digit recognition is a fascinating problem that captured the attention of the machine learning and neural network community for many years, and has remained a benchmark problem in the field. 0 1:28 1 1:28 2 1:28 3 1:28 4 1:28 5 1:28 6 1:28 7 1:28 8 1:28 9 1:28 DZ Ý Data Science MMW 2018 October 10, 2018 5 / 127
  • 6. Handwritten Digit Recognition Below is a portion of the benchmark training set Note: The challenge here is building classification techniques that accurately classify handwritten digits taken from the test set. DZ Ý Data Science MMW 2018 October 10, 2018 6 / 127
  • 7. Handwritten Digit Recognition Below is a portion of the benchmark training set Note: The challenge here is building classification techniques that accurately classify handwritten digits taken from the test set. DZ Ý Data Science MMW 2018 October 10, 2018 7 / 127
  • 8. Handwritten Digit Recognition Below is a portion of the benchmark training set Note: The challenge here is building classification techniques that accurately classify handwritten digits taken from the test set. DZ Ý Data Science MMW 2018 October 10, 2018 8 / 127
  • 9. Pattern Recognition (Classification) data set pregnant glucose pressure triceps insulin mass pedigree age diabetes 6 148 72 35 0 33.60 0.63 50 pos 1 85 66 29 0 26.60 0.35 31 neg 8 183 64 0 0 23.30 0.67 32 pos 1 89 66 23 94 28.10 0.17 21 neg 0 137 40 35 168 43.10 2.29 33 pos 5 116 74 0 0 25.60 0.20 30 neg 3 78 50 32 88 31.00 0.25 26 pos 10 115 0 0 0 35.30 0.13 29 neg 2 197 70 45 543 30.50 0.16 53 pos 8 125 96 0 0 0.00 0.23 54 pos 4 110 92 0 0 37.60 0.19 30 neg 10 168 74 0 0 38.00 0.54 34 pos 10 139 80 0 0 27.10 1.44 57 neg 1 189 60 23 846 30.10 0.40 59 pos What are the factors responsible for diabetes? library(mlbench); data(PimaIndiansDiabetes) DZ Ý Data Science MMW 2018 October 10, 2018 9 / 127
  • 10. Pattern Recognition (Classification) data set X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 Class 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 n 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 n 0 0 1 0 0 1 0 1 0 0 0 0 0 0 1 n 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 ei 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 ie 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 ie 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 ei 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 n 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 n 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 n 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 ie 1 0 0 1 0 0 0 0 1 0 1 0 0 0 0 n 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 ie What are the indicators that control of promoter genes in the DNA? library(mlbench); data(DNA) DZ Ý Data Science MMW 2018 October 10, 2018 10 / 127
  • 11. Pattern Recognition (Classification) data set Class X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 x11 X12 X13 X14 + g c c t t c t c c a a a a c + a t g c a a t t t t t t a g + c c g t t t a t t t t t t c + t c t c a a c g t a a c a c + t a g g c a c c c c a g g c + a t a t a a a a a a g t t c + c a a g g t a g a a t g c t + t t a g c g g a t c c t a c + c t g c a a t t t t t c t a + t g t a a a c t a a t g c c + c a c t a a t t t a t t c c + a g g g g c a a g g a g g a + c c a t c a a a a a a a t a + a t g c a t t t t t c c g c + t c a g a a a t a t t a t g What are the indicators that control of promoter genes in the DNA? library(kernlab); data(promotergene) DZ Ý Data Science MMW 2018 October 10, 2018 11 / 127
  • 12. Statistical Speaker Accent Recognition Consider Xi = (xi1, · · · , xip)⊤ ∈ Rp and Yi ∈ {−1, +1}, and the set D = (X1, Y1), (X2, Y2), · · · , (Xn, Yn) where Yi = +1 if person i is a Native US −1 if person i is a Non Native US and Xi = (xi1, · · · , xip)⊤ ∈ Rp is the time domain representation of his/her reading of an English sentence. The design matrix is X =         x11 x12 · · · · · · · · · · · · x1j · · · x1p ... ... ... ... · · · · · · · · · · · · ... xi1 xi2 · · · · · · · · · · · · xij · · · xip ... ... ... ... · · · · · · · · · · · · ... xn1 xn2 · · · · · · · · · · · · xnj · · · xnp         DZ Ý Data Science MMW 2018 October 10, 2018 12 / 127
  • 13. Statistical Speaker Accent Recognition Consider this design matrix X =         x11 x12 · · · · · · · · · · · · x1j · · · x1p ... ... ... ... · · · · · · · · · · · · ... xi1 xi2 · · · · · · · · · · · · xij · · · xip ... ... ... ... · · · · · · · · · · · · ... xn1 xn2 · · · · · · · · · · · · xnj · · · xnp         At RIT, we recently collected voices from n = 117 people. Each sentence required about 11 seconds to be read. At a sampling rate of 441000 Hz, each sentence requires a vector of dimension roughly p=540000 in the time domain. We therefore have a gravely underdetermined system with X ∈ IRn×p where n ≪ p. Here, n=117 and p=540000. DZ Ý Data Science MMW 2018 October 10, 2018 13 / 127
  • 14. Binary Classification in the Plane, X ⊂ R2 Given {(x1, y1), · · · , (xn, yn)}, with xi ∈ X ⊂ R2 and yi ∈ {−1, +1} −20 −10 0 10 −20−15−10−5051015 x1 x2 What is the ”best” classifier f∗ that separates the red from the green? DZ Ý Data Science MMW 2018 October 10, 2018 14 / 127
  • 15. Motivating Binary Classification in the Plane For the binary classification problem introduced earlier: – A collection {(x1, y1), · · · , (xn, yn)} of i.i.d. observations is given xi ∈ X ⊂ Rp , i = 1, · · · , n. X is the input space. yi ∈ {−1, +1}. Y = {−1, +1} is the output space. – What is the probability law that governs the (xi, yi)’s? – What is the functional relationship between x and y? Namely one considers mappings f : X → Y x → f(x), – What is the ”best” approach to determining from the available observations, the relationship f between x and y in such a way that, given a new (unseen) observation xnew, its class ynew can be predicted by f(xnew) as accurately and precisely as possible, that is, with the smallest possible discrepancy. DZ Ý Data Science MMW 2018 October 10, 2018 15 / 127
  • 16. Basic Remarks on Classification While some points clearly belong to one of the classes, there are other points that are either strangers in a foreign land, or are positioned in such a way that no automatic classification rule can clearly determine their class membership. One can construct a classification rule that puts all the points in their corresponding classes. Such a rule would prove disastrous in classifying new observations not present in the current collection of observations. Indeed, we have a collection of pairs (xi, yi) of observations coming from some unknown distribution P(x, y). DZ Ý Data Science MMW 2018 October 10, 2018 16 / 127
  • 17. Basic Remarks on Classification Finding an automatic classification rule that achieves the absolute very best on the present data is not enough since infinitely many more observations can be generated by P(x, y) for which good classification will be required. Even the universally best classifier will make mistakes. Of all the functions in YX , it is reasonable to assume that there is a function f∗ that maps any x ∈ X to its corresponding y ∈ Y, i.e., f∗ : X → Y x → f∗(x), with the minimum number of mistakes. DZ Ý Data Science MMW 2018 October 10, 2018 17 / 127
  • 18. Theoretical Risk Minimization Let f denote any generic function mapping an element x of X to its corresponding image f(x) in Y. Each time x is drawn from P(x), the disagreement between the image f(x) and the true image y is called the loss, denoted by ℓ(y, f(x)). The expected value of this loss function with respect to the distribution P(x, y) is called the risk functional of f. Generically, we shall denote the risk functional of f by R(f), so that R(f) = E[ℓ(Y, f(X))] = ℓ(y, f(x))dP(x, y). The best function f∗ over the space YX of all measurable functions from X to Y is therefore f∗ = arg inf f R(f), so that R(f∗ ) = R∗ = inf f R(f). DZ Ý Data Science MMW 2018 October 10, 2018 18 / 127
  • 19. On the need to reduce the search space Unfortunately, f∗ can only be found if P(x, y) is known. Therefore, since we do not know P(x, y) in practice, it is hopeless to determine f∗. Besides, trying to find f∗ without the knowledge of P(x, y) implies having to search the infinite dimensional function space YX of all mappings from X to Y, which is an ill-posed and computationally nasty problem. Throughout this lecture, we will seek to solve the more reasonable problem of choosing from a function space F ⊂ YX , the one function f· ∈ F that best estimates the dependencies between x and y. It is therefore important to define what is meant by best estimates. For that, the concepts of loss function and risk functional need to be define. DZ Ý Data Science MMW 2018 October 10, 2018 19 / 127
  • 20. Loss and Risk in Pattern Recognition For this classification/pattern recognition, the so-called 0-1 loss function defined below is used. More specifically, ℓ(y, f(x)) = 1{Y =f(X)} = 0 if y = f(x), 1 if y = f(x). (1) The corresponding risk functional is R(f) = ℓ(y, f(x))dP(x, y) = E 1{Y =f(X)} = Pr (X,Y )∼P [Y = f(X)]. The minimizer of the 0-1 risk functional over all possible classifiers is the so-called Bayes classifier which we shall denote here by f∗ given by f∗ = arg inf f Pr (X,Y )∼P [Y = f(X)] . Specifically, the Bayes’ classifier f∗ is given by the posterior probability of class membership, namely f∗ (x) = arg max y∈Y Pr[Y = y|x] . DZ Ý Data Science MMW 2018 October 10, 2018 20 / 127
  • 21. Bayes Learner for known situations If p(x|y = +1) = MVN(x, µ+1, Σ) and p(x|y = −1) = MVN(x, µ−1, Σ), the Bayes classifier f∗, the classifier that achieves the Bayes risk, coincides with the population Linear Discriminant Analysis (LDA), fLDA, which, for any new point x, yields the predicted class f∗ (x) = fLDA(x) = sign β0 + β⊤ x , where β = Σ−1 (µ+1 − µ−1), and β0 = − 1 2 (µ+1 + µ−1)⊤ Σ−1 (µ+1 − µ−1) + log π+1 π−1 , with π+1 = Pr[Y = +1] and π−1 = 1 − π+1 representing the prior probabilities of class membership. DZ Ý Data Science MMW 2018 October 10, 2018 21 / 127
  • 22. Bayes Risk for known situations î Bayes Risk in Binary Classification under Gaussian Class Conditional Densities with common covariance matrix: Let x = (x1, x2, · · · , xp)⊤ be a p-dimensional vector coming from either class +1 or class −1. Let f be a function (classifier) that seeks to map x to y ∈ {−1, +1} as accurately as possible. Let R∗ = min f {Pr[f(X) = Y ]} be the Bayes Risk, i.e. the smallest error rate among all possible f. If p(x|y = +1) = MVN(x, µ+1, Σ) and p(x|y = −1) = MVN(x, µ−1, Σ), then R∗ = R(f∗ ) = Φ(− √ ∆/2) = − √ ∆/2 −∞ 1 √ 2π e− 1 2 z2 dz, with ∆ = (µ+1 − µ−1)⊤ Σ−1 (µ+1 − µ−1). DZ Ý Data Science MMW 2018 October 10, 2018 22 / 127
  • 23. Loss Functions for Classification With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x)) Zero-one (0/1) loss ℓ(y, f(x)) = 1(y = f(x)) = 1(yh(x) < 0) Hinge loss ℓ(y, f(x)) = max(1 − yh(x), 0) = (1 − yh(x))+ Logistic loss ℓ(y, f(x)) = log(1 + exp(−yh(x))) Exponential loss ℓ(y, f(x)) = exp(−yh(x)) DZ Ý Data Science MMW 2018 October 10, 2018 23 / 127
  • 24. Loss Functions for Classification With f : X −→ {−1, +1}, and h ∈ H such that f(x) = sign(h(x)) Zero-one (0/1) loss ℓ(y, f(x)) = 1(yh(x) < 0) Hinge loss ℓ(y, f(x)) = max(1 − yh(x), 0) Logistic loss ℓ(y, f(x)) = log(1 + exp(−yh(x))) Exponential loss ℓ(y, f(x)) = exp(−yh(x)) −3 −2 −1 0 1 2 3 01234 yh(x) δ(yh(x)) hinge loss squared loss logistic loss exponential zero−one loss DZ Ý Data Science MMW 2018 October 10, 2018 24 / 127
  • 25. Loss Functions for Regression With f : X −→ IR, and f ∈ H. ℓ1 loss ℓ(y, f(x)) = |y − f(x)| ℓ2 loss ℓ(y, f(x)) = |y − f(x)|2 ε-insensitive ℓ1 loss ℓ(y, f(x)) = |y − f(x)| − ε ε-insensitive ℓ2 loss ℓ(y, f(x)) = |y − f(x)|2 − ε −3 −2 −1 0 1 2 3 0.00.51.01.52.0 y − f(x) l(y,f(x)) epsi−l1 loss epsi−l2 loss squared loss absolute loss DZ Ý Data Science MMW 2018 October 10, 2018 25 / 127
  • 26. Function Class in Pattern Recognition As stated earlier, trying to find f∗ is hopeless. One needs to select a function space F ⊂ YX , and then choose the best estimator f+ from F, i.e., f+ = arg inf f∈F R(f), so that R(f+ ) = R+ = inf f∈F R(f). For the binary pattern recognition problem, one may consider finding the best linear separating hyperplane, i.e. F = f : X → {−1, +1}| ∃α0 ∈ R, (α1, · · · , αp)⊤ = α ∈ Rp | f(x) = sign α⊤ x + α0 , ∀x ∈ X DZ Ý Data Science MMW 2018 October 10, 2018 26 / 127
  • 27. Empirical Risk Minimization Let D = (X1, Y1), · · · , (Xn, Yn) be an iid sample from P(x, y). The empirical version of the risk functional is R(f) = 1 n n i=1 1{Yi=f(Xi)} We therefore seek the best by empirical standard, f = arg min f∈F 1 n n i=1 1{Yi=f(Xi)} Since it is impossible to search all possible functions, it is usually crucial to choose the ”right” function space F. DZ Ý Data Science MMW 2018 October 10, 2018 27 / 127
  • 28. Bias-Variance Trade-Off In traditional statistical estimation, one needs to address at the very least issues like: (a) the Bias of the estimator; (b) the Variance of the estimator; (c) The consistency of the estimator; Recall from elementary point estimation that, if θ is the true value of the parameter to be estimated, and θ is a point estimator of θ, then one can decompose the total error as follows: θ − θ = θ − E[θ] Estimation error + E[θ] − θ Bias (2) Under the squared error loss, one seeks θ that minimizes the mean squared error, θ = arg min θ∈Θ E[(θ − θ)2 ] = arg min θ∈Θ MSE(θ), rather than trying to find the minimum variance unbiased estimator (MVUE). DZ Ý Data Science MMW 2018 October 10, 2018 28 / 127
  • 29. Bias-Variance Trade-off Clearly, the traditional so-called bias-variance decomposition of the MSE reveals the need for bias-variance trade-off. Indeed, MSE(θ) = E[(θ − θ)2 ] = E[(θ − E[θ])2 ] + E[(E[θ] − θ)2 ] = variance + bias2 If the estimator θ were to be sought from all possible value of θ, then it might make sense to hope for the MVUE. Unfortunately - an especially in function estimation as we clearly argued earlier - there will be some bias, so that the error one gets has a bias component along with the variance component in the squared error loss case. If the bias is too small, then an estimator with a larger variance is obtained. Similarly, a small variance will tend to come from estimators with a relatively large bias. The best compromise is then to trade-off bias and variance. Which is in functional terms translates into trade-off between approximation error and estimation error. DZ Ý Data Science MMW 2018 October 10, 2018 29 / 127
  • 30. Bias-Variance Trade-off Optimal Smoothing Less smoothing Bias squared True Risk More smoothing Variance Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h. For small values the variability is too high; for large values the bias gets large. DZ Ý Data Science MMW 2018 October 10, 2018 30 / 127
  • 31. Structural risk minimization principle Since making the estimator of the function arbitrarily complex causes the problems mentioned earlier, the intuition for a trade-off reveals that instead of minimizing the empirical risk Rn(f) one should do the following: Choose a collection of function spaces {Fk : k = 1, 2, · · · }, maybe a collection of nested spaces (increasing in size) Minimize the empirical risk in each class Minimize the penalized empirical risk min k min f∈Fk Rn(f) + penalty(k, n) where penalty(k, n) gives preference to models with small estimation error. It is important to note that penalty(k, n) measures the capacity of the function class Fk. The widely used technique of regularization for solving ill-posed problem is a particular instance of structural risk minimization. DZ Ý Data Science MMW 2018 October 10, 2018 31 / 127
  • 32. Regularization for Complexity Control Tikhonov’s Variation Approach to Regularization[Tikhonov, 1963] Find f that minimizes the functional R(reg) n (f) = 1 n n i=1 ℓ(yi, f(xi)) + λΩ(f) where λ > 0 is some predefined constant. Ivanov’s Quasi-solution Approach to Regularization[Ivanov, 1962] Find f that minimizes the functional Rn(f) = 1 n n i=1 ℓ(yi, f(xi)) subject to the constraint Ω(f) ≤ C where C > 0 is some predefined constant. DZ Ý Data Science MMW 2018 October 10, 2018 32 / 127
  • 33. Regularization for Complexity Control Philips’ Residual Approach to Regularization[Philips, 1962] Find f that minimizes the functional Ω(f) subject to the constraint 1 n n i=1 ℓ(yi, f(xi)) ≤ µ where µ > 0 is some predefined constant. In all the above, the functional Ω(f) is called the regularization functional. Ω(f) is defined in such a way that it controls the complexity of the function f. Ω(f) = f 2 = b a (f′′ (t))2 dt. is a regularization functional used in spline smoothing. DZ Ý Data Science MMW 2018 October 10, 2018 33 / 127
  • 34. Support Vector Machines and the Hinge Loss Let’s consider h(x) = w⊤x + b, w ∈ IRp , b ∈ IR and the classifier f(x) = sign(h(x)) = sign(w⊤ x + b). Recall the hinge loss defined as ℓ(y, f(x)) = (1−yh(x))+ = 0 if yh(x) > 0 correct prediction 1 − yh(x) if yh(x) < 0 wrong prediction −4 −2 0 2 4 012345 yf(x) hinge(y,f(x)) DZ Ý Data Science MMW 2018 October 10, 2018 34 / 127
  • 35. Support Vector Machines and the Hinge Loss The Support Vector Machine classifier can be formulated as Minimize E(w, b) = 1 n n i=1 (1 − yi(w⊤ xi + b))+ subject to w 2 2 < τ. Which is equivalent in regularized (lagrangian) form to (w, b) = arg min w∈Rq 1 n n i=1 (1 − yi(w⊤ xi + b))+ + λ w 2 2 The SVM linear binary classification estimator is given by fn(x) = sign(h(x)) = sign(w⊤ x + b) where w and b are estimators of w and b respectively. DZ Ý Data Science MMW 2018 October 10, 2018 35 / 127
  • 36. Classification realized with Linear Boundary SVM boundary: 3x+2y+1=0. Margins: 3x+2y+1=−+1 Figure: Linear SVM classifier with a relatively small margin DZ Ý Data Science MMW 2018 October 10, 2018 36 / 127
  • 37. Classification realized with Linear Boundary SVM boundary: 3x+2y+1=0. Margins: 3x+2y+1=−+1 Figure: Linear SVM classifier with a relatively large margin DZ Ý Data Science MMW 2018 October 10, 2018 37 / 127
  • 38. SVM Learning via Quadratic Programming When the decision boundary is nonlinear, the αi’s in the expression of the support vector machine classifier ˆf are determined by solving the following quadratic programming problem Maximize E(α) = n i=1 αi − 1 2 n i=1 n j=1 αiαjyiyjK(xi, xj). subject to 0 ≤ αi ≤ C (i = 1, · · · , n) and n i=1 αiyi = 0. The above formulation is an instance of the general QP Maximize − 1 2 α⊤ Qα + 1⊤ α subject to α⊤ y = 0 and αi ∈ [0, C], ∀i ∈ [n]. n×nDZ Ý Data Science MMW 2018 October 10, 2018 38 / 127
  • 39. SVM Learning via Quadratic Programming in R The quadratic programming problem Maximize − 1 2 α⊤ Qα + 1⊤ α subject to α⊤y = 0 and αi ∈ [0, C], ∀i ∈ [n]. is equivalent to Minimize 1 2 α⊤ Qα − 1⊤ α subject to α⊤y = 0 and αi ∈ [0, C], ∀i ∈ [n]. Which is solved with the R package kernlab via the function ipop() Minimize c⊤ α + 1 2 α⊤ Hα subject to b ≤ Aα ≤ b + r and l ≤ α ≤ u. DZ Ý Data Science MMW 2018 October 10, 2018 39 / 127
  • 40. Support Vector Machines and Kernels As a result of the kernelization, the SVM classifier delivers for each x the estimated response f̂n(x) = sign( Σ_{j=1}^{|s|} α̂_{sj} y_{sj} K(x_{sj}, x) + b̂ ), where sj ∈ {1, 2, · · · , n}, s = {s1, s2, · · · , s|s|} and |s| ≪ n. The kernel K(·, ·) is a bivariate function K : X × X −→ IR+ such that, given xl, xm ∈ X, the value K(xl, xm) = ⟨Φ(xl), Φ(xm)⟩ = Φ(xl)⊤Φ(xm) represents the similarity between xl and xm, and corresponds to an implicit inner product in some feature space Z of dimension higher than dim(X), where the decision boundary is conveniently a large margin separating hyperplane. Trick: There is never any need in practice to explicitly manipulate the higher dimensional feature mapping Φ : X −→ Z. DZ Ý Data Science MMW 2018 October 10, 2018 40 / 127
  • 41. Classification realized with Nonlinear Boundary SVM Optimal Separating and Margin Hyperplanes Figure: Nonlinear SVM classifier with a relatively small margin DZ Ý Data Science MMW 2018 October 10, 2018 41 / 127
  • 42. Interplay between the aspects of statistical learning DZ Ý Data Science MMW 2018 October 10, 2018 42 / 127
  • 43. Statistical Consistency Definition: Let θn be an estimator of some scalar quantity θ based on an i.i.d. sample X1, X2, · · · , Xn from the distribution with parameter θ. Then θn is said to be a consistent estimator of θ if θn converges in probability to θ, i.e., θn →P θ as n → ∞. In other words, θn is a consistent estimator of θ if, ∀ǫ > 0, lim_{n→∞} Pr(|θn − θ| > ǫ) = 0. It turns out that for unbiased estimators θn, consistency follows readily as a direct consequence of a basic probabilistic inequality like Chebyshev's inequality. However, for biased estimators, one has to be more careful. DZ Ý Data Science MMW 2018 October 10, 2018 43 / 127
  • 44. A Basic Important Inequality (Bienaymé-Chebyshev's inequality) Let X be a random variable with finite mean µX = E[X], i.e. |E[X]| < +∞, and finite variance σ²X = V(X), i.e., V(X) < +∞. Then, ∀ǫ > 0, Pr[|X − E[X]| > ǫ] ≤ V(X)/ǫ². It is therefore easy to see here that, with unbiased θn, one has E[θn] = θ, and the result is immediate. For the sake of clarity, let's recall here the elementary weak law of large numbers. DZ Ý Data Science MMW 2018 October 10, 2018 44 / 127
  • 45. Weak Law of Large Numbers Let X be a random variable with finite mean µX = E[X], i.e. |E[X]| < +∞, and finite variance σ²X = V(X), i.e., V(X) < +∞. Let X1, X2, · · · , Xn be a random sample of n observations drawn independently from the distribution of X, so that for i = 1, · · · , n, we have E[Xi] = µ and V[Xi] = σ². Let X̄n be the sample mean, i.e., X̄n = (1/n)(X1 + X2 + · · · + Xn) = (1/n) Σ_{i=1}^{n} Xi. Then, clearly, E[X̄n] = µ, and, ∀ǫ > 0, lim_{n→∞} Pr[|X̄n − µ| > ǫ] = 0. (3) This essentially expresses the fact that the empirical mean X̄n converges in probability to the theoretical mean µ in the limit of very large samples. DZ Ý Data Science MMW 2018 October 10, 2018 45 / 127
  • 46. Weak Law of Large Numbers We therefore have X̄n →P µ as n → ∞. With µX̄ = E[X̄n] = µ and σ²X̄ = σ²/n, one applies the Bienaymé-Chebyshev inequality and gets: ∀ǫ > 0, Pr[|X̄n − µ| > ǫ] ≤ σ²/(nǫ²), (4) which, by inversion, is the same as |X̄n − µ| < sqrt(σ²/(nδ)) (5) with probability at least 1 − δ. Why is all the above of any interest to statistical learning theory? DZ Ý Data Science MMW 2018 October 10, 2018 46 / 127
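  A quick numerical check of the inversion in (5) (a sketch added here; the Gaussian population and the values of n, δ are illustrative assumptions): simulate many samples and verify that |X̄n − µ| stays below sqrt(σ²/(nδ)) in at least a fraction 1 − δ of them.
# Sketch: empirical check of the Chebyshev-based bound (5).
set.seed(5)
mu <- 2; sigma <- 1.5; n <- 50; delta <- 0.05; R <- 10000
xbar  <- replicate(R, mean(rnorm(n, mu, sigma)))
bound <- sqrt(sigma^2 / (n * delta))
mean(abs(xbar - mu) < bound)   # should be at least 1 - delta = 0.95 (typically much higher)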
  • 47. Weak Law of Large Numbers Why is all the above of any interest to statistical learning theory? Equation (3) states the much needed consistency of X̄n as an estimator of µ. Equation (4), by showing the dependence of the bound on n and ε, helps assess the rate at which X̄n converges to µ. Equation (5), by providing a confidence interval, helps compute bounds on the unknown true mean µ as a function of the empirical mean X̄n and the confidence level 1 − δ. Finally, how does one go about constructing estimators with all the above properties? DZ Ý Data Science MMW 2018 October 10, 2018 47 / 127
  • 48. Effect of Bias-Variance Dilemma of Prediction Optimal Prediction achieved at the point of bias-variance trade-off. DZ Ý Data Science MMW 2018 October 10, 2018 48 / 127
  • 49. Theoretical Aspects of Statistical Learning For binary classification using the so-called 0/1 loss function, the Vapnik-Chervonenkis inequality takes the form P( sup_{f∈F} |R̂n(f) − R(f)| > ε ) ≤ 8 S(F, n) e^{−nε²/32} (6), which is also expressed in terms of expectation as E[ sup_{f∈F} |R̂n(f) − R(f)| ] ≤ 2 sqrt( (log S(F, n) + log 2) / n ) (7). The quantity S(F, n) plays an important role in the VC theory and will be explored in greater detail later. Note that these bounds, including the one presented earlier in the VC Fundamental Machine Learning Theorem, are not asymptotic bounds. They hold for any n. The bounds are nice and easy if h or S(F, n) is known. Unfortunately the bound may exceed 1, making it useless. DZ Ý Data Science MMW 2018 October 10, 2018 49 / 127
  • 50. Components of Statistical Machine Learning Interestingly, all those 4 components of classical estimation theory will be encountered again in statistical learning theory. Essentially, the 4 components of statistical learning theory consist of finding the answers to the following questions: (a) What are the necessary and sufficient conditions for the consistency of a learning process based on the ERM principle? This leads to the Theory of consistency of learning processes. (b) How fast is the rate of convergence of the learning process? This leads to the Nonasymptotic theory of the rate of convergence of learning processes. (c) How can one control the rate of convergence (the generalization ability) of the learning process? This leads to the Theory of controlling the generalization ability of learning processes. (d) How can one construct algorithms that can control the generalization ability of the learning process? This leads to the Theory of constructing learning algorithms. DZ Ý Data Science MMW 2018 October 10, 2018 50 / 127
  • 51. Error Decomposition revisited A reasoning on error decomposition and consistency of estimators, along with rates, bounds and algorithms, applies to function spaces: indeed, the difference between the true risk R(fn) associated with fn and the overall minimum risk R∗ can be decomposed to explore in greater detail the sources of error in the function estimation process: R(fn) − R∗ = [R(fn) − R(f+)] (estimation error) + [R(f+) − R∗] (approximation error). (8) A reasoning similar to bias-variance trade-off and consistency can be made, with the added complication brought by the need to distinguish between the true risk functional and the empirical risk functional, and also by the need to assess both pointwise behaviors and uniform behaviors. In a sense, one needs to generalize the decomposition and the law of large numbers to function spaces. DZ Ý Data Science MMW 2018 October 10, 2018 51 / 127
  • 52. Approximation-Estimation Trade-Off [Figure annotations: optimal smoothing, less smoothing (bias squared), more smoothing (variance), true risk.] Figure: Illustration of the qualitative behavior of the dependence of bias versus variance on a tradeoff parameter such as λ or h. For small values the variability is too high; for large values the bias gets large. DZ Ý Data Science MMW 2018 October 10, 2018 52 / 127
  • 53. Consistency of the Empirical Risk Minimization principle The ERM principle is consistent if it provides a sequence of functions f̂n, n = 1, 2, · · ·, for which both the expected risk R(f̂n) and the empirical risk Rn(f̂n) converge to the minimal possible value of the risk R(f+) in the function class under consideration, i.e., R(f̂n) →P inf_{f∈F} R(f) = R(f+) and Rn(f̂n) →P inf_{f∈F} R(f) = R(f+) as n → ∞. Vapnik discusses the details of this theorem at length, and extends the exploration to include the difference between what he calls trivial consistency and non-trivial consistency. DZ Ý Data Science MMW 2018 October 10, 2018 53 / 127
  • 54. Consistency of the Empirical Risk Minimization principle To better understand consistency in function spaces, consider the sequence of random variables ξn = sup_{f∈F} ( R(f) − Rn(f) ), (9) and consider studying lim_{n→∞} P( sup_{f∈F} ( R(f) − Rn(f) ) > ε ) = 0, ∀ε > 0. Vapnik shows that the sequence of the means of the random variables ξn converges to zero as the number n of observations increases. He also remarks that the sequence of random variables ξn converges in probability to zero if the set of functions F contains a finite number m of elements. We will show that later in the case of pattern recognition. DZ Ý Data Science MMW 2018 October 10, 2018 54 / 127
  • 55. Consistency of the Empirical Risk Minimization principle It remains then to describe the properties of the set of functions F and of the probability measure P(x, y) under which the sequence of random variables ξn converges in probability to zero: lim_{n→∞} P( sup_{f∈F} [R(f) − Rn(f)] > ε  or  sup_{f∈F} [Rn(f) − R(f)] > ε ) = 0. Recall that Rn(f) is the realized disagreement between classifier f and the truth about the label y of x, based on the information contained in the sample D. It is easy to see that, for a given (fixed) function (classifier) f, E[Rn(f)] = R(f). (10) Note that while this pointwise unbiasedness of the empirical risk is a good bottomline property to have, it is not enough. More is needed, as the comparison is against R(f+), or even better yet R(f∗). DZ Ý Data Science MMW 2018 October 10, 2018 55 / 127
  • 56. Consistency of the Empirical Risk Remember that the goal of statistical function estimation is to devise a technique (strategy) that chooses from the function class F the one function whose true risk is as close as possible to the lowest risk in class F. The question arises: since one cannot calculate the true error, how can one devise a learning strategy for choosing classifiers based on it? Tentative answer: at least devise strategies that yield functions for which the upper bound on the theoretical risk is as tight as possible, so that one can make confidence statements of the form: with probability 1 − δ over an i.i.d. draw of some sample according to the distribution P, the expected future error rate of some classifier is bounded by a function g(δ, error rate on sample) of δ and the error rate on the sample, i.e., Pr( TestError ≤ TrainError + φ(n, δ, κ(F)) ) ≥ 1 − δ. DZ Ý Data Science MMW 2018 October 10, 2018 56 / 127
  • 57. Foundation Result in Statistical Learning Theory Theorem: (Vapnik and Chervonenkis, 1971) Let F be a class of functions implementing some learning machines, and let ζ = V Cdim(F) be the VC dimension of F. Let the theoretical and the empirical risks be defined as earlier, and consider any data distribution in the population of interest. Then ∀f ∈ F, the prediction error (theoretical risk) is bounded by R(f) ≤ R̂n(f) + sqrt( ( ζ(log(2n/ζ) + 1) − log(η/4) ) / n ) (11) with probability of at least 1 − η; equivalently, Pr( TestError ≤ TrainError + sqrt( ( ζ(log(2n/ζ) + 1) − log(η/4) ) / n ) ) ≥ 1 − η. DZ Ý Data Science MMW 2018 October 10, 2018 57 / 127
  • 58. Optimism of the Training Error [Figure: expected training error E[Training Error] and expected test error E[Test Error] plotted against model complexity.] DZ Ý Data Science MMW 2018 October 10, 2018 58 / 127
  • 59. Bounds on the Generalization Error For instance, using Chebyshev's inequality and the fact that E[Rn(f)] = R(f), it is easy to see that, for a given classifier f and a sample D = {(x1, y1), · · · , (xn, yn)}, Pr[|Rn(f) − R(f)| > ǫ] ≤ R(f)(1 − R(f))/(nǫ²). To estimate the true but unknown error R(f) with a probability of at least 1 − δ, it makes sense to use inversion, i.e., set δ = R(f)(1 − R(f))/(nǫ²), so that ǫ = sqrt( R(f)(1 − R(f))/(nδ) ). Owing to the fact that max_{R(f)∈[0,1]} R(f)(1 − R(f)) = 1/4, we have sqrt( R(f)(1 − R(f))/(nδ) ) ≤ ( 1/(4nδ) )^{1/2}. DZ Ý Data Science MMW 2018 October 10, 2018 59 / 127
  • 60. Bounds on the Generalization Error Based on Chebyshev's inequality, for a given classifier f, with a probability of at least 1 − δ, the bound on the difference between the true risk R(f) and the empirical risk Rn(f) is given by |Rn(f) − R(f)| < ( 1/(4nδ) )^{1/2}. Recall that one of the goals of statistical learning theory is to assess the rate of convergence of the empirical risk to the true risk, which translates into assessing how tight the corresponding bounds on the true risk are. In fact, it turns out many bounds can be so loose as to become useless. It turns out that the above Chebyshev-based bound is not a good one, at least compared to bounds obtained using the so-called Hoeffding's inequality. DZ Ý Data Science MMW 2018 October 10, 2018 60 / 127
  • 61. Bounds on the Generalization Error Theorem: (Hoeffding's inequality) Let Z1, Z2, · · · , Zn be a collection of i.i.d random variables with Zi ∈ [a, b]. Then, ∀ǫ > 0, Pr( |(1/n) Σ_{i=1}^{n} Zi − E[Z]| > ǫ ) ≤ 2 exp( −2nǫ²/(b − a)² ). Corollary: (Hoeffding's inequality for sample proportions) Let Z1, Z2, · · · , Zn be a collection of i.i.d random variables from a Bernoulli distribution with "success" probability p. Let pn = (1/n) Σ_{i=1}^{n} Zi. Clearly, pn ∈ [0, 1] and E[pn] = p. Therefore, as a direct consequence of the above theorem, we have, ∀ǫ > 0, Pr[|pn − p| > ǫ] ≤ 2 exp(−2nǫ²). DZ Ý Data Science MMW 2018 October 10, 2018 61 / 127
  • 62. Bounds on the Generalization Error So we have, ∀ǫ > 0, Pr[|pn − p| > ǫ] ≤ 2 exp(−2nǫ²). Now, setting δ = 2 exp(−2ǫ²n), it is straightforward to see that the Hoeffding-based 1 − δ level confidence bound on the difference between R(f) and Rn(f) for a fixed classifier f is given by |Rn(f) − R(f)| < ( ln(2/δ) / (2n) )^{1/2}. Which of the two bounds is tighter? Clearly, we need to find out which of ln(2/δ) or 1/(2δ) is larger. This is the same as comparing exp(1/(2δ)) and 2/δ, which in turn means comparing a^{2/δ} and 2/δ, where a = exp(1/4). For the small values of δ used in practice (e.g., δ ≤ 0.1), a^{2/δ} > 2/δ, so that ln(2/δ) < 1/(2δ) and Hoeffding's bound is tighter. The graph also confirms this. DZ Ý Data Science MMW 2018 October 10, 2018 62 / 127
  • 63. Bounds on the Generalization Error [Figure: theoretical bound f(n, δ) versus sample size n, comparing the Chernoff/Hoeffding bound with the Chebyshev bound for proportions, for δ = 0.01 (left panel) and δ = 0.05 (right panel).] DZ Ý Data Science MMW 2018 October 10, 2018 63 / 127
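  The comparison in the figure above is easy to reproduce; here is a short R sketch (added for this write-up) plotting the two bounds as functions of n, with δ = 0.01 as an illustrative choice.
# Sketch: Chebyshev bound (1/(4*n*delta))^(1/2) versus the Hoeffding/Chernoff
# bound (log(2/delta)/(2*n))^(1/2) as functions of the sample size n.
delta <- 0.01
n     <- seq(100, 12000, by = 100)
chebyshev <- sqrt(1 / (4 * n * delta))
hoeffding <- sqrt(log(2 / delta) / (2 * n))
matplot(n, cbind(chebyshev, hoeffding), type = "l", lty = 1:2,
        xlab = "n = Sample size", ylab = "Theoretical bound f(n, delta)")
legend("topright", c("Chebyshev", "Hoeffding"), lty = 1:2)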
  • 64. Beyond Chernoff and Hoeffding In all the above, we only addressed pointwise convergence of Rn(f) to R(f), i.e., for a fixed machine f ∈ F, we studied the convergence of Rn(f) to R(f). Needless to say, pointwise convergence is of very little use here. A more interesting issue to address is uniform convergence. That is, considering all machines f ∈ F at once, determine the necessary and sufficient conditions for the convergence of Pr( sup_{f∈F} |Rn(f) − R(f)| > ǫ ) to 0. Clearly, such a study extends the Law of Large Numbers to function spaces, thereby providing tools for the construction of bounds on the theoretical errors of learning machines. DZ Ý Data Science MMW 2018 October 10, 2018 64 / 127
  • 65. Beyond Chernoff and Hoeffding Since uniform convergence requires the consideration of the entirety of the function space of interest, care needs to be taken regarding the dimensionality of the function space. Uniform convergence will prove substantially easier to handle for finite function classes than for infinite dimensional function spaces. Indeed, in infinite dimensional spaces, one will need to introduce concepts measuring the capacity of the function space, through devices such as the VC dimension and covering numbers. DZ Ý Data Science MMW 2018 October 10, 2018 65 / 127
  • 66. Beyond Chernoff and Hoeffding Theorem: If Rn(f) and R(f) are close for all f ∈ F, i.e., ∀ǫ > 0, sup_{f∈F} |Rn(f) − R(f)| ≤ ǫ, then R(fn) − R(f+) ≤ 2ǫ. Proof: Recall that we defined fn as the minimizer of the empirical risk Rn(f) over the function class F. Recall also that Rn(fn) can be made as small as possible, as we saw earlier. Therefore, with f+ being the minimizer of the true risk in class F, we always have Rn(f+) − Rn(fn) ≥ 0. As a result, R(fn) = R(fn) − R(f+) + R(f+) ≤ Rn(f+) − Rn(fn) + R(fn) − R(f+) + R(f+) ≤ 2 sup_{f∈F} |R(f) − Rn(f)| + R(f+). DZ Ý Data Science MMW 2018 October 10, 2018 66 / 127
  • 67. Beyond Chernoff and Hoeffding Proof (continued): Recall that we defined fn as the minimizer of the empirical risk Rn(f) over the function class F, and that Rn(fn) can be made as small as possible, as we saw earlier. Therefore, with f+ being the minimizer of the true risk in class F, we always have Rn(f+) − Rn(fn) ≥ 0. As a result, R(fn) = R(fn) − R(f+) + R(f+) ≤ Rn(f+) − Rn(fn) + R(fn) − R(f+) + R(f+) ≤ 2 sup_{f∈F} |R(f) − Rn(f)| + R(f+). Consequently, R(fn) − R(f+) ≤ 2 sup_{f∈F} |R(f) − Rn(f)|, as required. DZ Ý Data Science MMW 2018 October 10, 2018 67 / 127
  • 68. Beyond Chernoff and Hoeffding Corollary: A direct consequence of the above theorem is the following: For a given machine f ∈ F, R(f) ≤ Rn(f) + ( ln(2/δ) / (2n) )^{1/2} with probability at least 1 − δ, ∀δ > 0. If the function class F is finite, i.e. F = {f1, f2, · · · , fm}, where m = |F| = #F = number of functions in the class F, then it can be shown that, for all f ∈ F, R(f) ≤ Rn(f) + ( (ln m + ln(2/δ)) / (2n) )^{1/2} with probability at least 1 − δ, ∀δ > 0. DZ Ý Data Science MMW 2018 October 10, 2018 68 / 127
  • 69. Beyond Chernoff and Hoeffding It can also be shown that R(f̂n) ≤ Rn(f+) + 2 ( (ln m + ln(2/δ)) / (2n) )^{1/2} (12) with probability at least 1 − δ, ∀δ > 0, where as before f+ = arg inf_{f∈F} R(f) and f̂n = arg min_{f∈F} Rn(f). Equation (12) is of foundational importance, because it reveals clearly that the size of the function class controls the uniform bound on the crucial generalization error: indeed, if the size m of the function class F increases, the empirical risk Rn(f̂n) decreases while the complexity term involving ln m increases, so that the trade-off between the two is controlled by the size m of the function class. DZ Ý Data Science MMW 2018 October 10, 2018 69 / 127
  • 70. Vapnik-Chervonenkis Dimension Definition: (Shattering) Let X ≠ ∅ be any non empty domain. Let F ⊆ 2^X be any non-empty class of functions having X as their domain. Let S ⊆ X be any finite subset of the domain X. Then S is said to be shattered by F iff {S ∩ f | f ∈ F} = 2^S. In other words, F shatters S if any subset of S can be obtained by intersecting S with some set from F. Example: A class F ⊆ 2^X of classifiers is said to shatter a set x1, x2, · · · , xn of n points if, for any possible configuration of labels y1, y2, · · · , yn, we can find a classifier f ∈ F that reproduces those labels. DZ Ý Data Science MMW 2018 October 10, 2018 70 / 127
  • 71. Vapnik-Chervonenkis Dimension Definition (VC-dimension): Let X ≠ ∅ be any non empty learning domain. Let F ⊆ 2^X be any non-empty class of functions having X as their domain. The VC dimension of F is the cardinality of the largest finite set S ⊆ X that is shattered by F, i.e., V Cdim(F) := max{ |S| : S ⊆ X is shattered by F }. Note: If arbitrarily large finite sets are shattered by F, then V Cdim(F) = ∞. In other words, if there is no largest finite set shattered by F, then V Cdim(F) = ∞. Example: The VC dimension of a class F ⊆ 2^X of classifiers is the largest number of points that F can shatter. DZ Ý Data Science MMW 2018 October 10, 2018 71 / 127
  • 72. Vapnik-Chervonenkis Dimension Remarks: If V Cdim(F) = d, then there exists a finite set S ⊆ X such that |S| = d and S is shattered by F. Importantly, every set S ⊆ X such that |S| > d is not shattered by F. Clearly, we do not expect to learn anything until we have at least d training points. Intuitively, this means that an infinite VC dimension is not desirable, as it could imply the impossibility of learning the concept underlying any data from the population under consideration. However, a finite VC dimension does not guarantee the learnability of the concept underlying any data from the population under consideration either. Fact: Let F be any finite function (concept) class. Then, since it requires 2^d distinct concepts to shatter a set of cardinality d, no set of cardinality greater than log2 |F| can be shattered. Therefore, log2 |F| is always an upper bound for the VC dimension of finite concept classes. DZ Ý Data Science MMW 2018 October 10, 2018 72 / 127
  • 73. Vapnik-Chervonenkis Dimension To gain insights into the central concept of VC dimension, we herein consider a few examples of practical interest for which the VC dimension can be found. VC dimension of the space of separating hyperplanes: Let X = Rp be the domain for the binary Y ∈ {−1, +1} classification task, and consider using hyperplanes to separate the points of X. Let F denote the class of all such separating hyperplanes. Then, V Cdim(F) = p + 1 Intuitively, the following pictures for the case of X = R2 help see why the VC dimension is p + 1. DZ Ý Data Science MMW 2018 October 10, 2018 73 / 127
  • 74. Foundation Result in Statistical Learning Theory Theorem: (Vapnik and Chervonenkis, 1971) Let F be a class of functions implementing some learning machines, and let ζ = V Cdim(F) be the VC dimension of F. Let the theoretical and the empirical risks be defined as earlier, and consider any data distribution in the population of interest. Then ∀f ∈ F, the prediction error (theoretical risk) is bounded by R(f) ≤ R̂n(f) + sqrt( ( ζ(log(2n/ζ) + 1) − log(η/4) ) / n ) (13) with probability of at least 1 − η; equivalently, Pr( TestError ≤ TrainError + sqrt( ( ζ(log(2n/ζ) + 1) − log(η/4) ) / n ) ) ≥ 1 − η. DZ Ý Data Science MMW 2018 October 10, 2018 74 / 127
  • 75. Confidence Interval for a proportion p ∈ ( p̂ − z_{α/2} sqrt( p̂(1−p̂)/n ), p̂ + z_{α/2} sqrt( p̂(1−p̂)/n ) ) with 100(1 − α)% confidence. [Figure: 95% confidence intervals built from 100 independent samples; here 98 intervals out of 100 contain p.] DZ Ý Data Science MMW 2018 October 10, 2018 75 / 127
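  A simulation along the lines of this figure is a few lines of R; the sketch below (added here; p, n, α and the number of replications are illustrative assumptions) builds 100 such intervals and counts how many contain the true p.
# Sketch: coverage of the normal-approximation confidence interval for a proportion.
set.seed(6)
p <- 0.4; n <- 200; alpha <- 0.05; R <- 100
covered <- replicate(R, {
  phat <- mean(rbinom(n, 1, p))
  half <- qnorm(1 - alpha / 2) * sqrt(phat * (1 - phat) / n)
  (p >= phat - half) && (p <= phat + half)
})
sum(covered)   # number of intervals (out of 100) containing the true p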
  • 76. Confidence Interval for a proportion p ∈ ( p̂ − z_{α/2} sqrt( p̂(1−p̂)/n ), p̂ + z_{α/2} sqrt( p̂(1−p̂)/n ) ) with 100(1 − α)% confidence. [Figure: 95% confidence intervals built from 100 independent samples; here 94 intervals out of 100 contain p.] DZ Ý Data Science MMW 2018 October 10, 2018 76 / 127
  • 77. Confidence Interval for a proportion p ∈ ( p̂ − z_{α/2} sqrt( p̂(1−p̂)/n ), p̂ + z_{α/2} sqrt( p̂(1−p̂)/n ) ) with 100(1 − α)% confidence. [Figure: 90% confidence intervals built from 100 independent samples; here 92 intervals out of 100 contain p.] DZ Ý Data Science MMW 2018 October 10, 2018 77 / 127
  • 78. Confidence Interval for a population mean µ ∈ ( x̄ − z_{α/2} sqrt(σ²/n), x̄ + z_{α/2} sqrt(σ²/n) ) with 100 × (1 − α)% confidence. [Figure: 95% confidence intervals built from 100 independent samples; here 98 intervals out of 100 contain µ.] DZ Ý Data Science MMW 2018 October 10, 2018 78 / 127
  • 79. Confidence Interval for a population mean µ ∈ ( x̄ − z_{α/2} sqrt(σ²/n), x̄ + z_{α/2} sqrt(σ²/n) ) with 100 × (1 − α)% confidence. [Figure: 85% confidence intervals built from 100 independent samples; here 90 intervals out of 100 contain µ.] DZ Ý Data Science MMW 2018 October 10, 2018 79 / 127
  • 80. Effect of Bias-Variance Dilemma of Prediction Optimal Prediction achieved at the point of bias-variance trade-off. DZ Ý Data Science MMW 2018 October 10, 2018 80 / 127
  • 81. VC Bound for Separating Hyperplanes Let L represent the function class of binary linear classifiers in q dimensions, i.e., L = { f : ∃w ∈ IRq, w0 ∈ IR, f(x) = sign(w⊤x + w0), ∀x ∈ X }, then VCDim(L) = h = q + 1. With labels taken from {−1, +1}, and using the 0/1 loss function, we have the fundamental theorem from Vapnik and Chervonenkis, namely: for every f ∈ L and n > h, with probability at least 1 − η, we have R(f) ≤ Rn(f) + sqrt( ( h(log(2n/h) + 1) + log(4/η) ) / n ). The above result holds true for LDA. DZ Ý Data Science MMW 2018 October 10, 2018 81 / 127
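  To get a feel for how loose or tight this bound is, the small helper below (an added sketch, not from the slides; the example values of n, h, η and the empirical risk are illustrative) evaluates its right-hand side.
# Sketch: evaluate the VC bound  R(f) <= Rn(f) + sqrt((h*(log(2n/h)+1) + log(4/eta))/n).
vc_bound <- function(emp_risk, n, h, eta = 0.05) {
  emp_risk + sqrt((h * (log(2 * n / h) + 1) + log(4 / eta)) / n)
}
# Example: a linear classifier in q = 10 dimensions (h = 11) with 5% training error.
vc_bound(emp_risk = 0.05, n = c(1e3, 1e4, 1e5), h = 11)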
  • 82. Appeal of the VC Bound Note: One of the greatest appeals of the VC bound is that, though applicable to function classes of infinite dimension, it preserves the same intuitive form as the bound derived for finite dimensional F. Essentially, using the VC dimension concept, the number L of possible labeling configurations obtainable from F with V Cdim(F) = ζ over n points satisfies L ≤ (en/ζ)^ζ. (14) The VC bound is essentially obtained by replacing ln |F| with ln L in the expression of the risk bound for finite F. The most important part of the above theorem is the fact that the generalization ability of a learning machine depends on both the empirical risk and the complexity of the class of functions used, which is measured here by the VC dimension of (Vapnik and Chervonenkis, 1971). DZ Ý Data Science MMW 2018 October 10, 2018 82 / 127
  • 83. Appeal of the VC Bound Also, the bounds offered here are distribution-free, since no assumption is made about the distribution of the population. The details of this important result will be discussed again in chapters 6 and 7, where we will present other measures of the capacity of a class of functions. Remark: From the expression of the VC bound, it is clear that an intuitively appealing way to improve the predictive performance (reduce prediction error) of a class of machines is to achieve a trade-off (compromise) between a small VC dimension and minimization of the empirical risk. At first, it may seem as if the VC dimension is acting in a way similar to the number of parameters, since it serves as a measure of the complexity of F. In this spirit, the following is a possible guiding principle. DZ Ý Data Science MMW 2018 October 10, 2018 83 / 127
  • 84. Appeal of the VC Bound At first, it may seem as if the VC dimension is acting in a way similar to the number of parameters, since it serves as a measure of the complexity of F. In this spirit, the following is a possible guiding principle. Intuition: One should seek to construct a classifier that achieves the best trade-off (balance, compromise) between the complexity of the function class, measured by the VC dimension, and the fit to the training data, measured by the empirical risk. Now equipped with this sound theoretical foundation, one can then go on to the implementation of various learning machines. We shall use R to discover some of the most commonly used learning machines. DZ Ý Data Science MMW 2018 October 10, 2018 84 / 127
  • 85. Regression Analysis DZ Ý Data Science MMW 2018 October 10, 2018 85 / 127
  • 86. Regression Analysis Dataset
rating complaints privileges learning raises critical advance
  43       51         30        39      61      92      45
  63       64         51        54      63      73      47
  71       70         68        69      76      86      48
  61       63         45        47      54      84      35
  81       78         56        66      71      83      47
  43       55         49        44      54      49      34
  58       67         42        56      66      68      35
  71       75         50        55      70      66      41
  72       82         72        67      71      83      31
  67       61         45        47      62      80      41
  64       53         53        58      58      67      34
  67       60         47        39      59      74      41
  69       62         57        42      55      63      25
What are the factors that drive the rating of companies? head(attitude) DZ Ý Data Science MMW 2018 October 10, 2018 86 / 127
  • 87. Regression Analysis Dataset
lcavol lweight age  lbph svi   lcp gleason pgg45  lpsa
 -0.58    2.77  50 -1.39   0 -1.39       6     0 -0.43
 -0.99    3.32  58 -1.39   0 -1.39       6     0 -0.16
 -0.51    2.69  74 -1.39   0 -1.39       7    20 -0.16
 -1.20    3.28  58 -1.39   0 -1.39       6     0 -0.16
  0.75    3.43  62 -1.39   0 -1.39       6     0  0.37
 -1.05    3.23  50 -1.39   0 -1.39       6     0  0.77
  0.74    3.47  64  0.62   0 -1.39       6     0  0.77
  0.69    3.54  58  1.54   0 -1.39       6     0  0.85
 -0.78    3.54  47 -1.39   0 -1.39       6     0  1.05
  0.22    3.24  63 -1.39   0 -1.39       6     0  1.05
  0.25    3.60  65 -1.39   0 -1.39       6     0  1.27
 -1.35    3.60  63  1.27   0 -1.39       6     0  1.27
What are the factors responsible for prostate cancer? library(ElemStatLearn); data(prostate) DZ Ý Data Science MMW 2018 October 10, 2018 87 / 127
  • 88. Motivating Example Regression Analysis Consider the univariate function f ∈ C([0, 2π]) given by f(x) = (π/2) x + (3π/4) cos( (π/2)(1 + x) ) (15). Simulate an artificial iid data set D = {(xi, yi), i = 1, · · · , n}, with n = 99 and σ = π/3: xi ∈ [0, 2π] drawn deterministically and equally spaced, Yi = f(xi) + εi, εi iid ∼ N(0, σ²). The R code is
n <- 99
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
x <- seq(0, 2*pi, length=n)
y <- f(x) + rnorm(n, 0, pi/3)
DZ Ý Data Science MMW 2018 October 10, 2018 88 / 127
  • 89. Motivating Example Regression Analysis [Figure: noisy data (x, y) generated from the function in (15).] Question: What is the best hypothesis space to learn the underlying function? DZ Ý Data Science MMW 2018 October 10, 2018 89 / 127
  • 90. Bias-Variance Tradeoff in Action [Figure: fits of increasing complexity to the simulated data; panel (a) Underfit, panel (b) Optimal fit.] DZ Ý Data Science MMW 2018 October 10, 2018 90 / 127
  • 91. Introduction to Regression Analysis We have, xi = (xi1, · · · , xip)⊤ ∈ IRp and Yi ∈ IR, and data set D = (x1, Y1), (x2, Y2), · · · , (xn, Yn) We assume that the response variable Yi is related to the explanatory vector xi through a function f via the model, Yi = f(xi) + ξi, i = 1, · · · , n (16) The explanatory vectors xi are fixed (non-random) The regression function f : IRp → IR is unknown The error terms ξi are iid Gaussian, i.e. ξi iid ∼ N(0, σ2 ) Goal: We seek to estimate the function f using the data in D. DZ Ý Data Science MMW 2018 October 10, 2018 91 / 127
  • 92. Formulation of the regression problem Let X and Y be two random variables s.t E[Y ] = µ and E[Y 2 ] < ∞ . Goal: Find the best predictor f(X) of Y given X. Important Questions How does one define ”best”? Is the very best attainable in practice? What does the function f look like? (Function class) How do we select a candidate from the chosen class of functions? How hard is it computationally to find the desired function? DZ Ý Data Science MMW 2018 October 10, 2018 92 / 127
  • 93. Loss functions 1 When f(X) is used to predict Y , a loss is incurred. Question: How is such a loss quantified? Answer: Define a suitable loss function. 2 Common loss functions in regression Squared error loss or (ℓ2) loss ℓ(Y, f(X)) = (Y − f(X))2 ℓ2 is by far the most used (prevalent) because of its differentiability. Unfortunately, not very robust to outliers. Absolute error loss or (ℓ1) loss ℓ(Y, f(X)) = |Y − f(X)| ℓ1 is more robust to outliers, but not differentiable at zero. 3 Note that ℓ(Y, f(X)) is a random variable. DZ Ý Data Science MMW 2018 October 10, 2018 93 / 127
  • 94. Risk Functionals and Cost Functions 1 Definition of a risk functional: R(f) = E[ℓ(Y, f(X))] = ∫∫_{X×Y} ℓ(y, f(x)) pXY(x, y) dx dy. R(f) is the expected loss over all pairs of the cross space X × Y. 2 Ideally, one seeks the best out of all possible functions, i.e., f∗(X) = arg min_f R(f) = arg min_f E[ℓ(Y, f(X))]; f∗(·) is such that R∗ = R(f∗) = min_f R(f). 3 This ideal function cannot be found in practice, because the distributions are unknown, which makes it impossible to form an expression for R(f). DZ Ý Data Science MMW 2018 October 10, 2018 94 / 127
  • 95. Cost Functions and Risk Functionals Theorem: Under regularity conditions, f∗(X) = E[Y |X] = arg min_f E[(Y − f(X))²]. Under the squared error loss, the optimal function f∗ that yields the best prediction of Y given X is none other than the expected value of Y given X. Since we know neither pXY(x, y) nor pX(x), the conditional expectation E[Y |X = x] = ∫_Y y pY|X(y|x) dy = ∫_Y y ( pXY(x, y) / pX(x) ) dy cannot be directly computed. DZ Ý Data Science MMW 2018 October 10, 2018 95 / 127
  • 96. Empirical Risk Minimization Let D = {(X1, Y1), (X2, Y2), · · · , (Xn, Yn)} represent an iid sample. The empirical version of the risk functional is R̂(f) = MSE(f) = Ê[(Y − f(X))²] = (1/n) Σ_{i=1}^{n} (Yi − f(Xi))². It turns out that R̂(f) provides an unbiased estimator of R(f). We therefore seek the best by the empirical standard, f̂∗(X) = arg min_f MSE(f) = arg min_f (1/n) Σ_{i=1}^{n} (Yi − f(Xi))². Since it is impossible to search over all possible functions, it is usually crucial to choose the "right" function space. DZ Ý Data Science MMW 2018 October 10, 2018 96 / 127
  • 97. Function spaces For the function estimation task for instance, one could assume that the input space X is a closed and bounded interval of IR, i.e. X = [a, b], and then consider estimating the dependencies between x and y from within the space F all bounded functions on X = [a, b], i.e., F = {f : X → IR| ∃B ≥ 0, such that |f(x)| ≤ B, for all x ∈ X}. One could even be more specific and make the functions of the above F continuous, so that the space to search becomes F = {f : [a, b] → IR| f is continuous} = C([a, b]), which is the well-known space of all continuous functions on a closed and bounded interval [a, b]. This is indeed a very important function space. DZ Ý Data Science MMW 2018 October 10, 2018 97 / 127
  • 98. Space of Univariate Polynomials In fact, polynomial regression consists of searching from a function space that is a subspace of C([a, b]). In other words, when we are doing the very common polynomial regression, we are searching the space P([a, b]) = {f ∈ C([a, b]) | f is a polynomial with real coefficients}. It is interesting to note that Weierstrass did prove that P([a, b]) is dense in C([a, b]). One considers the space of all polynomials of degree at most p, i.e., F = P^p([a, b]) = { f ∈ C([a, b]) | ∃β0, β1, · · · , βp ∈ IR such that f(x) = Σ_{j=0}^{p} βj x^j, ∀x ∈ [a, b] }. DZ Ý Data Science MMW 2018 October 10, 2018 98 / 127
  • 99. Empirical Risk Minimization in F Having chosen a class F of functions, we can now seek f̂(X) = arg min_{f∈F} MSE(f) = arg min_{f∈F} (1/n) Σ_{i=1}^{n} (Yi − f(Xi))². We are seeking the best function in the function space chosen. For instance, if the function space is the space of all polynomials of degree p on some interval [a, b], finding f̂ boils down to estimating the coefficients of the polynomial using the data, namely f̂(x) = β̂0 + β̂1 x + β̂2 x² + · · · + β̂p x^p, where, using β = (β0, β1, · · · , βp)⊤, we have β̂ = arg min_{β∈IR^{p+1}} (1/n) Σ_{i=1}^{n} ( Yi − Σ_{j=0}^{p} βj x_i^j )². DZ Ý Data Science MMW 2018 October 10, 2018 99 / 127
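  For the simulated data of the motivating example, this least-squares search over P^p can be carried out with lm(); the sketch below is an addition, and the degree p = 5 is an illustrative choice.
# Sketch: empirical risk minimization over polynomials of degree p via least squares,
# using data simulated as in the motivating example. The degree p = 5 is illustrative.
set.seed(7)
n <- 99
f <- function(x){(pi/2)*x + (3*pi/4)*cos((pi/2)*(1+x))}
x <- seq(0, 2*pi, length = n)
y <- f(x) + rnorm(n, 0, pi/3)
p   <- 5
fit <- lm(y ~ poly(x, p, raw = TRUE))   # estimates beta_0, ..., beta_p
coef(fit)                               # fitted coefficients beta-hat
mean(residuals(fit)^2)                  # minimized empirical risk (training MSE)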
  • 100. Important Aspects of Statistical Learning It is very tempting at first to use the data at hand to find/build the f̂ that makes MSE(f̂) the smallest. For instance, the higher the value of p, the smaller MSE(f̂(·)) will get. The estimate β̂ = (β̂0, β̂1, · · · , β̂p)⊤ of β = (β0, β1, · · · , βp)⊤ is a random variable, and as a result the estimate f̂(x) = β̂0 + β̂1 x + β̂2 x² + · · · + β̂p x^p of f(x) is also a random variable. Since f̂(x) is a random variable, we must compute important aspects like its bias B[f̂(x)] = E[f̂(x)] − f(x) and its variance V[f̂(x)]. We have a dilemma: if we make f̂ complex (large p), we make the bias small but the variance is increased; if we make f̂ simple (small p), we make the bias large but the variance is decreased. Most of Modern Statistical Learning is rich with model selection techniques that seek to achieve a trade-off between bias and variance to get the optimal model. Principle of parsimony (sparsity), Ockham's razor principle. DZ Ý Data Science MMW 2018 October 10, 2018 100 / 127
  • 101. Effect of Bias-Variance Dilemma of Prediction Optimal Prediction achieved at the point of bias-variance trade-off. DZ Ý Data Science MMW 2018 October 10, 2018 101 / 127
  • 102. Theoretical Aspects of Statistical Regression Learning Just like we have a VC bound for classification, there is one for regression, i.e. when Y = IR and R̂n(f) = (1/n) Σ_{i=1}^{n} |yi − f(xi)|² is the squared error loss. Indeed, for every f ∈ F, with probability at least 1 − η, we have R(f) ≤ R̂n(f) / (1 − c√δ)_+, where δ = (a/n)( v + v log(bn/v) − log(η/4) ). Note once again, as before, that these bounds are not asymptotic. Unfortunately these bounds are known to be very loose in practice. DZ Ý Data Science MMW 2018 October 10, 2018 102 / 127
  • 103. The pitfalls of memorization and overfitting The trouble (or limitation) with naively using a criterion on the whole sample lies in the fact that, given a sample (x1, y1), (x2, y2), · · · , (xn, yn), the function f̂memory defined by f̂memory(xi) = yi, i = 1, · · · , n, always achieves the best apparent performance, since MSE(f̂memory) = 0, which is the minimum achievable. Where does the limitation of f̂memory come from? Well, f̂memory does not really learn the dependency between X and Y. While it may capture some of it, it also grabs a lot of the noise in the data, and ends up overfitting the data. As a result of not really learning the structure of the relationship between X and Y and merely memorizing the present sample values, f̂memory will predict very poorly when presented with observations that were not in the sample. DZ Ý Data Science MMW 2018 October 10, 2018 103 / 127
  • 104. Training Set Test Set Split Splitting the data into training set and test set: It makes sense to judge models (functions) not on how they perform on in-sample observations, but instead on how they perform on out-of-sample cases. Given a collection D = {(x1, y1), (x2, y2), · · · , (xn, yn)} of pairs, randomly split D into a training set of size ntr and a test set of size nte, such that ntr + nte = n. Training set: Tr = {(x(tr)1, y(tr)1), (x(tr)2, y(tr)2), · · · , (x(tr)ntr, y(tr)ntr)}. Test set: Te = {(x(te)1, y(te)1), (x(te)2, y(te)2), · · · , (x(te)nte, y(te)nte)}. DZ Ý Data Science MMW 2018 October 10, 2018 104 / 127
  • 105. Training Set Test Set Split For each function class F (linear models, nonparametrics, etc ...): Find the best in its class based on the training set Tr. For all the estimated functions f̂1, f̂2, · · · , f̂m, compute the training error MSETr(f̂j) = (1/ntr) Σ_{i=1}^{ntr} (y(tr)i − f̂j(x(tr)i))² and the test error MSETe(f̂j) = (1/nte) Σ_{i=1}^{nte} (y(te)i − f̂j(x(te)i))². Compute the averages of both MSETr and MSETe over many random splits of the data, and tabulate (if necessary) those averages. Select f̂j∗ such that mean[MSETe(f̂j∗)] < mean[MSETe(f̂j)], j = 1, 2, · · · , m, j ≠ j∗. DZ Ý Data Science MMW 2018 October 10, 2018 105 / 127
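  Here is a compact R sketch of this recipe (an addition to the slides; the simulated data and the candidate polynomial degrees are illustrative assumptions), averaging training and test MSE over repeated random splits.
# Sketch: average training/test MSE over repeated random splits, for a few
# candidate polynomial degrees.
set.seed(8)
n <- 300
f <- function(x){-x + sqrt(2)*sin(pi^(3/2)*x^2)}
dat <- data.frame(x = seq(-1, 1, length = n))
dat$y <- f(dat$x) + rnorm(n, 0, 0.3)
degrees <- c(2, 5, 8, 12); R <- 50; ntr <- round(2 * n / 3)
err <- sapply(degrees, function(p) {
  rowMeans(replicate(R, {
    idx  <- sample(n, ntr)                               # random split
    fit  <- lm(y ~ poly(x, p), data = dat[idx, ])        # best in class on Tr
    yhat <- predict(fit, newdata = dat[-idx, ])
    c(train = mean(residuals(fit)^2),
      test  = mean((dat$y[-idx] - yhat)^2))
  }))
})
colnames(err) <- paste0("degree ", degrees)
round(err, 4)   # choose the degree with the smallest average test error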
  • 106. Computational Comparisons Ideally, we would like to compare the true theoretical performances measured by the risk functional R(f) = E[ℓ(Y, f(X))] = ∫_{X×Y} ℓ(y, f(x)) dP(x, y). (17) Instead, we build the estimators using other optimality criteria, and then compare their predictive performances using the average test error AVTE(·), namely AVTE(f̂) = (1/R) Σ_{r=1}^{R} [ (1/m) Σ_{t=1}^{m} ℓ( y(r)it, f̂r(x(r)it) ) ], (18) where f̂r(·) is the r-th realization of the estimator f̂(·) built using the training portion of the split of D into training set and test set, and (x(r)it, y(r)it) is the t-th observation from the test set at the r-th random replication of the split of D. DZ Ý Data Science MMW 2018 October 10, 2018 106 / 127
  • 107. Learning Machines when n ≪ p Machines Inherently designed to handle p larger than n problems Classification and Regression Trees Support Vector Machines Relevance Vector Machines (n < 500) Gaussian Process Learning Machines (n < 500) k-Nearest Neighbors Learning Machines (Watch for the curse of dimensionality) Kernel Machines in general Machines that cannot inherently handle p larger than n problems, but can do so if regularized with suitable constraints Multiple Linear Regression Models Generalized Linear Models Discriminant Analysis Ensemble Learning Machines Random Subspace Learning Ensembles (Random Forest) Boosting and its extensions DZ Ý Data Science MMW 2018 October 10, 2018 107 / 127
  • 108. Motivating Example Regression Analysis Consider the univariate function f ∈ C([−1, +1]) given by f(x) = −x + √2 sin(π^{3/2} x²) (19). Simulate an artificial iid data set D = {(xi, yi), i = 1, · · · , n}, with n = 99 and σ = 3/10: xi ∈ [−1, +1] drawn deterministically and equally spaced, Yi = f(xi) + εi, εi iid ∼ N(0, σ²). The R code is
n <- 99
f <- function(x){-x + sqrt(2)*sin(pi^(3/2)*x^2)}
x <- seq(-1, +1, length=n)
y <- f(x) + rnorm(n, 0, 3/10)
DZ Ý Data Science MMW 2018 October 10, 2018 108 / 127
  • 109. Estimation Error and Prediction Error Figure: Simple orthogonal polynomial regression with both confidence bands and prediction bands on the test set (data points, fitted curve, lower/upper confidence bands, lower/upper prediction bands). The true function is f(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1]. DZ Ý Data Science MMW 2018 October 10, 2018 109 / 127
  • 110. Training Error and Test Error Table: Average training error and average test error over m = 10 random splits of n = 300 observations generated from a population with true function f(x) = −x + √2 sin(π^{3/2} x²) for x ∈ [−1, +1]. The noise variance in this case is σ² = 0.3². Each split has ntr = 2n/3.
Approximating Function Class    Poly     SVM      RVM      GPR
Average Training Error          0.0998   0.0335   0.0295   0.1861
Average Test Error              0.3866   0.1465   0.1481   0.1556
DZ Ý Data Science MMW 2018 October 10, 2018 110 / 127
  • 111. Unsupervised Learning DZ Ý Data Science MMW 2018 October 10, 2018 111 / 127
  • 112. Finding Patterns in Job Sector Allocations in Europe Example 1: Consider the following portion of observations on job sectors distribution in Europe in the 1990s.
          Agr   Min   Man   PS   Con   SI    Fin   SPS   TC
Italy     15.9  0.6   27.6  0.5  10.0  18.1  1.6   20.1  5.7
Poland    31.1  2.5   25.7  0.9   8.4   7.5  0.9   16.1  6.9
Rumania   34.7  2.1   30.1  0.6   8.7   5.9  1.3   11.7  5.0
USSR      23.7  1.4   25.8  0.6   9.2   6.1  0.5   23.6  9.3
Denmark    9.2  0.1   21.8  0.6   8.3  14.6  6.5   32.2  7.1
France    10.8  0.8   27.5  0.9   8.9  16.8  6.0   22.6  5.7
1 Can European countries be divided into meaningful groups (clusters)? 2 How many concepts? How many clusters (groups) of countries? Analogy: Clustering in such an example can be thought of as unsupervised classification (pattern recognition). DZ Ý Data Science MMW 2018 October 10, 2018 112 / 127
  • 113. Hierarchical Clustering for European Job Sector Data One solution: Mining job sectors in Europe in the 1990s via hierarchical clustering with Manhattan distance and Ward linkage. [Figure: cluster dendrogram from hclust on dist(europe, method = "manhattan") with Ward linkage, with the European countries as leaves.] How does the distance affect the clustering? How does the linkage affect the clustering? What makes a clustering satisfactory? How does one compare two clusterings? Some interesting tasks: 1 Investigate different distances with same linkage 2 Investigate different linkages with same distance DZ Ý Data Science MMW 2018 October 10, 2018 113 / 127
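  The dendrogram above can be reproduced along the following lines; this sketch (added here) assumes the job-sector data sit in a data frame called europe with countries as row names, which is an assumption about how the data are stored, not something given in the slides.
# Sketch: hierarchical clustering of the European job-sector data with
# Manhattan distance and Ward linkage. Assumes a data frame `europe` whose
# rows are countries and whose columns are the nine job sectors.
d  <- dist(europe, method = "manhattan")
hc <- hclust(d, method = "ward.D")      # "ward" in older versions of R
plot(hc, main = "Cluster Dendrogram")
cutree(hc, k = 3)                        # a 3-cluster solution; k is illustrative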
  • 114. Extracting Patterns of Voting in America Example 2: Percentages of votes given to the U. S. Republican presidential candidate, 1856-1976.
             X1856  X1860  X1864  X1868  X1900  X1904  X1908
Alabama        NA     NA     NA   51.44  34.67  20.65  24.38
Arkansas       NA     NA     NA   53.73  35.04  40.25  37.31
California   18.77  32.96  58.63  50.24  54.48  61.90  55.46
Colorado       NA     NA     NA     NA   42.04  55.27  46.88
Connecticut  53.18  53.86  51.38  51.54  56.94  58.13  59.43
Delaware      2.11  23.71  48.20  40.98  53.65  54.04  52.09
Florida        NA     NA     NA     NA   19.03  21.15  21.58
1 Can the states be grouped into clusters of republican-ness? 2 How do missing values influence the clustering? Analogy: Again, clustering in such an example can be thought of as unsupervised classification (pattern recognition). DZ Ý Data Science MMW 2018 October 10, 2018 114 / 127
  • 115. Example: Image Denoising For an observed image of size r × c, posit the model y = Wx + z. (20) The original image is represented by a p × 1 vector, which makes the matrix W a matrix of dimension q × p, where q = rc. We therefore have z⊤ = (z1, · · · , zq) ∈ IRq , x⊤ = (x1, · · · , xp) ∈ IRp , y⊤ = (y1, · · · , yq) ∈ IRq . DZ Ý Data Science MMW 2018 October 10, 2018 115 / 127
  • 116. Example: Image Denoising Expression of the solution: If E(x) = ‖y − Wx‖² + λ‖x‖₁ is our objective function to be minimized, and x̂ is a point at which the minimum is achieved, then we will write x̂ = arg min_{x∈IRp} { ‖y − Wx‖² + λ‖x‖₁ }. (21) DZ Ý Data Science MMW 2018 October 10, 2018 116 / 127
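  Problems of the form (21) are ℓ1-penalized least squares (lasso) problems, and one way to solve them in R is with the glmnet package listed later in these slides. The sketch below is an addition: W, the sparse signal x, and the value of λ are illustrative, and note that glmnet scales the squared-error term by 1/(2q), so its lambda is a rescaled version of the λ in (21).
# Sketch: solving an l1-penalized least squares problem of the form (21) with glmnet.
library(glmnet)
set.seed(9)
q <- 200; p <- 100
W <- matrix(rnorm(q * p), q, p)
x <- c(rep(2, 5), rep(0, p - 5))                    # sparse signal with 5 active entries
y <- W %*% x + rnorm(q, sd = 0.5)
fit  <- glmnet(W, y, alpha = 1, intercept = FALSE)  # lasso path over a grid of lambdas
xhat <- coef(fit, s = 0.1)[-1]                      # estimate at one lambda value
which(abs(xhat) > 1e-6)                             # indices of recovered coefficients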
  • 117. Example: Recommender System Consider a system in which n customers have access to p different products, like movies, clothing, rental cars, etc ... A1 A2 · · · Aj · · · Ap C1 C2 ... Ci w(i, j) ... Cn Table: Typical Representation of a Recommender System The value of w(i, j) is the rating assigned to article Aj by customer Ci. DZ Ý Data Science MMW 2018 October 10, 2018 117 / 127
  • 118. Example: Recommender System The main ingredient in recommender systems is the matrix
W = [ w11 w12 · · · w1j · · · w1p
      w21 w22 · · · w2j · · · w2p
      ...
      wi1 wi2 · · · wij · · · wip
      ...
      wn1 wn2 · · · wnj · · · wnp ]
The matrix W is typically very (and I mean very) sparse, which makes sense because people can only consume so many articles, and there are articles some people will never consume even if suggested. DZ Ý Data Science MMW 2018 October 10, 2018 118 / 127
  • 119. Time Series and State Space Models DZ Ý Data Science MMW 2018 October 10, 2018 119 / 127
  • 120. IID Process and White Noise [Figure: two simulated series of length 200; (Left) white noise process, (Right) IID process.] What is the statistical model (if any) underlying the data? DZ Ý Data Science MMW 2018 October 10, 2018 120 / 127
  • 121. Random Walk in 1d and 2d [Figure: (Left) a random walk in 1 dimension, (Right) a random walk in 2 dimensions (the plane).] What is the statistical model (if any) underlying the data? DZ Ý Data Science MMW 2018 October 10, 2018 121 / 127
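  Series like the ones in the last two figures are one-liners to simulate in R; the sketch below (an addition to the slides) generates a white noise series, a 1d random walk, and a 2d random walk of length 200.
# Sketch: simulating a white noise process and random walks in 1d and 2d.
set.seed(10)
n <- 200
w  <- rnorm(n)                                    # white noise / iid N(0,1) process
x1 <- cumsum(rnorm(n))                            # random walk in 1 dimension
x2 <- cumsum(rnorm(n)); y2 <- cumsum(rnorm(n))    # random walk in the plane
op <- par(mfrow = c(1, 3))
plot(ts(w),  main = "White noise")
plot(ts(x1), main = "Random walk (1d)")
plot(x2, y2, type = "l", main = "Random walk (2d)")
par(op)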
  • 122. Real life Time Series: Air Passengers and Sunspots [Figure: (Left) monthly numbers of airline passengers (AirPassengers), (Right) the longstanding sunspots data.] What is the statistical model (if any) underlying the data? DZ Ý Data Science MMW 2018 October 10, 2018 122 / 127
  • 123. Existing Computing Tools Do the following
install.packages('ctv')
library(ctv)
install.views('MachineLearning')
install.views('HighPerformanceComputing')
install.views('TimeSeries')
install.views('Bayesian')
R packages for big data
library(biglm)
library(foreach)
library(glmnet)
library(kernlab)
library(randomForest)
library(ada)
library(audio)
library(rpart)
DZ Ý Data Science MMW 2018 October 10, 2018 123 / 127
  • 124. Some Remarks and Recommendations Applications: Sharpen your intuition and your commonsense by questioning things, reading about interesting open applied problems, and attempting to solve as many problems as possible. Methodology: Read and learn about the fundamentals of statistical estimation and inference, get acquainted with the most commonly used methods and techniques, and consistently ask yourself and others what the natural extensions of the techniques could be. Computation: Learn and master at least two programming languages. I strongly recommend getting acquainted with R http://www.r-project.org Theory: "Nothing is more practical than a good theory" (Vladimir N. Vapnik). When it comes to data mining and machine learning and predictive analytics, those who truly understand the inner workings of algorithms and methods always solve problems better. DZ Ý Data Science MMW 2018 October 10, 2018 124 / 127
  • 125. Machine Learning CRAN Task View in R Let’s visit the website where most of the R community goes http://www.r-project.org Let’s install some packages and get started install.packages(’ctv’) library(ctv) install.views(’MachineLearning’) install.views(’HighPerformanceComputing’) install.views(’Bayesian’) install.views(’Robust’) Let’s load a couple of packages and explore library(e1071) library(MASS) library(kernlab) DZ Ý Data Science MMW 2018 October 10, 2018 125 / 127
  • 126. Clarke, B. and Fokoué, E. and Zhang, H. (2009). Principles and Theory for Data Mining and Machine Learning. Springer Verlag, New York, (ISBN: 978-0-387-98134-5), (2009) DZ Ý Data Science MMW 2018 October 10, 2018 126 / 127
  • 127. References Clarke, B., Fokoué, E. and Zhang, H. H. (2009). Principles and Theory for Data Mining and Machine Learning. Springer Verlag, New York, (ISBN: 978-0-387-98134-5), (2009) James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer, New York, (e-ISBN: 978-1-4614-7138-7), (2013) Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, ISBN: 978-0-471-03003-4, (1998) Vapnik, V. N. (2000). The Nature of Statistical Learning Theory. Springer, ISBN 978-1-4757-3264-1, (2000) Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer, ISBN 978-0-387-84858-7 DZ Ý Data Science MMW 2018 October 10, 2018 127 / 127