This document provides an overview of machine learning techniques for classification and regression, including decision trees, linear models, and support vector machines. It discusses key concepts such as overfitting, regularization, and model selection. For decision trees, it explains how they work by recursive binary splitting of the feature space, common splitting criteria such as entropy and Gini impurity, and how trees are built using a greedy optimization approach. Linear models such as logistic regression and support vector machines are covered, along with techniques like kernels, regularization, and stochastic optimization. The importance of testing on a holdout set to avoid overfitting is emphasized.
3. Bayes optimal classifier
Given the exact density functions of each class, we can build an optimal classifier.
We need to estimate the ratio of likelihoods:

$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
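A minimal sketch of this rule on two assumed 1-D Gaussian classes (the priors, means, and function name are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: two 1-D Gaussian classes with known densities and priors.
prior_1, prior_0 = 0.3, 0.7
density_1 = norm(loc=2.0, scale=1.0)   # p(x | y = 1)
density_0 = norm(loc=0.0, scale=1.0)   # p(x | y = 0)

def bayes_optimal_predict(x):
    """Predict 1 where the posterior odds p(y=1|x) / p(y=0|x) exceed 1."""
    odds = (prior_1 / prior_0) * (density_1.pdf(x) / density_0.pdf(x))
    return (odds > 1).astype(int)

print(bayes_optimal_predict(np.array([-1.0, 1.0, 3.0])))  # -> [0 0 1]
```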
6. QDA (Quadratic discriminant analysis)
QDA follows the generative approach. The main assumption is that the distribution of events within each class is multivariate Gaussian.
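A quick generative-approach sketch with scikit-learn's QDA on synthetic Gaussian classes (the data shapes and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two Gaussian classes with different covariances, exactly as QDA assumes.
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 200),
               rng.multivariate_normal([2, 2], [[1, -0.3], [-0.3, 2]], 200)])
y = np.array([0] * 200 + [1] * 200)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba(X[:3]))  # class posteriors p(y | x)
```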
8. Logistic regression
Smooth rule:

$$d(x) = \langle w, x \rangle + w_0$$

$$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$

Optimizing the weights $w, w_0$ to maximize the log-likelihood:

$$\mathcal{L} = -\sum_{i \in \text{events}} \ln p_{y_i}(x_i) = \sum_i L(x_i, y_i) \to \min$$
9. Logistic loss
The term loss refers to what we are minimizing. Losses typically estimate our risks, denoted $\mathcal{L}$.

LogLoss penalty for a single observation:

$$\mathcal{L} = -\sum_{i \in \text{events}} \ln p_{y_i}(x_i) = \sum_i L(x_i, y_i) \to \min$$

$$L(x_i, y_i) = -\ln p_{y_i}(x_i) =
\begin{cases}
\ln(1 + e^{-d(x_i)}), & y_i = +1 \\
\ln(1 + e^{+d(x_i)}), & y_i = -1
\end{cases}
= \ln(1 + e^{-y_i d(x_i)})$$

The margin $y_i d(x_i)$ is expected to be high for all events.
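A small numpy sketch of this loss (the names are ours; `decision` is the linear rule $d(x)$ defined above):

```python
import numpy as np

def decision(X, w, w0):
    """d(x) = <w, x> + w0 for a batch of events."""
    return X @ w + w0

def log_loss(X, y, w, w0):
    """Mean logistic loss; y takes values in {-1, +1}."""
    margins = y * decision(X, w, w0)
    # np.logaddexp(0, -m) computes ln(1 + exp(-m)) without overflow.
    return np.mean(np.logaddexp(0.0, -margins))
```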
10. Logistic loss
$L(x_i, y_i)$ is a convex function. Simple analysis shows that $\mathcal{L}$ is a sum of convex functions w.r.t. $w$, so the optimization problem has at most one optimum.

Comment: MLE is not guaranteed to be a good choice.
12. Gradient descent
Problem: find $w$ to minimize $\mathcal{L}$.

Gradient descent:

$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$

$\eta$ is the step size (also called shrinkage, learning rate).
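A minimal gradient-descent sketch for the logistic loss above (assumptions: labels y in {-1, +1}, the bias folded into w via a constant feature, illustrative defaults for eta):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(X, y, eta=0.1, n_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        # Gradient of mean ln(1 + exp(-y <w, x>)) w.r.t. w.
        grad = -(X.T @ (y * sigmoid(-margins))) / len(y)
        w -= eta * grad
    return w
```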
14. Stochastic gradient descent (SGD)
On each iteration make a step using only one event:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

take $i$ — a random event from the training data

$$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$$

Each iteration is done much faster, but the training process is less stable; the remedy is to make smaller steps.
15. Stochastic gradient descent
We can decrease the learning rate over time: $\eta_t$.

At iteration $t$:

$$w \leftarrow w - \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$$

This process converges to a local minimum if:

$$\sum_t \eta_t = \infty, \qquad \sum_t \eta_t^2 < \infty, \qquad \eta_t > 0$$
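An SGD sketch with a decaying learning rate; the schedule eta_t = eta0 / (1 + t) is an illustrative choice satisfying the conditions above:

```python
import numpy as np

def fit_sgd(X, y, eta0=1.0, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(n_iter):
        i = rng.integers(len(y))                      # random event
        margin = y[i] * (X[i] @ w)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # grad of ln(1 + e^{-m})
        w -= eta0 / (1.0 + t) * grad
    return w
```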
16. SGD with momentum
SGD (and GD) has problems with narrow valleys (when the Hessian is very far from identity).

Improvement: use momentum, which accumulates the gradient, with $0.9 < \gamma < 1$:

$$v \leftarrow \gamma v + \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$$

$$w \leftarrow w - v$$
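The same loop with this momentum update (a sketch; hyperparameter values are illustrative):

```python
import numpy as np

def fit_sgd_momentum(X, y, eta=0.05, gamma=0.9, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)   # accumulated gradient
    for _ in range(n_iter):
        i = rng.integers(len(y))
        grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
        v = gamma * v + eta * grad   # v <- gamma v + eta_t dL/dw
        w -= v                       # w <- w - v
    return w
```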
18. Stochastic optimization methods
applied to additive loss functions:

$$\mathcal{L} = \sum_i L(x_i, y_i)$$

should be preferred when optimization time is the bottleneck
more advanced modifications exist: AdaDelta, RMSProp, Adam
these use an adaptive step size (individually for each parameter), which is crucial when the scale of gradients is very different
in practice updates are computed on minibatches (small groups of 16 to 256 samples), not on an event-by-event basis
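For reference, a compact sketch of the Adam update (standard update rules with the usual default hyperparameters; the function name is ours):

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m, v are running moment estimates, t counts from 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v
```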
21. Polynomial decision rule
$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$

is again a linear model; introduce an extended set of features:

$$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$$

$$d(x) = \sum_i w_i z_i = \langle w, z \rangle$$

and reuse logistic regression.

We can add $x_0 = 1$ as one more variable to the dataset and forget about the $w_0$ term:

$$d(x) = \langle w, x \rangle$$
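The same idea as a scikit-learn sketch: expand the features to all monomials up to degree 2, then reuse plain logistic regression (training data names are assumed):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(
    PolynomialFeatures(degree=2),       # z = {1} U {x_i} U {x_i x_j}
    LogisticRegression(max_iter=1000),
)
# model.fit(X_train, y_train); model.predict(X_test)
```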
22. Polynomial regression
is done in the same way. E.g. to fit a polynomial of one variate, we construct for each event a vector

$$\tilde{x} = (1, x, x^2, x^3, \ldots, x^d)$$

and train a linear regression:

$$d(\tilde{x}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d$$
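A quick numpy sketch on synthetic data (numpy builds the $(1, x, \ldots, x^d)$ features internally):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1 + 2 * x - 3 * x ** 3 + 0.1 * rng.standard_normal(50)

coeffs = np.polyfit(x, y, deg=3)   # highest-degree coefficient first
print(np.polyval(coeffs, 0.5))     # prediction at x = 0.5
```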
23. Projecting into the space of higher dimension
(Figure: SVM with polynomial kernel visualization.)
24. Logistic regression overview
classifier based on linear decision rule
training is reduced to convex optimization
stochastic optimization can be used
can handle > 1000 features, but requires regularization (see later)
no interaction between features
other decision rules are achieved by adding new features
25. Support Vector Machine [Vapnik, Chervonenkis, 1963]
SVM selects a decision rule with maximal possible margin (rule A).
26. Hinge loss function
SVM uses a different loss function:

$$L_{\text{hinge}}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$$

Margin $y_i d(x_i) > 1 \;\to\;$ no penalty
(only signal losses are compared on the plot)
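Hinge loss in numpy (a two-line sketch; `d` holds decision-function values, y in {-1, +1}):

```python
import numpy as np

def hinge_loss(d, y):
    return np.maximum(0.0, 1.0 - y * d)

print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))
# -> [0.  0.5 2. ]  (events with margin > 1 get no penalty)
```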
27. Kernel trick
$P$ is a projection operator (which "adds new features"):

$$d(x) = \langle w, x \rangle \;\to\; d(x) = \langle w, P(x) \rangle_{\text{new}}$$

Assume that the optimal $w = \sum_i \alpha_i P(x_i)$ (a combination of support vectors) and look for $\alpha_i$:

$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle_{\text{new}} = \sum_i \alpha_i K(x_i, x)$$

We need only the kernel, not the projection operator:

$$K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{\text{new}}$$
28. Kernel trick
Polynomial kernel:

$$K(x, \tilde{x}) = (1 + x^T \tilde{x})^d$$

The projection contains all monomials up to degree $d$.

A popular kernel is the Gaussian Radial Basis Function:

$$K(x, \tilde{x}) = e^{-c \|x - \tilde{x}\|^2}$$

It corresponds to a projection into an (infinite-dimensional) Hilbert space.
Exercise: find a corresponding projection.
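A kernel-SVM sketch with scikit-learn's SVC, where `gamma` plays the role of $c$ in the RBF formula above (data names are assumed):

```python
from sklearn.svm import SVC

# K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
# clf.fit(X_train, y_train); clf.predict(X_test)
```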
31. Overfitting
k-NN with k=1 gives ideal classification of the training data.
SVM with a small radius of the RBF kernel has the same property.
32. Overfitting
The same issue arises in regression: provided a high enough degree, the polynomial can go through any set of points and get zero error this way.
33. There are two definitions of overfitting, which often coincide.
Difference-overfitting (academic definition):
there is a significant difference in quality of predictions between train and holdout.

Complexity-overfitting (practitioners' definition):
the formula has too high complexity (e.g. too many parameters); increasing the number of parameters leads to lower quality.
35. Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which give us the ability to predict.
The best model is the model which gives better predictions for new observations.
The simplest way to control this is to check quality on a holdout — a sample not used during training (cross-validation). This gives an unbiased estimate of quality for new data, but:
estimates have variance
multiple testing introduces bias (solution: train + validation + test, as on Kaggle)
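A minimal holdout sketch (synthetic data; the split sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)   # holdout never used in training

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # quality on new data
```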
37. Difference-overfitting is inessential, provided that we measure quality on a holdout sample (though it is easy to check and sometimes helpful).
Complexity-overfitting is a problem — we need to test different parameters for optimality (more examples throughout the course).
Don't use distribution comparison to detect overfitting.
39. Reminder: linear regression
We can use a linear function for regression:

$$d(x) = \langle w, x \rangle$$

Minimize MSE:

$$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$$

Explicit solution:

$$\Big(\sum_i x_i x_i^T\Big) w = \sum_i y_i x_i$$
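Solving these normal equations with numpy (synthetic data; `np.linalg.lstsq` would be the numerically safer choice in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

A = X.T @ X               # sum_i x_i x_i^T
b = X.T @ y               # sum_i y_i x_i
w = np.linalg.solve(A, b)
print(w)                  # close to [1, -2, 0.5]
```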
40. Regularization: motivation
When the number of parameters is high (compared to the number of observations):
hard to estimate all parameters reliably
linear regression with MSE: in $d$-dimensional space you can find a hyperplane through any $d$ points
non-unique solution if $n < d$: the matrix $\sum_i x_i x_i^T$ degenerates
Solution 1: manually decrease the dimensionality of the problem
Solution 2: use regularization
41. Regularization
When the number of parameters in the model is high, overfitting is very probable.
Solution: add a regularization term to the loss function:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$$

$L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$

$L_1$ regularization: $\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$

$L_1 + L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
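The three penalties above correspond to ready-made scikit-learn regressors (`alpha`/`l1_ratio` are sklearn's names for the trade-off constants):

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge

ridge = Ridge(alpha=1.0)                     # L2 penalty
lasso = Lasso(alpha=0.1)                     # L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # L1 + L2 mix
# each is used as: model.fit(X_train, y_train); model.predict(X_test)
```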
42. $L_2$, $L_1$ regularizations
(Figure: dependence of the parameters (components of $w$) on the regularization strength, stronger regularization to the left; $L_2$ regularization solid, $L_1 + L_2$ dashed.)
45. $L_p$ regularizations

$$\mathcal{L}_p = \sum_i |w_i|^p$$

What is the expression for $L_0$?

$$\mathcal{L}_0 = \sum_i [w_i \neq 0]$$

But nobody uses $L_p$ with $0 < p < 1$, not even $L_0$. Why?
Because it is not convex.
46. Regularization summary
an important tool to fight overfitting (= poor generalization on new data)
different modifications exist for other models
makes it possible to handle really many features
machine learning should detect important features itself
from the mathematical point of view: turns a convex problem into a strongly convex one (NB: only for linear models)
from the practical point of view: softly limits the space of parameters
breaks the scale-invariance of linear models
47. SVM and regularization
The width of the margin is $\frac{1}{\|w\|}$, so the SVM loss is actually:

$$\mathcal{L} = \frac{1}{2} \|w\|^2 + C \sum_i L_{\text{hinge}}(x_i, y_i)$$

the first term maximizes the margin
the second term penalizes samples that are not on the correct side of the margin
$C$ controls the trade-off
48. Linear models summary
linear decision function at the core
training is reduced to optimization problems
losses are additive:

$$\mathcal{L} = \sum_i L(x_i, y_i)$$

stochastic optimizations are applicable
can support nonlinear decisions w.r.t. original features by using kernels
apply regularizations to avoid bad situations and overfitting
55. Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
building a tree using greedy optimization:
start from the root (a tree with only one leaf)
each time split one leaf into two
repeat the process for the children if needed
need a criterion to select the best split (feature and threshold)
56. Splitting criterion
Several impurity functions:

$$\text{TreeImpurity} = \sum_{\text{leaf}} \text{impurity(leaf)} \times \text{size(leaf)}$$

Misclassification: $\min(p, 1 - p)$
Gini: $p(1 - p)$
Entropy: $-p \log p - (1 - p) \log(1 - p)$

where $p$ is the portion of signal events in a leaf, $1 - p$ is the portion of background events, and size(leaf) is the number of training events in the leaf.
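The three impurities in numpy, as a sketch for comparing their shapes (p is the signal fraction in a leaf):

```python
import numpy as np

def misclassification(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    # Convention: 0 * log(0) = 0 at the endpoints.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(gini(p))   # maximal at p = 0.5, zero for pure leaves
```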
59. Decision trees for regression
Greedy optimization (minimizing MSE):

$$\text{TreeMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

Can be rewritten as:

$$\text{TreeMSE} \sim \sum_{\text{leaf}} \text{MSE(leaf)} \times \text{size(leaf)}$$

MSE(leaf) is like an 'impurity' of the leaf:

$$\text{MSE(leaf)} = \frac{1}{\text{size(leaf)}} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$
68. Pre-stopping
We can stop the process of splitting by imposing different restrictions:
limit the depth of the tree
set the minimal number of samples needed to split a leaf
limit the minimal number of samples in a leaf
more advanced: limit the maximal number of leaves in the tree
Any combination of the rules above is possible (see the sketch below).
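The same four restrictions map directly onto scikit-learn's DecisionTreeClassifier parameters (the values here are arbitrary examples):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,            # limit the depth of the tree
    min_samples_split=20,   # minimal samples needed to split a leaf
    min_samples_leaf=10,    # minimal samples in a leaf
    max_leaf_nodes=32,      # maximal number of leaves
)
# tree.fit(X_train, y_train)  # assumed data
```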
73. Decision tree overview
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclass classification
But
1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Unstable
4. Uses only trivial conditions
74. Missing values in decision trees
If the event being predicted lacks feature $x_1$, we use prior probabilities.
75. Feature importances
Different approaches exist to measure the importance of a feature in the final model.
Importance of a feature ≠ quality provided by one feature.
79. Feature importances
tree: counting the number of splits made over this feature
tree: counting the gain in purity (e.g. Gini)
fast and adequate
model-agnostic recipe: train without one feature, compare quality on test with/without that feature
requires many evaluations
model-agnostic recipe: feature shuffling
take one column in the test dataset and shuffle it; compare quality with/without shuffling (see the sketch below)
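A hand-rolled feature-shuffling sketch; it works with any fitted scikit-learn-style model exposing a `.score` method (the names are ours):

```python
import numpy as np

def shuffling_importance(model, X_test, y_test, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(X_test, y_test)
    importances = []
    for j in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        rng.shuffle(X_shuffled[:, j])     # destroy feature j only
        importances.append(baseline - model.score(X_shuffled, y_test))
    return np.array(importances)          # big drop = important feature
```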
83. Weighted voting
A way to introduce the importance of classifiers:

$$D(x) = \sum_j \alpha_j d_j(x)$$

General case of ensembling:

$$D(x) = f(d_1(x), d_2(x), \ldots, d_J(x))$$
84. Problems
base classifiers are very close to each other
need to keep variation between them
and still have good quality of the base classifiers
86. Generating training subset
subsampling: taking a fixed part of the samples (sampling without replacement)
bagging (Bootstrap AGGregating): sampling with replacement.
If #generated samples = length of the dataset, the fraction of unique samples in the new dataset is $1 - \frac{1}{e} \approx 63.2\%$ (checked numerically below).
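A small simulation for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap_idx = rng.integers(0, n, size=n)   # sampling with replacement
print(len(np.unique(bootstrap_idx)) / n)     # ~0.632
print(1 - 1 / np.e)                          # 0.6321...
```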
87. Random subspace model (RSM)
Generate a subspace of features by taking a random subset of the features.
88. Random Forest [Leo Breiman, 2001]
Random forest is a composition of decision trees.
Each individual tree is trained on a subset of the training data obtained by:
bagging samples
taking a random subset of features
Predictions of the random forest are obtained via simple voting (see the sketch below).
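A minimal random-forest sketch with scikit-learn; note that sklearn draws the random feature subset per split rather than per tree, and `max_features="sqrt"` matches the recommendation given later:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,      # trees, trained independently (parallelizable)
    max_features="sqrt",   # random subset of features at each split
    bootstrap=True,        # bagging of samples
    random_state=0,
).fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the trees
```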
94. Overfitting
Random forest is overfitted (in the sense that predictions for train and test are different),
but it doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier.
96. Works with features of different nature
Stable to noise in the data.
From 'Testing 179 Classifiers on 121 Datasets':
"The classifiers most likely to be the bests are the random forest (RF) versions, the best of which [...] achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets."
99. Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter: the number of features used in each tree
Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$
Hardly interpretable
Trained trees take much space, some kind of pre-stopping is required in practice
Doesn't fix mistakes made by previous trees