This document provides an overview of machine learning techniques for classification and regression, including decision trees, linear models, and support vector machines. It discusses key concepts such as overfitting, regularization, and model selection. For decision trees, it explains how they work by recursive binary splitting of the feature space, common splitting criteria such as entropy and Gini impurity, and how trees are built using a greedy optimization approach. Linear models such as logistic regression and support vector machines are covered, along with techniques like kernels, regularization, and stochastic optimization. The importance of testing on a holdout set to avoid overfitting is emphasized.
3. Bayes optimal classifier
Given the exact density functions of each class, we can build an optimal classifier.
We need to estimate the ratio of likelihoods:

$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
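A minimal sketch of this rule on two assumed 1-D Gaussian classes (the priors, means, and function name are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: two 1-D Gaussian classes with known densities and priors.
prior_1, prior_0 = 0.3, 0.7
density_1 = norm(loc=2.0, scale=1.0)   # p(x | y = 1)
density_0 = norm(loc=0.0, scale=1.0)   # p(x | y = 0)

def bayes_optimal_predict(x):
    """Predict 1 where the posterior odds p(y=1|x) / p(y=0|x) exceed 1."""
    odds = (prior_1 / prior_0) * (density_1.pdf(x) / density_0.pdf(x))
    return (odds > 1).astype(int)

print(bayes_optimal_predict(np.array([-1.0, 1.0, 3.0])))  # -> [0 0 1]
```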
6. QDA (Quadratic discriminant analysis)
QDA follows the generative approach. The main assumption is that the distribution of events within each class is multivariate Gaussian.
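A quick generative-approach sketch with scikit-learn's QDA on synthetic Gaussian classes (the data shapes and parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two Gaussian classes with different covariances, exactly as QDA assumes.
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 200),
               rng.multivariate_normal([2, 2], [[1, -0.3], [-0.3, 2]], 200)])
y = np.array([0] * 200 + [1] * 200)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba(X[:3]))  # class posteriors p(y | x)
```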
8. Logistic regression
Smooth rule:

$$d(x) = \langle w, x \rangle + w_0$$

$$p_{+1}(x) = \sigma(d(x)), \qquad p_{-1}(x) = \sigma(-d(x))$$

Optimizing the weights $w, w_0$ to maximize the log-likelihood:

$$\mathcal{L} = -\sum_{i \in \text{events}} \ln p_{y_i}(x_i) = \sum_i L(x_i, y_i) \to \min$$
9. Logistic loss
The term loss refers to what we are minimizing. Losses typically estimate our risks, denoted $\mathcal{L}$.

LogLoss penalty for a single observation:

$$\mathcal{L} = -\sum_{i \in \text{events}} \ln p_{y_i}(x_i) = \sum_i L(x_i, y_i) \to \min$$

$$L(x_i, y_i) = -\ln p_{y_i}(x_i) =
\begin{cases}
\ln(1 + e^{-d(x_i)}), & y_i = +1 \\
\ln(1 + e^{+d(x_i)}), & y_i = -1
\end{cases}
= \ln(1 + e^{-y_i d(x_i)})$$

The margin $y_i d(x_i)$ is expected to be high for all events.
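A small numpy sketch of this loss (the names are ours; `decision` is the linear rule $d(x)$ defined above):

```python
import numpy as np

def decision(X, w, w0):
    """d(x) = <w, x> + w0 for a batch of events."""
    return X @ w + w0

def log_loss(X, y, w, w0):
    """Mean logistic loss; y takes values in {-1, +1}."""
    margins = y * decision(X, w, w0)
    # np.logaddexp(0, -m) computes ln(1 + exp(-m)) without overflow.
    return np.mean(np.logaddexp(0.0, -margins))
```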
10. Logistic loss
$L(x_i, y_i)$ is a convex function. Simple analysis shows that $\mathcal{L}$ is a sum of convex functions w.r.t. $w$, so the optimization problem has at most one optimum.

Comment: MLE is not guaranteed to be a good choice.
12. Gradient descent
Problem: find $w$ to minimize $\mathcal{L}$.

Gradient descent:

$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$

$\eta$ is the step size (also called shrinkage, learning rate).
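A minimal gradient-descent sketch for the logistic loss above (assumptions: labels y in {-1, +1}, the bias folded into w via a constant feature, illustrative defaults for eta):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(X, y, eta=0.1, n_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        margins = y * (X @ w)
        # Gradient of mean ln(1 + exp(-y <w, x>)) w.r.t. w.
        grad = -(X.T @ (y * sigmoid(-margins))) / len(y)
        w -= eta * grad
    return w
```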
14. Stochastic gradient descent (SGD)
On each iteration make a step using only one event:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

take $i$ — a random event from the training data

$$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$$

Each iteration is done much faster, but the training process is less stable; the remedy is to make smaller steps.
15. Stochastic gradient descent
We can decrease the learning rate over time: $\eta_t$.

At iteration $t$:

$$w \leftarrow w - \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$$

This process converges to a local minimum if:

$$\sum_t \eta_t = \infty, \qquad \sum_t \eta_t^2 < \infty, \qquad \eta_t > 0$$
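An SGD sketch with a decaying learning rate; the schedule eta_t = eta0 / (1 + t) is an illustrative choice satisfying the conditions above:

```python
import numpy as np

def fit_sgd(X, y, eta0=1.0, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(n_iter):
        i = rng.integers(len(y))                      # random event
        margin = y[i] * (X[i] @ w)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # grad of ln(1 + e^{-m})
        w -= eta0 / (1.0 + t) * grad
    return w
```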
16. SGD with momentum
SGD (and GD) has problems with narrow valleys (when the Hessian is very far from identity).

Improvement: use momentum, which accumulates the gradient, with $0.9 < \gamma < 1$:

$$v \leftarrow \gamma v + \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$$

$$w \leftarrow w - v$$
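The same loop with this momentum update (a sketch; hyperparameter values are illustrative):

```python
import numpy as np

def fit_sgd_momentum(X, y, eta=0.05, gamma=0.9, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)   # accumulated gradient
    for _ in range(n_iter):
        i = rng.integers(len(y))
        grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
        v = gamma * v + eta * grad   # v <- gamma v + eta_t dL/dw
        w -= v                       # w <- w - v
    return w
```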
18. Stochastic optimization methods
applied to additive loss functions:

$$\mathcal{L} = \sum_i L(x_i, y_i)$$

should be preferred when optimization time is the bottleneck
more advanced modifications exist: AdaDelta, RMSProp, Adam
these use an adaptive step size (individually for each parameter), which is crucial when the scale of gradients is very different
in practice updates are computed on minibatches (small groups of 16 to 256 samples), not on an event-by-event basis
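For reference, a compact sketch of the Adam update (standard update rules with the usual default hyperparameters; the function name is ours):

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m, v are running moment estimates, t counts from 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v
```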
21. Polynomial decision rule
$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$

is again a linear model; introduce an extended set of features:

$$z = \{1\} \cup \{x_i\}_i \cup \{x_i x_j\}_{ij}$$

$$d(x) = \sum_i w_i z_i = \langle w, z \rangle$$

and reuse logistic regression.

We can add $x_0 = 1$ as one more variable to the dataset and forget about the $w_0$ term:

$$d(x) = \langle w, x \rangle$$
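The same idea as a scikit-learn sketch: expand the features to all monomials up to degree 2, then reuse plain logistic regression (training data names are assumed):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(
    PolynomialFeatures(degree=2),       # z = {1} U {x_i} U {x_i x_j}
    LogisticRegression(max_iter=1000),
)
# model.fit(X_train, y_train); model.predict(X_test)
```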
22. Polynomial regression
is done in the same way. E.g. to fit a polynomial of one variate, we construct for each event a vector

$$\tilde{x} = (1, x, x^2, x^3, \ldots, x^d)$$

and train a linear regression:

$$d(\tilde{x}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d$$
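A quick numpy sketch on synthetic data (numpy builds the $(1, x, \ldots, x^d)$ features internally):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1 + 2 * x - 3 * x ** 3 + 0.1 * rng.standard_normal(50)

coeffs = np.polyfit(x, y, deg=3)   # highest-degree coefficient first
print(np.polyval(coeffs, 0.5))     # prediction at x = 0.5
```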
23. Projecting into the space of higher dimension
(Figure: SVM with polynomial kernel visualization.)
24. Logistic regression overview
classifier based on linear decision rule
training is reduced to convex optimization
stochastic optimization can be used
can handle > 1000 features, but requires regularization (see later)
no interaction between features
other decision rules are achieved by adding new features
25. Support Vector Machine [Vapnik, Chervonenkis, 1963]
SVM selects a decision rule with maximal possible margin (rule A).
26. Hinge loss function
SVM uses a different loss function:

$$L_{\text{hinge}}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$$

Margin $y_i d(x_i) > 1 \;\to\;$ no penalty
(only signal losses are compared on the plot)
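Hinge loss in numpy (a two-line sketch; `d` holds decision-function values, y in {-1, +1}):

```python
import numpy as np

def hinge_loss(d, y):
    return np.maximum(0.0, 1.0 - y * d)

print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))
# -> [0.  0.5 2. ]  (events with margin > 1 get no penalty)
```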
27. Kernel trick
$P$ is a projection operator (which "adds new features"):

$$d(x) = \langle w, x \rangle \;\to\; d(x) = \langle w, P(x) \rangle_{\text{new}}$$

Assume that the optimal $w = \sum_i \alpha_i P(x_i)$ (a combination of support vectors) and look for $\alpha_i$:

$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle_{\text{new}} = \sum_i \alpha_i K(x_i, x)$$

We need only the kernel, not the projection operator:

$$K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{\text{new}}$$
28. Kernel trick
Polynomial kernel:

$$K(x, \tilde{x}) = (1 + x^T \tilde{x})^d$$

The projection contains all monomials up to degree $d$.

A popular kernel is the Gaussian Radial Basis Function:

$$K(x, \tilde{x}) = e^{-c \|x - \tilde{x}\|^2}$$

It corresponds to a projection into an (infinite-dimensional) Hilbert space.
Exercise: find a corresponding projection.
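A kernel-SVM sketch with scikit-learn's SVC, where `gamma` plays the role of $c$ in the RBF formula above (data names are assumed):

```python
from sklearn.svm import SVC

# K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
# clf.fit(X_train, y_train); clf.predict(X_test)
```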
31. Overfitting
k-NN with k=1 gives ideal classification of the training data.
SVM with a small radius of the RBF kernel has the same property.
32. Overfitting
The same issue arises in regression: provided a high enough degree, the polynomial can go through any set of points and get zero error this way.
33. There are two definitions of overfitting, which often coincide.
Difference-overfitting (academic definition):
there is a significant difference in quality of predictions between train and holdout.

Complexity-overfitting (practitioners' definition):
the formula has too high complexity (e.g. too many parameters); increasing the number of parameters leads to lower quality.
35. Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which give us the ability to predict.
The best model is the model which gives better predictions for new observations.
The simplest way to control this is to check quality on a holdout — a sample not used during training (cross-validation). This gives an unbiased estimate of quality for new data, but:
estimates have variance
multiple testing introduces bias (solution: train + validation + test, as on Kaggle)
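A minimal holdout sketch (synthetic data; the split sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)   # holdout never used in training

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # quality on new data
```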
37. Difference-overfitting is inessential, provided that we measure quality on a holdout sample (though it is easy to check and sometimes helpful).
Complexity-overfitting is a problem — we need to test different parameters for optimality (more examples throughout the course).
Don't use distribution comparison to detect overfitting.
39. Reminder: linear regression
We can use a linear function for regression:

$$d(x) = \langle w, x \rangle$$

Minimize MSE:

$$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$$

Explicit solution:

$$\Big(\sum_i x_i x_i^T\Big) w = \sum_i y_i x_i$$
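Solving these normal equations with numpy (synthetic data; `np.linalg.lstsq` would be the numerically safer choice in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

A = X.T @ X               # sum_i x_i x_i^T
b = X.T @ y               # sum_i y_i x_i
w = np.linalg.solve(A, b)
print(w)                  # close to [1, -2, 0.5]
```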
40. Regularization: motivation
When the number of parameters is high (compared to the number of observations):
hard to estimate all parameters reliably
linear regression with MSE: in $d$-dimensional space you can find a hyperplane through any $d$ points
non-unique solution if $n < d$: the matrix $\sum_i x_i x_i^T$ degenerates
Solution 1: manually decrease the dimensionality of the problem
Solution 2: use regularization
41. Regularization
When the number of parameters in the model is high, overfitting is very probable.
Solution: add a regularization term to the loss function:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$$

$L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$

$L_1$ regularization: $\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$

$L_1 + L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
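The three penalties above correspond to ready-made scikit-learn regressors (`alpha`/`l1_ratio` are sklearn's names for the trade-off constants):

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge

ridge = Ridge(alpha=1.0)                     # L2 penalty
lasso = Lasso(alpha=0.1)                     # L1 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # L1 + L2 mix
# each is used as: model.fit(X_train, y_train); model.predict(X_test)
```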
42. $L_2$, $L_1$ regularizations
(Figure: dependence of the parameters (components of $w$) on the regularization strength, stronger regularization to the left; $L_2$ regularization solid, $L_1 + L_2$ dashed.)
45. $L_p$ regularizations

$$\mathcal{L}_p = \sum_i |w_i|^p$$

What is the expression for $L_0$?

$$\mathcal{L}_0 = \sum_i [w_i \neq 0]$$

But nobody uses $L_p$ with $0 < p < 1$, not even $L_0$. Why?
Because it is not convex.
46. Regularization summary
an important tool to fight overfitting (= poor generalization on new data)
different modifications exist for other models
makes it possible to handle really many features
machine learning should detect important features itself
from the mathematical point of view: turns a convex problem into a strongly convex one (NB: only for linear models)
from the practical point of view: softly limits the space of parameters
breaks the scale-invariance of linear models
47. SVM and regularization
The width of the margin is $\frac{1}{\|w\|}$, so the SVM loss is actually:

$$\mathcal{L} = \frac{1}{2} \|w\|^2 + C \sum_i L_{\text{hinge}}(x_i, y_i)$$

the first term maximizes the margin
the second term penalizes samples that are not on the correct side of the margin
$C$ controls the trade-off
48. Linear models summary
linear decision function at the core
training is reduced to optimization problems
losses are additive:

$$\mathcal{L} = \sum_i L(x_i, y_i)$$

stochastic optimizations are applicable
can support nonlinear decisions w.r.t. original features by using kernels
apply regularizations to avoid bad situations and overfitting
55. Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
building a tree using greedy optimization:
start from the root (a tree with only one leaf)
each time split one leaf into two
repeat the process for the children if needed
need a criterion to select the best split (feature and threshold)
56. Splitting criterion
Several impurity functions:

$$\text{TreeImpurity} = \sum_{\text{leaf}} \text{impurity(leaf)} \times \text{size(leaf)}$$

Misclassification: $\min(p, 1 - p)$
Gini: $p(1 - p)$
Entropy: $-p \log p - (1 - p) \log(1 - p)$

where $p$ is the portion of signal events in a leaf, $1 - p$ is the portion of background events, and size(leaf) is the number of training events in the leaf.
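The three impurities in numpy, as a sketch for comparing their shapes (p is the signal fraction in a leaf):

```python
import numpy as np

def misclassification(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    # Convention: 0 * log(0) = 0 at the endpoints.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(gini(p))   # maximal at p = 0.5, zero for pure leaves
```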
59. Decision trees for regression
Greedy optimization (minimizing MSE):

$$\text{TreeMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

Can be rewritten as:

$$\text{TreeMSE} \sim \sum_{\text{leaf}} \text{MSE(leaf)} \times \text{size(leaf)}$$

MSE(leaf) is like an 'impurity' of the leaf:

$$\text{MSE(leaf)} = \frac{1}{\text{size(leaf)}} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$
68. Pre-stopping
We can stop the process of splitting by imposing different restrictions:
limit the depth of the tree
set the minimal number of samples needed to split a leaf
limit the minimal number of samples in a leaf
more advanced: limit the maximal number of leaves in the tree
Any combination of the rules above is possible (see the sketch below).
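The same four restrictions map directly onto scikit-learn's DecisionTreeClassifier parameters (the values here are arbitrary examples):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,            # limit the depth of the tree
    min_samples_split=20,   # minimal samples needed to split a leaf
    min_samples_leaf=10,    # minimal samples in a leaf
    max_leaf_nodes=32,      # maximal number of leaves
)
# tree.fit(X_train, y_train)  # assumed data
```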
73. Decision tree overview
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclass classification
But
1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Unstable
4. Uses only trivial conditions
74. Missing values in decision trees
If the event being predicted lacks feature $x_1$, we use prior probabilities.
75. Feature importances
Different approaches exist to measure the importance of a feature in the final model.
Importance of a feature ≠ quality provided by one feature.
79. Feature importances
tree: counting the number of splits made over this feature
tree: counting the gain in purity (e.g. Gini)
fast and adequate
model-agnostic recipe: train without one feature, compare quality on test with/without that feature
requires many evaluations
model-agnostic recipe: feature shuffling
take one column in the test dataset and shuffle it; compare quality with/without shuffling (see the sketch below)
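A hand-rolled feature-shuffling sketch; it works with any fitted scikit-learn-style model exposing a `.score` method (the names are ours):

```python
import numpy as np

def shuffling_importance(model, X_test, y_test, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(X_test, y_test)
    importances = []
    for j in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        rng.shuffle(X_shuffled[:, j])     # destroy feature j only
        importances.append(baseline - model.score(X_shuffled, y_test))
    return np.array(importances)          # big drop = important feature
```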
83. Weighted voting
A way to introduce the importance of classifiers:

$$D(x) = \sum_j \alpha_j d_j(x)$$

General case of ensembling:

$$D(x) = f(d_1(x), d_2(x), \ldots, d_J(x))$$
84. Problems
base classifiers are very close to each other
need to keep variation between them
and still have good quality of the base classifiers
86. Generating training subset
subsampling: taking a fixed part of the samples (sampling without replacement)
bagging (Bootstrap AGGregating): sampling with replacement.
If #generated samples = length of the dataset, the fraction of unique samples in the new dataset is $1 - \frac{1}{e} \approx 63.2\%$ (checked numerically below).
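A small simulation for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap_idx = rng.integers(0, n, size=n)   # sampling with replacement
print(len(np.unique(bootstrap_idx)) / n)     # ~0.632
print(1 - 1 / np.e)                          # 0.6321...
```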
87. Random subspace model (RSM)
Generate a subspace of features by taking a random subset of the features.
88. Random Forest [Leo Breiman, 2001]
Random forest is a composition of decision trees.
Each individual tree is trained on a subset of the training data obtained by:
bagging samples
taking a random subset of features
Predictions of the random forest are obtained via simple voting (see the sketch below).
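A minimal random-forest sketch with scikit-learn; note that sklearn draws the random feature subset per split rather than per tree, and `max_features="sqrt"` matches the recommendation given later:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,      # trees, trained independently (parallelizable)
    max_features="sqrt",   # random subset of features at each split
    bootstrap=True,        # bagging of samples
    random_state=0,
).fit(X, y)
print(forest.predict(X[:5]))   # majority vote over the trees
```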
94. Overfitting
Random forest is overfitted (in the sense that predictions for train and test are different),
but it doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier.
96. Works with features of different nature
Stable to noise in the data.
From 'Testing 179 Classifiers on 121 Datasets':
"The classifiers most likely to be the bests are the random forest (RF) versions, the best of which [...] achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets."
99. Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter: the number of features used in each tree
Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$
Hardly interpretable
Trained trees take much space, some kind of pre-stopping is required in practice
Doesn't fix mistakes made by previous trees