Machine Learning in High Energy Physics
Lectures 3 & 4
Alex Rogozhnikov
Lund, MLHEP 2016
1 / 99
Recapitulation
classification, regression
kNN classifier and regressor
ROC curve, ROC AUC
1 / 99
Bayes optimal classifier
Given exact distributions' density functions, we can build an optimal classifier
Need to estimate ratio of likelihoods.
$$\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = \frac{p(y = 1)}{p(y = 0)} \times \frac{p(x \mid y = 1)}{p(x \mid y = 0)}$$
2 / 99
Density estimation
Histograms
Kernel density estimation
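As a sketch of both estimators on toy data (the sample and grid are illustrative, not from the lecture; scipy's gaussian_kde picks its bandwidth by Scott's rule by default):

```python
import numpy as np
from scipy.stats import gaussian_kde

# toy 1D sample: mixture of two normal components
rng = np.random.RandomState(42)
sample = np.concatenate([rng.normal(-1, 0.5, 500), rng.normal(2, 1.0, 500)])

# histogram density estimate: counts normalized to integrate to 1
hist, edges = np.histogram(sample, bins=30, density=True)

# kernel density estimate: a Gaussian kernel placed on every point
kde = gaussian_kde(sample)
grid = np.linspace(-3, 5, 200)
density = kde(grid)   # smooth estimate of p(x) on the grid
```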
3 / 99
Parametric density estimation
single Gaussian distribution
Gaussian mixtures
EM algorithm
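A sketch of an EM-fitted Gaussian mixture with scikit-learn (toy two-component data, illustrative settings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1.0, size=(300, 2)),
               rng.normal(4, 1.5, size=(200, 2))])

# EM fit of a two-component Gaussian mixture
gmm = GaussianMixture(n_components=2, covariance_type='full').fit(X)
log_density = gmm.score_samples(X)        # log p(x) under the fitted mixture
responsibilities = gmm.predict_proba(X)   # E-step posteriors p(component | x)
```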
4 / 99
QDA (Quadratic discriminant analysis)
QDA follows a generative approach. The main assumption is that the distribution of events within each class is a multivariate Gaussian.
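A minimal sketch of this approach with scikit-learn's QuadraticDiscriminantAnalysis (the toy dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=2, n_redundant=0,
                           n_informative=2, random_state=1)

# QDA fits one multivariate Gaussian per class and classifies via Bayes' rule
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
proba = qda.predict_proba(X)   # p(y | x) from the fitted class densities
```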
5 / 99
Logistic regression
Decision function: $d(x) = \langle w, x \rangle + w_0$

Sharp rule: $\hat{y} = \operatorname{sgn} d(x)$
6 / 99
Logistic regression
Smooth rule:

$$d(x) = \langle w, x \rangle + w_0, \qquad p_{+1}(x) = \sigma(d(x)), \quad p_{-1}(x) = \sigma(-d(x))$$

Optimizing weights $w, w_0$ to maximize log-likelihood:

$$\mathcal{L} = \sum_{i \in \text{events}} -\ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$$
7 / 99
Logistic loss
The term loss refers to what we are minimizing. Losses typically estimate our risks, denoted as $\mathcal{L}$:

$$\mathcal{L} = \sum_{i \in \text{events}} -\ln(p_{y_i}(x_i)) = \sum_i L(x_i, y_i) \to \min$$

LogLoss penalty for a single observation:

$$L(x_i, y_i) = -\ln(p_{y_i}(x_i)) = \begin{cases} \ln(1 + e^{-d(x_i)}), & y_i = +1 \\ \ln(1 + e^{d(x_i)}), & y_i = -1 \end{cases} = \ln(1 + e^{-y_i d(x_i)})$$

The margin $y_i d(x_i)$ is expected to be high for all events.
8 / 99
Logistic loss
$L(x_i, y_i)$ is a convex function. Simple analysis shows that $\mathcal{L}$ is a sum of convex functions w.r.t. $w$, so the optimization problem has at most one optimum.

Comment: MLE is not guaranteed to be a good choice.
9 / 99
Visualization of logistic regression
10 / 99
Gradient descent
Problem: find $w$ to minimize $\mathcal{L}$.

Gradient descent:

$$w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}$$

$\eta$ is the step size (also called shrinkage, learning rate).
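A numpy sketch of this update applied to the logistic loss above (the learning rate and iteration count are illustrative; the function name is ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_gd(X, y, eta=0.1, n_iter=1000):
    """Full-batch gradient descent for logistic regression.
    X: (N, d) matrix (include a constant column for w0), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # gradient of (1/N) * sum_i ln(1 + exp(-y_i <w, x_i>))
        grad = -(y * sigmoid(-y * (X @ w))) @ X / len(y)
        w -= eta * grad   # w <- w - eta * dL/dw
    return w
```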
11 / 99
Stochastic gradient descent (SGD)
$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

On each iteration make a step using only one event: take a random event $i$ from the training data,

$$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$$
12 / 99
Stochastic gradient descent (SGD)
$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$$

On each iteration make a step using only one event: take a random event $i$ from the training data,

$$w \leftarrow w - \eta \frac{\partial L(x_i, y_i)}{\partial w}$$

Each iteration is done much faster, but the training process is less stable, so smaller steps are taken.
13 / 99
Stochastic gradient descent
We can decrease the learning rate over time: $\eta_t$. At iteration $t$:

$$w \leftarrow w - \eta_t \frac{\partial L(x_i, y_i)}{\partial w}$$

This process converges to a local minimum if:

$$\sum_t \eta_t = \infty, \qquad \sum_t \eta_t^2 < \infty, \qquad \eta_t > 0$$
14 / 99
SGD with momentum
SGD (and GD) has problems with narrow valleys (when the Hessian is very far from identity).

Improvement: use momentum, which accumulates the gradient ($0.9 < \gamma < 1$):

$$v \leftarrow \gamma v + \eta_t \frac{\partial L(x_i, y_i)}{\partial w}, \qquad w \leftarrow w - v$$
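A minimal numpy sketch combining the single-event SGD step with the momentum accumulator above (step sizes, iteration count and the function name are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_sgd_momentum(X, y, eta=0.05, gamma=0.9, n_iter=5000, seed=0):
    """SGD with momentum for the logistic loss; y in {-1, +1}."""
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(n_iter):
        i = rng.randint(len(y))   # one random training event
        grad = -y[i] * sigmoid(-y[i] * (X[i] @ w)) * X[i]
        v = gamma * v + eta * grad   # velocity accumulates gradients
        w = w - v
    return w
```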
15 / 99
Stochastic optimization methods
16 / 99
Stochastic optimization methods
applied to additive loss functions: $\mathcal{L} = \sum_i L(x_i, y_i)$
should be preferred when optimization time is the bottleneck
more advanced modifications exist: AdaDelta, RMSProp, Adam;
these use an adaptive step size (individually for each parameter)
crucial when the scale of gradients is very different
in practice gradients are computed using minibatches (small groups of 16 to 256 samples), not on an event-by-event basis
17 / 99
Polynomial decision rule
$$d(x) = w_0 + \sum_i w_i x_i + \sum_{ij} w_{ij} x_i x_j$$
18 / 99
Polynomial decision rule
is again a linear model, introduce extended set of features:
and reuse logistic regression.
d(x) = + +w0
∑
i
wi xi
∑
ij
wij xi xj
z = {1} ∪ { ∪ {xi }i xi xj }ij
d(x) = =< w, z >
∑
i
wi zi
19 / 99
Polynomial decision rule
is again a linear model, introduce extended set of features:
and reuse logistic regression.
We can add as one more variable to dataset and forget about term:
d(x) = + +w0
∑
i
wi xi
∑
ij
wij xi xj
z = {1} ∪ { ∪ {xi }i xi xj }ij
d(x) = =< w, z >
∑
i
wi zi
= 1x0 w0
d(x) =< w, x >
20 / 99
Polynomial regression
is done in the same way. E.g. to fit a polynomial of one variate, we construct for each event a vector

$$\tilde{x} = (1, x, x^2, x^3, \ldots, x^d)$$

and train a linear regression:

$$d(\tilde{x}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_d x^d$$
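A numpy sketch of this construction (toy data; the degree is illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, 100)

d = 3
# build (1, x, x^2, ..., x^d) for each event
X_poly = np.vander(x, N=d + 1, increasing=True)

# ordinary least squares on the expanded features
w, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
```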
21 / 99
Projecting into the space of higher dimension
SVM with polynomial kernel visualization
22 / 99
Logistic regression overview
classifier based on linear decision rule
training is reduced to convex optimization
stochastic optimization can be used
can handle > 1000 features, but requires regularization (see later)
no interaction between features
other decision rules are achieved by adding new features
23 / 99
Support Vector Machine [Vapnik, Chervonenkis, 1963]
SVM selects a decision rule with maximal possible margin (rule A).
24 / 99
Hinge loss function
SVM uses a different loss function:

$$L_{\text{hinge}}(x_i, y_i) = \max(0, 1 - y_i d(x_i))$$

Margin $y_i d(x_i) > 1 \;\to\;$ no penalty (only signal losses compared on the plot).
25 / 99
Kernel trick
$P$ is a projection operator (which "adds new features"):

$$d(x) = \langle w, x \rangle \;\to\; d(x) = \langle w, P(x) \rangle_{\text{new}}$$

Assume that the optimal $w = \sum_i \alpha_i P(x_i)$ (a combination of support vectors) and look for $\alpha_i$:

$$d(x) = \sum_i \alpha_i \langle P(x_i), P(x) \rangle_{\text{new}} = \sum_i \alpha_i K(x_i, x)$$

We need only the kernel, not the projection operator:

$$K(x, \tilde{x}) = \langle P(x), P(\tilde{x}) \rangle_{\text{new}}$$
26 / 99
Kernel trick
Polynomial kernel:

$$K(x, \tilde{x}) = (1 + x^T \tilde{x})^d$$

The projection contains all monomials up to degree $d$.

A popular kernel is the Gaussian Radial Basis Function:

$$K(x, \tilde{x}) = e^{-c\,||x - \tilde{x}||^2}$$

It corresponds to a projection into Hilbert space. Exercise: find a corresponding projection.
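A scikit-learn sketch of an RBF-kernel SVM (here gamma plays the role of $c$ above; the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# large gamma = small kernel radius = more flexible decision boundary
clf = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)
decision = clf.decision_function(X)   # d(x) = sum_i alpha_i K(x_i, x) + b
```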
27 / 99
SVM + RBF kernel
28 / 99
SVM + RBF kernel
29 / 99
Overfitting
kNN with k = 1 gives ideal classification of training data.

SVM with a small radius of the RBF kernel has the same property.
30 / 99
Overfitting
Same issues for regression. Provided a high enough degree, the polynomial can go through any set of points and get zero error this way.
31 / 99
There are two definitions of overfitting, which often coincide.

Difference-overfitting (academic definition)

There is a significant difference in quality of predictions between train and holdout.

Complexity-overfitting (practitioners' definition)

The formula has too high complexity (e.g. too many parameters); increasing the number of parameters leads to lower quality.
32 / 99
Model selection
Given two models, which one should we select?
33 / 99
Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which give us the ability to predict.

The best model is the model which gives better predictions for new observations.

The simplest way to control this is to check quality on a holdout, a sample not used during training (cross-validation). This gives an unbiased estimate of quality for new data.

estimates have variance
multiple testing introduces bias (solution: train + validation + test, like Kaggle; see the sketch below)
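A sketch of the train + validation + test recipe with scikit-learn (toy data, illustrative split sizes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 60% train, 20% validation (for model selection), 20% test (touched once)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```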
34 / 99
Difference-overfitting is inessential, provided that we measure quality on a
holdout sample (though easy to check and sometimes helpful).
Complexity-overfitting is a problem — we need to test different parameters for
optimality (more examples through the course).
35 / 99
Difference-overfitting is inessential, provided that we measure quality on a
holdout sample (though easy to check and sometimes helpful).
Complexity-overfitting is a problem — we need to test different parameters for
optimality (more examples through the course).
Don't use distribution comparison to detect overfitting
36 / 99
$n^2$-minutes break
37 / 99
Reminder: linear regression
We can use a linear function for regression:

$$d(x) = \langle w, x \rangle$$

Minimize MSE:

$$\mathcal{L} = \sum_i (d(x_i) - y_i)^2 \to \min$$

Explicit solution:

$$\Big(\sum_i x_i x_i^T\Big)\, w = \sum_i y_i x_i$$
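The explicit solution as a short numpy sketch (toy data):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, 200)

# (sum_i x_i x_i^T) w = sum_i y_i x_i  <=>  (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
```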
38 / 99
Regularization: motivation
When the number of parameters is high (compared to the number of observations):

it is hard to estimate all parameters reliably
linear regression with MSE: in $d$-dimensional space you can find a hyperplane through any $d$ points
non-unique solution if $n < d$: the matrix $\sum_i x_i x_i^T$ degenerates

Solution 1: manually decrease the dimensionality of the problem
Solution 2: use regularization
39 / 99
Regularization
When the number of parameters in the model is high, overfitting is very probable.

Solution: add a regularization term to the loss function:

$$\mathcal{L} = \frac{1}{N} \sum_i L(x_i, y_i) + \mathcal{L}_{\text{reg}} \to \min$$

$L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2$

$L_1$ regularization: $\mathcal{L}_{\text{reg}} = \beta \sum_j |w_j|$

$L_1 + L_2$ regularization: $\mathcal{L}_{\text{reg}} = \alpha \sum_j |w_j|^2 + \beta \sum_j |w_j|$
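A scikit-learn sketch of the three regularizations (alpha values are illustrative; scikit-learn's exact scaling of the penalty differs slightly from the formulas above):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=50, noise=1.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty on the weights
lasso = Lasso(alpha=0.1).fit(X, y)    # L1: drives many w_j exactly to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # L1 + L2
print((lasso.coef_ == 0).sum(), "coefficients zeroed by L1")
```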
40 / 99
$L_2$, $L_1$ regularizations

Dependence of parameters (components of $w$) on the regularization (stronger regularization to the left):

$L_2$ regularization (solid), $L_1 + L_2$ (dashed)
41 / 99
Regularizations
$L_1$ regularization encourages sparsity (many coefficients in $w$ turn to zero)
42 / 99
$L_p$ regularizations

$$\mathcal{L}_p = \sum_i |w_i|^p$$

What is the expression for $L_0$?

$$L_0 = \sum_i [w_i \neq 0]$$

But nobody uses $L_p$ with $0 < p < 1$, not even $L_0$. Why?
43 / 99
$L_p$ regularizations

$$\mathcal{L}_p = \sum_i |w_i|^p$$

What is the expression for $L_0$?

$$L_0 = \sum_i [w_i \neq 0]$$

But nobody uses $L_p$ with $0 < p < 1$, not even $L_0$. Why?

Because it is not convex.
44 / 99
Regularization summary
important tool to fight overfitting (= poor generalization on new data)
different modifications exist for other models
makes it possible to handle a very large number of features
machine learning should detect important features itself
from the mathematical point of view: turns a convex problem into a strongly convex one (NB: only for linear models)
from the practical point of view: softly limits the space of parameters
breaks the scale-invariance of linear models
45 / 99
SVM and regularization
The width of the margin is $\frac{1}{||w||}$, so the SVM loss is actually:

$$\mathcal{L} = \frac{1}{2} ||w||^2 + C \sum_i L_{\text{hinge}}(x_i, y_i)$$

the first term maximizes the margin
the second term penalizes samples that are not on the correct side of the margin
$C$ controls the trade-off
46 / 99
Linear models summary
linear decision function at the core
training is reduced to optimization problems
losses are additive: $\mathcal{L} = \sum_i L(x_i, y_i)$
stochastic optimization is applicable
nonlinear decisions w.r.t. the original features are supported via kernels
apply regularization to avoid degenerate solutions and overfitting
47 / 99
Decision Trees
48 / 99
Decision tree
Example: predict outside play based on weather conditions.
49 / 99
Decision tree: binary tree
50 / 99
Decision tree: splitting space
51 / 99
Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
52 / 99
Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
building a tree using a greedy optimization
start from the root (a tree with only one leaf)
each time split one leaf into two
repeat process for children if needed
53 / 99
Decision tree
fast & intuitive prediction
but building an optimal decision tree is an NP-complete problem
building a tree using a greedy optimization
start from the root (a tree with only one leaf)
each time split one leaf into two
repeat process for children if needed
need a criterion to select best splitting (feature and threshold)
54 / 99
Splitting criterion
Several impurity functions:

$$\text{TreeImpurity} = \sum_{\text{leaf}} \text{impurity(leaf)} \times \text{size(leaf)}$$

Misclass. $= \min(p, 1 - p)$
Gini $= p(1 - p)$
Entropy $= -p \log p - (1 - p) \log(1 - p)$

where $p$ is the portion of signal events in a leaf, $1 - p$ is the portion of background events, and size(leaf) is the number of training events in the leaf.
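The three impurity functions and the tree impurity above, written out in plain numpy (a sketch; function names are ours):

```python
import numpy as np

def misclass(p):
    return np.minimum(p, 1 - p)

def gini(p):
    return p * (1 - p)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def tree_impurity(leaf_signal_fractions, leaf_sizes, impurity=gini):
    """TreeImpurity = sum over leaves of impurity(leaf) * size(leaf)."""
    return sum(impurity(p) * n for p, n in zip(leaf_signal_fractions, leaf_sizes))
```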
55 / 99
Splitting criterion
Impurity as a function of p
56 / 99
Splitting criterion: why not misclassification?
57 / 99
Decision trees for regression
Greedy optimization (minimizing MSE):

$$\text{TreeMSE} \sim \sum_i (y_i - \hat{y}_i)^2$$

Can be rewritten as:

$$\text{TreeMSE} \sim \sum_{\text{leaf}} \text{MSE(leaf)} \times \text{size(leaf)}$$

$\text{MSE(leaf)}$ is like an 'impurity' of the leaf:

$$\text{MSE(leaf)} = \frac{1}{\text{size(leaf)}} \sum_{i \in \text{leaf}} (y_i - \hat{y}_i)^2$$
58 / 99
59 / 99
60 / 99
61 / 99
62 / 99
63 / 99
Decision trees instability
A little variation in the training dataset produces a different classification rule.
64 / 99
Tree keeps splitting until each event is correctly classified:
65 / 99
Pre-stopping
We can stop the process of splitting by imposing different restrictions:
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in leaf
more advanced: maximal number of leaves in tree
66 / 99
Pre-stopping
We can stop the process of splitting by imposing different restrictions:
limit the depth of tree
set minimal number of samples needed to split the leaf
limit the minimal number of samples in leaf
more advanced: maximal number of leaves in tree
Any combination of the rules above is possible; the sketch below shows them as scikit-learn parameters.
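A sketch of the four restrictions as scikit-learn parameters (the values are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,            # limit the depth of the tree
    min_samples_split=20,   # minimal number of samples needed to split a leaf
    min_samples_leaf=10,    # minimal number of samples in a leaf
    max_leaf_nodes=32,      # maximal number of leaves in the tree
)
```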
67 / 99
(figure panels: no pre-stopping, max_depth, min # of samples in leaf, maximal number of leaves)
68 / 99
Post-pruning
When a tree is already built, we can try to optimize it to simplify the formula.
Generally, this is much slower than pre-stopping.
69 / 99
70 / 99
71 / 99
Decision tree overview
1. Very intuitive algorithm for regression and classification
2. Fast prediction
3. Scale-independent
4. Supports multiclassification
But
1. Training an optimal tree is NP-complete
2. Trained greedily by optimizing Gini index or entropy (fast!)
3. Unstable
4. Uses only trivial conditions
72 / 99
Missing values in decision trees
If the event being predicted lacks feature $x_1$, we use prior probabilities.
73 / 99
Feature importances
Different approaches exist to measure the importance of a feature in the final model.

Importance of a feature ≠ the quality provided by that feature alone.
74 / 99
Feature importances
tree: counting number of splits made over this feature
75 / 99
Feature importances
tree: counting number of splits made over this feature
tree: counting gain in purity (e.g. Gini)
fast and adequate
76 / 99
Feature importances
tree: counting number of splits made over this feature
tree: counting gain in purity (e.g. Gini)
fast and adequate
model-agnostic recipe: train without one feature,
compare quality on test with/without one feature
requires many evaluations
77 / 99
Feature importances
tree: counting number of splits made over this feature
tree: counting gain in purity (e.g. Gini)
fast and adequate
model-agnostic recipe: train without one feature,
compare quality on test with/without one feature
requires many evaluations
model-agnostic recipe: feature shuffling
take one column in test dataset and shuffle it. Compare quality with/without
shuffling.
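A sketch of the shuffling recipe, assuming a fitted binary classifier with predict_proba and ROC AUC as the quality measure (the function name is ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shuffled_importance(model, X_test, y_test, feature, seed=0):
    """Quality drop when one test column is shuffled (binary classification)."""
    rng = np.random.RandomState(seed)
    base = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    X_shuffled = X_test.copy()
    rng.shuffle(X_shuffled[:, feature])   # break the feature-target association
    shuffled = roc_auc_score(y_test, model.predict_proba(X_shuffled)[:, 1])
    return base - shuffled                # large drop => important feature
```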
78 / 99
Ensembles
79 / 99
Composition of models
Basic motivation: improve quality of classification by reusing strong sides of
different classifiers / regressors.
80 / 99
Simple Voting
Averaging predictions:

$$\hat{y} = [-1, +1, +1, +1, -1] \;\Rightarrow\; P_{+1} = 0.6, \quad P_{-1} = 0.4$$

Averaging predicted probabilities:

$$P_{\pm 1}(x) = \frac{1}{J} \sum_{j=1}^{J} p_{\pm 1, j}(x)$$

Averaging decision functions:

$$D(x) = \frac{1}{J} \sum_{j=1}^{J} d_j(x)$$
81 / 99
Weighted voting
A way to introduce the importance of classifiers:

$$D(x) = \sum_j \alpha_j d_j(x)$$

General case of ensembling:

$$D(x) = f(d_1(x), d_2(x), \ldots, d_J(x))$$
82 / 99
Problems
base classifiers are very close to each other
need to keep variation between them
while still having good quality of the basic classifiers
83 / 99
Decision tree reminder
84 / 99
Generating training subset
subsampling: taking a fixed part of the samples (sampling without replacement)
bagging (Bootstrap AGGregating): sampling with replacement

If #generated samples = length of the dataset, the fraction of unique samples in the new dataset is $1 - \frac{1}{e} \approx 63.2\%$ (see the sketch below).
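A quick numerical check of the $1 - 1/e$ fraction (a sketch):

```python
import numpy as np

rng = np.random.RandomState(0)
N = 100000
indices = rng.randint(0, N, size=N)    # sampling with replacement
unique_fraction = len(np.unique(indices)) / N
print(unique_fraction, 1 - 1 / np.e)   # both ~ 0.632
```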
85 / 99
Random subspace model (RSM)
Generating a subspace of features by taking a random subset of the features
86 / 99
Random Forest [Leo Breiman, 2001]
Random forest is a composition of decision trees.
Each individual tree is trained on a subset of training data obtained by
bagging samples
taking random subset of features
Predictions of random forest are obtained via simple voting.
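A minimal scikit-learn sketch (toy data; note that scikit-learn draws the random feature subset at each split rather than once per tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True bags the samples; max_features subsamples features per split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                bootstrap=True, n_jobs=-1, random_state=0)
forest.fit(X, y)
proba = forest.predict_proba(X)   # averaged votes of the individual trees
```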
87 / 99
data optimal boundary
88 / 99
data optimal boundary
50 trees
89 / 99
data optimal boundary
50 trees 2000 trees
90 / 99
Overfitting
91 / 99
Overfitting
92 / 99
Overfitting
overfitted (in the sense that predictions for train and test are different)
doesn't overfit: increasing complexity (adding more trees) doesn't spoil the classifier
93 / 99
Works with features of different nature
Stable to noise in data
94 / 99
Works with features of different nature
Stable to noise in data
From 'Testing 179 Classifiers on 121 Datasets'
The classifiers most likely to be the bests are the random forest (RF)
versions, the best of which [...] achieves 94.1% of the maximum accuracy
overcoming 90% in the 84.3% of the data sets.
95 / 99
Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
96 / 99
Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter:
number of features used in each tree
Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$
97 / 99
Random Forest overview
Impressively simple
Trees can be trained in parallel
Doesn't overfit
Doesn't require much tuning
Effectively only one parameter:
number of features used in each tree
Recommendation: $N_{\text{used}} = \sqrt{N_{\text{features}}}$

Hardly interpretable
Trained trees take much space, some kind of pre-stopping is required in practice
Doesn't fix mistakes done by previous trees
98 / 99
99 / 99