3. 3!
q Supervised learning: The problem is presented with
example inputs and their desired outputs and the goal is
to learn a general rule that maps inputs to outputs.
o Classification: Output is categorical
o Regression: Output is numerical
q Unsupervised learning: No labels are given to the
learning algorithm, leaving it on its own to find structure
in its input.
o Clustering, association rules…
q Reinforcement learning: A computer program interacts
with a dynamic environment in which it must achieve a
certain goal, e.g., driving a vehicle or playing video games.
Machine Learning Tasks
4. 4!
q Classification: problem of identifying to which of a
set of categories a new observation belongs, on the
basis of a training set of data containing
observations whose category membership is
known.
o Spam filters
o Face identification
o Diagnosis of patients
Evaluation of Classifiers
5. 5!
q Given a set S of n instances, we define
classification error:
Evaluation of Classifiers
$\mathrm{error}_S(h) = \frac{1}{n}\sum_{x \in S} \delta(f(x), h(x))$
where δ(a,b) = 0 if a = b and 1 otherwise.
Predicted class (h(x))   Actual class (f(x))   Error
Buy                      Buy                   No
No Buy                   Buy                   Yes
Buy                      No Buy                Yes
Buy                      Buy                   No
No Buy                   No Buy                No
No Buy                   Buy                   Yes
No Buy                   No Buy                No
Buy                      Buy                   No
Buy                      Buy                   No
No Buy                   No Buy                No
Error = Mistakes / Total = 3/10 = 0.3
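A minimal Python sketch (not part of the original slides) of this 0/1 error on the ten Buy / No Buy predictions above:

```python
# 0/1 classification error: error(h) = (1/n) * sum of delta(f(x), h(x))
actual    = ["Buy", "Buy", "No Buy", "Buy", "No Buy",
             "Buy", "No Buy", "Buy", "Buy", "No Buy"]
predicted = ["Buy", "No Buy", "Buy", "Buy", "No Buy",
             "No Buy", "No Buy", "Buy", "Buy", "No Buy"]

# delta(f, h) is 1 whenever the prediction disagrees with the actual class
mistakes = sum(1 for f, h in zip(actual, predicted) if f != h)
print(mistakes / len(actual))  # 0.3
```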
6. 6!
q Common solution:
o Split between training and test data
Split the data
[Diagram: the available data is split into a training set and a test set; the algorithms are trained on the training data, the resulting models are evaluated on the test data, and the best model is selected.]
$\mathrm{error}_S(h) = \frac{1}{n}\sum_{x \in S}\left(f(x) - h(x)\right)^2$
What if there is not much data available?
GOLDEN RULE: Never use the same example
for training the model and evaluating it!!
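A minimal sketch of the hold-out split, assuming scikit-learn is available; the bundled dataset, the classifier and the split proportion are arbitrary illustrative choices, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold-out split: the test instances are never seen during training (golden rule)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("test error:", 1 - accuracy_score(y_test, model.predict(X_test)))
```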
7. 7!
q Too much training data: poor evaluation
q Too much test data: poor training
q Can we have more training data and more test data
without breaking the golden rule?
o Repeat the experiment!
ü Bootstrap: we draw n samples with replacement to form the
training set and test with the instances left out.
ü Cross-validation: data is split into n folds of equal size; each fold
is used once for testing and the remaining folds for training.
Taking the most of the data
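A minimal sketch of 10-fold cross-validation, again assuming scikit-learn; the classifier and the number of folds are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each instance is used for testing exactly once and for training in the other
# 9 folds, so no example is ever used for training and testing at the same time.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("mean accuracy:", scores.mean(), "mean error:", 1 - scores.mean())
```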
8. 8!
q What dataset do we use to estimate all previous
metrics?
o If we use all the data to train the models and also to evaluate them, we
get overoptimistic estimates of their performance:
ü Over-fitting
o If we try to compensate by generalising the model (e.g.,
pruning a tree), we may get:
ü Under-fitting
o How can we find a trade-off?
Overfitting?
9. 9!
q Confusion (contingency) matrix:
o We can observe how errors are distributed.
Confusion Matrix
             Actual
Pred.        Buy   No Buy
Buy            4        1
No Buy         2        3
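A minimal sketch (illustrative, not from the slides) that builds this matrix from the same ten predictions, with rows as the predicted class and columns as the actual class:

```python
# Same actual/predicted lists as in the classification error sketch above
actual    = ["Buy", "Buy", "No Buy", "Buy", "No Buy",
             "Buy", "No Buy", "Buy", "Buy", "No Buy"]
predicted = ["Buy", "No Buy", "Buy", "Buy", "No Buy",
             "No Buy", "No Buy", "Buy", "Buy", "No Buy"]
classes = ["Buy", "No Buy"]

matrix = [[0, 0], [0, 0]]
for f, h in zip(actual, predicted):
    matrix[classes.index(h)][classes.index(f)] += 1   # row = predicted, col = actual
print(matrix)  # [[4, 1], [2, 3]]
```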
13. 13!
q Common measures in IR: Precision and Recall
q Are defined in terms of a set of retrieved documents
and a set of relevant documents.
o Precision: Fraction of retrieved documents that are
relevant to the query
o Recall: Fraction of all relevant documents that are
returned by the search
q Both measures are usually combined in one
(harmonic mean):
Information Retrieval
$F\text{-measure} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2\,TP}{2\,TP + FP + FN}$
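A small sketch computing the three measures, here treating "Buy" as the positive (relevant) class of the earlier confusion matrix, so TP = 4, FP = 1, FN = 2:

```python
# Precision, recall and F-measure from the counts of a 2-class confusion matrix
TP, FP, FN = 4, 1, 2

precision = TP / (TP + FP)                   # fraction of retrieved that are relevant
recall    = TP / (TP + FN)                   # fraction of relevant that are retrieved
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)          # equals 2*TP / (2*TP + FP + FN)
```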
14. 14!
q Confusion (contingency) tables can be multiclass
q Measures based on 2-class matrices are computed
o 1 vs all (average N partial measures)
o 1 vs 1 (average N*(N-1)/2 partial measures)
ü Weighted average?
Multiclass problems
ERROR                  actual
             low   medium   high
predicted
low           20        0     13
medium         5       15      4
high           4        7     60
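A minimal sketch of the 1-vs-all decomposition on this matrix (numpy assumed): each class in turn is taken as the positive class, a partial precision and recall are computed, and the N partial measures can then be averaged:

```python
import numpy as np

# 3-class confusion matrix from the slide (rows: predicted, columns: actual)
classes = ["low", "medium", "high"]
cm = np.array([[20,  0, 13],
               [ 5, 15,  4],
               [ 4,  7, 60]])

for k, name in enumerate(classes):
    tp = cm[k, k]
    fp = cm[k, :].sum() - tp        # predicted as k but actually another class
    fn = cm[:, k].sum() - tp        # actually k but predicted as another class
    print(name, "precision:", tp / (tp + fp), "recall:", tp / (tp + fn))
```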
15. 15!
q In some cases there are important differences among
the class proportions
o A naïve classifier that always predicts the majority class
(ignoring minority classes) obtains good performance
ü In a binary problem (+/-) with 1% of negative instances,
the model “always positive” gets an accuracy of 99%.
q Macro-accuracy: Average of accuracy per class
o The naïve classifier gets a macro-accuracy=0.5
Imbalanced Datasets
$\mathrm{macroacc}(h) = \frac{\dfrac{hits_{class_1}}{total_{class_1}} + \dfrac{hits_{class_2}}{total_{class_2}} + \dots + \dfrac{hits_{class_m}}{total_{class_m}}}{m}$
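A minimal numpy sketch (illustrative, not from the slides) contrasting accuracy and macro-accuracy for the "always positive" classifier on the 99%/1% problem above:

```python
import numpy as np

# 99 positive and 1 negative instance; the naive model always predicts "+"
y_true = np.array(["+"] * 99 + ["-"])
y_pred = np.array(["+"] * 100)

accuracy = np.mean(y_true == y_pred)                 # 0.99

# Macro-accuracy: average of the per-class accuracies (hits_c / total_c)
per_class = [np.mean(y_pred[y_true == c] == c) for c in ["+", "-"]]
macro_accuracy = np.mean(per_class)                  # (1.0 + 0.0) / 2 = 0.5
print(accuracy, macro_accuracy)
```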
16. q Crisp and Soft Classifiers:
o A “hard” or “crisp” classifier predicts a class between a
set of possible classes.
o A “soft” or “scoring” classifier (probabilistic) predicts a
class, but accompanies each prediction with an
estimation of the reliability (confidence) of each
prediction.
ü Most learning methods can be adapted to generate this
confidence value.
Soft classifiers
16!
17. q A special kind of soft classifier is a class probability
estimator.
o Instead of predicting “a”, “b” or “c”, it gives a
probability estimation for “a”, “b” or “c”, i.e., “pa”, “pb”
and “pc”.
ü Example:
v Classifier 1: pa = 0.2, pb = 0.5 and pc = 0.3.
v Classifier 2: pa = 0.3, pb = 0.4 and pc = 0.3.
o Both predict “b”, but classifier 1 is more confident.
Soft classifiers
17!
18. q Probabilistic classifiers: Classifiers that are able to
predict a probability distribution over a set of
classes
o Provide classification with a degree of certainty:
ü Combining classifiers
ü Cost sensitive contexts
q Mean Squared Error (Brier Score)
q Log Loss
Evaluating probabilistic classifiers
18!
$\mathrm{MSE}(h) = \frac{1}{n}\sum_{i \in S}\sum_{j \in C}\left(f(i,j) - p(i,j)\right)^2$
f(i,j) = 1 if instance i is of class j, 0 otherwise.
p(i,j) is the probability estimated by the classifier for instance i belonging to class j.
$\mathrm{LogLoss}(h) = -\frac{1}{n}\sum_{i \in S}\sum_{j \in C} f(i,j) \cdot \log_2 p(i,j)$
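A minimal numpy sketch of both measures on a tiny invented 3-class example (the instances and probabilities are illustrative; the base-2 logarithm follows the formula above):

```python
import numpy as np

# Two instances, three classes a/b/c.
# f: one-hot true class membership, p: probabilities estimated by the classifier.
f = np.array([[0, 1, 0],           # instance 1 is of class b
              [1, 0, 0]])          # instance 2 is of class a
p = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.3, 0.1]])
n = len(f)

mse = np.sum((f - p) ** 2) / n                   # MSE / Brier score
logloss = -np.sum(f * np.log2(p)) / n            # log loss, base-2 logarithm
print(mse, logloss)
```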
19. q MSE or Brier Score can be decomposed into two
factors:
o BS=CAL+REF
ü Calibration: Measures the quality of classifier scores wrt
class membership probabilities
ü Refinement: it is an aggregation of resolution and uncertainty,
and is related to the area under the ROC Curve.
q Calibration Methods:
o Try to transform classifier scores into class membership
probabilities
ü Platt scaling, Isotonic Regression, PAVcal..
Evaluating probabilistic classifiers
19!
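A minimal sketch of one of the calibration methods named above, isotonic regression, assuming scikit-learn; the scores and labels are invented for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw classifier scores for the positive class and the true 0/1 class labels
scores = np.array([0.2, 0.3, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])
y      = np.array([0,   0,   1,   0,    1,   1,   1,   1  ])

# Isotonic regression fits a non-decreasing mapping from classifier scores
# to class membership probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(scores, y)
print(calibrator.predict(np.array([0.35, 0.65])))
```

scikit-learn also provides CalibratedClassifierCV, which wraps a classifier with Platt scaling ("sigmoid") or isotonic calibration.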
21. q “Rankers”:
o Whenever we have a probability estimator for a two-
class problem:
ü pa = x, then pb = 1 ‒ x.
o Let’s call one class 0 (neg) and the other class 1 (pos).
o A ranker is a soft classifier that gives a value (score)
monotonically related to the probability of class 1.
ü Examples:
v Probability of a customer buying a product.
v Probability of a message being spam....
Soft classifiers
21!
22. 22!
q We can rank instances according to estimated
probability
o CRM: You are interested in the top % of potential
customers
q Measures for ranking
o AUC: Area Under the ROC Curve
o Distances between perfect ranking and estimated
ranking
Evaluation of Rankers
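A minimal sketch computing AUC from ranking scores, assuming scikit-learn; the labels and scores are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Scores monotonically related to the probability of the positive class
# (e.g. the probability that a customer buys the product)
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9])

# AUC = probability that a randomly chosen positive instance is ranked
# above a randomly chosen negative instance.
print(roc_auc_score(y_true, y_score))
```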
23. 23!
q Regression: In this case the variable to be predicted
is a continuous value.
o Predicting daily value of stocks in NASDAQ
o Forecasting number of docks available in a Valenbisi
station in the next hour
o Predicting the amount of beer sold in the next month by
a retail company
Regression
24. 24!
q Given a set S of n instances,
o Mean Absolute Error: $\mathrm{MAE}_S(h) = \frac{1}{n}\sum_{x \in S}\left|f(x) - h(x)\right|$
o Mean Squared Error: $\mathrm{MSE}_S(h) = \frac{1}{n}\sum_{x \in S}\left(f(x) - h(x)\right)^2$
o Root Mean Squared Error: $\mathrm{RMSE}_S(h) = \sqrt{\frac{1}{n}\sum_{x \in S}\left(f(x) - h(x)\right)^2}$
ü MSE is more sensitive to extreme values
ü RMSE and MAE are in the same magnitude as the actual values
Evaluation of Regressors
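A minimal numpy sketch of the three measures on invented actual and predicted values (not from the slides):

```python
import numpy as np

f = np.array([500.0, 120.0, 300.0, 250.0])   # actual values f(x)
h = np.array([450.0, 140.0, 310.0, 200.0])   # predictions  h(x)
n = len(f)

mae  = np.sum(np.abs(f - h)) / n
mse  = np.sum((f - h) ** 2) / n
rmse = np.sqrt(mse)
print(mae, mse, rmse)   # RMSE and MAE are in the same units as f(x)
```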
26. 26!
q Sometimes relative error values are more
appropriate:
o 10% for an error of 50 when predicting 500
q How much does the scheme improve on simply
predicting the average?
o Relative Mean Squared Error: $\mathrm{RSE}_S(h) = \frac{\sum_{x \in S}\left(f(x) - h(x)\right)^2}{\sum_{x \in S}\left(f(x) - \bar{f}\right)^2}$
o Relative Mean Absolute Error: $\mathrm{RAE}_S(h) = \frac{\sum_{x \in S}\left|f(x) - h(x)\right|}{\sum_{x \in S}\left|f(x) - \bar{f}\right|}$
ü where $\bar{f}$ is the mean of the actual values in S
Evaluation of Regressors
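A minimal numpy sketch of both relative errors on the same invented values; the denominators are the errors of the trivial model that always predicts the average:

```python
import numpy as np

f = np.array([500.0, 120.0, 300.0, 250.0])   # actual values
h = np.array([450.0, 140.0, 310.0, 200.0])   # predictions
f_mean = f.mean()

rse = np.sum((f - h) ** 2) / np.sum((f - f_mean) ** 2)
rae = np.sum(np.abs(f - h)) / np.sum(np.abs(f - f_mean))
print(rse, rae)
```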
27. 27!
q A related measure is R2 (coefficient of
determination)
o A number that indicates how well the data fit a statistical
model
ü Sum of squares of residuals: $SSres_S(h) = \sum_{x \in S}\left(f(x) - h(x)\right)^2$
ü Total sum of squares: $SStot_S(h) = \sum_{x \in S}\left(f(x) - \bar{f}\right)^2$
ü Coefficient of determination: $R^2_S(h) = 1 - \frac{SSres_S(h)}{SStot_S(h)}$
Evaluation of Regressors
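A minimal numpy sketch of R² on the same invented values; note that with these definitions R² equals 1 minus the relative squared error of the previous slide:

```python
import numpy as np

f = np.array([500.0, 120.0, 300.0, 250.0])   # actual values
h = np.array([450.0, 140.0, 310.0, 200.0])   # predictions

ss_res = np.sum((f - h) ** 2)           # sum of squares of residuals
ss_tot = np.sum((f - f.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                # coefficient of determination
print(r2)
```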
28. 28!
q Association Rules: task of discovering interesting
relations between variables in databases.
q Common metrics:
o Support: Estimates the popularity of a rule
o Confidence: Estimates the reliability of a rule
q Rules are ordered according to measures that
combine both values
q No train/test partition is used.
Unsupervised Learning
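A minimal sketch (items and transactions invented for illustration) computing support and confidence for a single rule:

```python
# Support and confidence of the rule {bread} -> {butter} on a toy transaction list
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "beer"},
]
antecedent, consequent = {"bread"}, {"butter"}

n = len(transactions)
n_both = sum(1 for t in transactions if antecedent | consequent <= t)
n_ante = sum(1 for t in transactions if antecedent <= t)

support = n_both / n            # popularity: how often the whole rule occurs
confidence = n_both / n_ante    # reliability: how often the consequent follows
print(support, confidence)      # 0.6, 0.75
```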
29. 29!
q Clustering: task of grouping a set of objects in such
a way that objects in the same group (cluster) are
more similar to each other than to those in other
clusters.
o A task that is difficult to evaluate
q Some evaluation measures based on distance:
o Distance among borders of clusters
o Distance among centers (centroids) of clusters
o Radius and density of clusters
Unsupervised Learning
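A minimal numpy sketch of two of these distance-based quantities, centroid separation and cluster radius, on invented 2-D points:

```python
import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5]])
cluster_b = np.array([[6.0, 7.0], [7.0, 6.5], [6.5, 7.5]])

# Distance between the centroids of the two clusters
centroid_a, centroid_b = cluster_a.mean(axis=0), cluster_b.mean(axis=0)
separation = np.linalg.norm(centroid_a - centroid_b)

# Radius: mean distance of a cluster's points to its own centroid
radius_a = np.mean(np.linalg.norm(cluster_a - centroid_a, axis=1))
radius_b = np.mean(np.linalg.norm(cluster_b - centroid_b, axis=1))
print(separation, radius_a, radius_b)
```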
30. q Model evaluation is a fundamental phase in the
knowledge discovery process.
q In classification, depending on the feature and
context we want to analyse, we need to use the
proper metric.
q Several (and sometimes equivalent) measures for
regression models.
q Supervised models are easier to evaluate since we
have an estimate of the ground truth.
Lessons learned
30!
31. q Witten, Ian H., and Eibe Frank. Data Mining: Practical machine learning tools
and techniques. Morgan Kaufmann, 2005.
q Flach, P.A. (2012) “Machine Learning: The Art and Science of Algorithms that
Make Sense of Data”, Cambridge University Press.
q Hand, D.J. (1997) “Construction and Assessment of Classification Rules”,
Wiley.
q César Ferri, José Hernández-Orallo, R. Modroiu (2009): An experimental
comparison of performance measures for classification. Pattern Recognition
Letters 30(1): 27-38
q Nathalie Japkowicz, Mohak Shah (2011) “Evaluating Learning Algorithms: A
Classification Perspective”, Cambridge University Press 2011.
q Antonio Bella, Cèsar Ferri Ramirez, José Hernández-Orallo, M. José Ramírez-
Quintana: On the effect of calibration in classifier combination. Appl. Intell.
38(4): 566-585 (2013)
q José Hernández-Orallo, Peter A. Flach, Cèsar Ferri Ramirez:Brier Curves: a
New Cost-Based Visualisation of Classifier Performance. ICML 2011: 585-592
To know more
31!