ML Model Performance Evaluation
1. ML-Chapter Four: Model Performance Evaluation
Belay E., Asst. Prof.
e-mail: belayenyew@gmail.com
Mobile: 0946235206
University of Gondar
College of Informatics
Department of Information Technology
2. Evaluating Model Performance
• In experimental machine learning, we evaluate the accuracy of a model empirically.
• Evaluation is used to compare classifiers/learning algorithms.
• What is an evaluation metric?
o A way to quantify the performance of a machine learning model.
o It is used to evaluate the performance of a machine learning model and to justify using one model in place of another.
• For classification:
o Confusion matrix, accuracy, precision, recall, specificity, F1 score, Precision-Recall (PR) curve, and ROC (Receiver Operating Characteristic) curve.
• For prediction (regression): MAE, MSE, RMSE, and R².
3. Evaluating Classification Model Performance
• Consider a binary classification problem, such as deciding whether a patient has cancer (positive) or is healthy (negative).
• Some common terminology:
o True positives (TP): predicted positive and actually positive.
o False positives (FP): predicted positive but actually negative.
o True negatives (TN): predicted negative and actually negative.
o False negatives (FN): predicted negative but actually positive.
4. Evaluating Model Performance
• Confusion matrix: the above terms are summarized in a confusion matrix.
• TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the classifier is getting things wrong. All four counts can be tallied directly from the predicted and actual labels, as in the sketch below.
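As a quick illustration (not from the slides), here is a minimal Python sketch that tallies the four counts for a binary problem, assuming labels are encoded as 1 (positive) and 0 (negative):

```python
# Tally TP, FP, TN, FN for binary labels encoded as 1 (positive) / 0 (negative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Example with six test tuples.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```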
5. Evaluating Model Performance
• Accuracy (recognition rate): the number of correct predictions divided by the total number of tuples in the test dataset.
o The most commonly used metric to judge a model.
o It is not a clear indicator of performance when the classes are imbalanced.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Error rate (misclassification rate): error rate = 1 - accuracy, or
Error rate = (FP + FN) / (TP + FP + TN + FN)
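A minimal Python sketch of both formulas; the counts shown are taken from the buys-computer example on the next slide:

```python
# Accuracy and error rate from the four confusion-matrix counts.
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def error_rate(tp, fp, tn, fn):
    return (fp + fn) / (tp + fp + tn + fn)

print(accuracy(6954, 412, 2588, 46))    # 0.9542
print(error_rate(6954, 412, 2588, 46))  # 0.0458
```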
6. Example of Confusion Matrix
• An example of a confusion matrix for the two classes buys-computer = yes (positive) and buys-computer = no (negative):

                        Predicted yes   Predicted no   Total
Actual yes (positive)        6954             46        7000
Actual no (negative)          412           2588        3000
Total                        7366           2634       10000

• Accuracy = (TP + TN)/total = (6954 + 2588)/10000 = 0.9542 ≈ 0.95
• Error rate (misclassification rate) = 1 - accuracy, or
Error rate = (FP + FN)/total = (412 + 46)/10000 = 0.0458 ≈ 0.05
7. Precision and Recall
• Precision: the percentage of positive predictions that are actually positive, i.e., what percentage of tuples labeled as positive are truly positive.
Precision = TP / (TP + FP)
• Recall / Sensitivity / True Positive Rate (TPR): the percentage of actual positive instances that are labeled as positive.
Recall = TPR = Sensitivity = TP / (TP + FN)
o The perfect score is 1.0.
• Specificity: the percentage of actual negative instances that are labeled as negative.
Specificity = TN / (TN + FP)
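A minimal Python sketch of the three metrics, assuming the four confusion-matrix counts are already tallied:

```python
# Precision, recall (sensitivity/TPR), and specificity (TNR).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)
```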
8. Precision and Recall: Example
• Based on the previous confusion matrix:
o Precision = 6954/(6954 + 412) = 0.944 = 94.4%
o Recall = 6954/(6954 + 46) = 0.9934 = 99.34%
o Specificity = 2588/(2588 + 412) = 0.8627 = 86.27%
• A perfect precision score of 1.0 for a class C means that every tuple the classifier labeled as belonging to class C does indeed belong to class C. However, it does not tell us anything about the number of class C tuples that the classifier mislabeled.
• A perfect recall score of 1.0 for C means that every item from class C was labeled as such, but it does not tell us how many other tuples were incorrectly labeled as belonging to class C.
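A quick self-contained check of these figures in Python:

```python
# Check the worked example: TP = 6954, FP = 412, FN = 46, TN = 2588.
tp, fp, fn, tn = 6954, 412, 46, 2588
print(round(tp / (tp + fp), 4))  # precision   = 0.9441
print(round(tp / (tp + fn), 4))  # recall      = 0.9934
print(round(tn / (tn + fp), 4))  # specificity = 0.8627
```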
9. F Measure (F1 Score)
• F measure (F1 or F-score): the harmonic mean of precision and recall.
o It is an alternative way to use precision and recall, combining them into a single measure.
F1 = (2 × precision × recall) / (precision + recall)
• From the previous example: F1 = (2 × 0.944 × 0.9934)/(0.944 + 0.9934) = 1.8755/1.9374 ≈ 0.968
• The higher the F1 score, the better.
• Exercise. Suppose:
o Model 1: recall = 70% and precision = 60%
o Model 2: recall = 80% and precision = 50%
Which model is better? Use the F measure (worked in the sketch below).
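A quick Python check of the exercise; the harmonic mean rewards the model with the more balanced precision and recall:

```python
# F1 as the harmonic mean of precision p and recall r.
def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(0.60, 0.70), 4))  # Model 1: 0.6462
print(round(f1(0.50, 0.80), 4))  # Model 2: 0.6154 -> Model 1 is better
```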
10. Class Imbalance Problem
• The class imbalance problem arises where the main class of interest is rare.
• The data set distribution reflects a significant majority of the negative class and a minority of the positive class.
• For example, in fraud detection applications, the class of interest (or positive class) is "fraud," which occurs much less frequently than the negative "nonfraudulent" class.
• The class imbalance problem is assessed using sensitivity and specificity:
o Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified).
11. Class Imbalance Problem
• Example: a confusion matrix for medical data where the class values are yes and no for the class label attribute cancer.
• Confusion matrix for the classes cancer = yes and cancer = no:

                      Predicted yes   Predicted no   Total
Actual yes (cancer)         90            210          300
Actual no (healthy)        140           9560         9700
Total                      230           9770        10000

o The sensitivity of the classifier is 90/300 = 30%.
o The specificity of the classifier is 9560/9700 = 98.56%.
o The overall accuracy is 9650/10000 = 96.50%.
• Thus, although the classifier has a high accuracy, its ability to correctly label the positive (rare) class is poor, given its low sensitivity.
• It has high specificity, meaning that it can accurately recognize negative tuples.
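The same numbers checked in Python, showing how accuracy hides the poor sensitivity:

```python
# Cancer example: high accuracy, yet the rare positive class is mostly missed.
tp, fn, fp, tn = 90, 210, 140, 9560
print((tp + tn) / (tp + fp + tn + fn))  # accuracy    = 0.965
print(tp / (tp + fn))                   # sensitivity = 0.30
print(tn / (tn + fp))                   # specificity ~= 0.9856
```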
12. ROC Curves
• A Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) vs. the false positive rate (FPR) as the threshold on the confidence of an instance being positive is varied.
13. ROC Curves
• ROC curves allow visual comparison of models.
• They originated from signal detection theory.
• They show the trade-off between the true positive rate and the false positive rate.
• The area under the ROC curve (AUC) is a measure of the accuracy of the model.
• Rank the test tuples in decreasing order of confidence: the one most likely to belong to the positive class appears at the top of the list.
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
o The vertical axis represents the true positive rate.
o The horizontal axis represents the false positive rate.
o The plot also shows a diagonal line.
o A model with perfect accuracy has an area of 1.0. (See the sketch below.)
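A minimal pure-Python sketch of this procedure (an illustration, not the slides' own code): rank the tuples by decreasing confidence, sweep the threshold, collect (FPR, TPR) points, and estimate the area under the curve with the trapezoid rule. For simplicity it assumes no tied scores:

```python
# Trace a ROC curve by sweeping the decision threshold over the
# classifier's confidence scores, then estimate AUC with the trapezoid rule.
def roc_points(y_true, scores):
    pos = sum(y_true)               # number of actual positives
    neg = len(y_true) - pos         # number of actual negatives
    ranked = sorted(zip(scores, y_true), reverse=True)  # most confident first
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:         # lower the threshold one tuple at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(round(auc(pts), 3))  # 0.889 -- well above 0.5, a useful ranking
```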
14. Predictor Error Measures
• Measuring predictor accuracy: measure how far off the predicted value is from the actual known value.
• Loss function: measures the error between the actual value yᵢ and the predicted value ŷᵢ.
o Absolute error: |yᵢ - ŷᵢ|
o Squared error: (yᵢ - ŷᵢ)²
• Test error: the average loss over the test set.
o Mean absolute error (MAE): MAE = (1/N) Σᵢ |yᵢ - ŷᵢ|
o Mean squared error (MSE): MSE = (1/N) Σᵢ (yᵢ - ŷᵢ)²
o Root mean squared error (RMSE): RMSE = √MSE = √[(1/N) Σᵢ (yᵢ - ŷᵢ)²]
where yᵢ is the actual value, ŷᵢ is the predicted value, N is the number of observations, and the sums run over i = 1, …, N.
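A minimal Python sketch of the three test-error measures (the sample values are illustrative only):

```python
import math

# Test-set error measures over actual values y and predictions y_hat.
def mae(y, y_hat):
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))

y, y_hat = [3.0, 5.0, 2.0], [2.5, 5.5, 3.0]
print(mae(y, y_hat), mse(y, y_hat), rmse(y, y_hat))  # 0.6667 0.5 0.7071
```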
15. Predictor Error Measures
o Relative absolute error (RAE): RAE = Σᵢ |yᵢ - ŷᵢ| / Σᵢ |yᵢ - ȳ|
o Relative squared error (RSE): RSE = Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)²
where ȳ is the mean of the actual values.
• The mean squared error exaggerates the presence of outliers.
• The (square) root of the mean squared error is popularly used; similarly, the root relative squared error.
• R squared (R²):
o Indicates how well the model's predictions approximate the true values.
o 1 indicates a perfect fit and 0 indicates poor performance.
R² = 1 - Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)² = 1 - RSE
(See the sketch below.)
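A minimal Python sketch of the relative errors and R², reusing illustrative values:

```python
# Relative absolute error, relative squared error, and R^2, all measured
# against the mean-of-actuals baseline y_bar (illustrative values only).
y     = [3.0, 5.0, 2.0]   # actual values
y_hat = [2.5, 5.5, 3.0]   # predicted values

y_bar = sum(y) / len(y)
rae = sum(abs(a - p) for a, p in zip(y, y_hat)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / sum((a - y_bar) ** 2 for a in y)
r2  = 1 - rse                                      # R^2 = 1 - RSE
print(round(rae, 4), round(rse, 4), round(r2, 4))  # 0.6 0.3214 0.6786
```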
16. Issues Affecting Model Selection
• Accuracy
o classifier accuracy: predicting class labels
• Speed
o time to construct the model (training time)
o time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
o understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
17. Improving Classification Accuracy of Class-Imbalanced Data
• General approaches for improving the classification accuracy of class-imbalanced data: oversampling and undersampling.
• Both oversampling and undersampling change the distribution of tuples in the training set so that the rare (positive) class is well represented.
18. Cont'd
• Oversampling works by resampling the positive tuples so that the resulting training set contains an equal number of positive and negative tuples.
• Undersampling works by decreasing the number of negative tuples. It randomly eliminates tuples from the majority (negative) class until there are equal numbers of positive and negative tuples.
Both are sketched below.
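A minimal Python sketch of both ideas; the tuple names are hypothetical placeholders:

```python
import random

# Labels: 1 = rare positive class, 0 = majority negative class.
minority = [("pos_tuple_%d" % i, 1) for i in range(3)]
majority = [("neg_tuple_%d" % i, 0) for i in range(9)]

def oversample(minority, majority):
    # Resample positive tuples (with replacement) up to the majority size.
    extra = random.choices(minority, k=len(majority) - len(minority))
    return minority + extra + majority

def undersample(minority, majority):
    # Randomly eliminate majority tuples until the classes are balanced.
    kept = random.sample(majority, k=len(minority))
    return minority + kept

print(len(oversample(minority, majority)))   # 18: 9 positive + 9 negative
print(len(undersample(minority, majority)))  # 6:  3 positive + 3 negative
```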