ML Model Performance Evaluation
1. ML-Chapter Four: Model Performance Evaluation
Belay E., Asst. Prof.
e-mail: belayenyew@gmail.com
Mobile: 0946235206
University of Gondar
College of Informatics
Department of Information Technology
2. Evaluating Model Performance
• In experimental machine learning, we evaluate the accuracy of a model empirically.
• Evaluation is used to compare classifiers/learning algorithms.
• What is an evaluation metric?
o A way to quantify the performance of a machine learning model.
o It is used to evaluate the performance of a machine learning model and to justify using one model in place of another.
• For classification:
o Confusion matrix, accuracy, precision, recall, specificity, F1 score, Precision-Recall (PR) curve, and ROC (Receiver Operating Characteristic) curve.
• For prediction (regression): MAE, MSE, RMSE, and R².
3. Evaluating Classification Model Performance
• Consider a binary classification problem, such as deciding whether a patient has cancer (positive) or is healthy (negative).
• Some common terminology:
o True positives (TP): predicted positive and actually positive.
o False positives (FP): predicted positive but actually negative.
o True negatives (TN): predicted negative and actually negative.
o False negatives (FN): predicted negative but actually positive.
4. Evaluating Model Performance
• Confusion matrix: the above terms are summarized in a confusion matrix.
• TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the classifier is getting things wrong. All four counts can be tallied directly from the predicted and actual labels, as in the sketch below.
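As a quick illustration (not from the slides), here is a minimal Python sketch that tallies the four counts for a binary problem, assuming labels are encoded as 1 (positive) and 0 (negative):

```python
# Tally TP, FP, TN, FN for binary labels encoded as 1 (positive) / 0 (negative).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Example with six test tuples.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```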
5. Evaluating Model Performance
• Accuracy (recognition rate): the number of correct predictions divided by the total number of tuples in the test dataset.
o The most commonly used metric to judge a model.
o It is not a clear indicator of performance when the classes are imbalanced.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Error rate (misclassification rate): error rate = 1 - accuracy, or
Error rate = (FP + FN) / (TP + FP + TN + FN)
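A minimal Python sketch of both formulas; the counts shown are taken from the buys-computer example on the next slide:

```python
# Accuracy and error rate from the four confusion-matrix counts.
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def error_rate(tp, fp, tn, fn):
    return (fp + fn) / (tp + fp + tn + fn)

print(accuracy(6954, 412, 2588, 46))    # 0.9542
print(error_rate(6954, 412, 2588, 46))  # 0.0458
```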
6. Example of Confusion Matrix
• An example of a confusion matrix for the two classes buys-computer = yes (positive) and buys-computer = no (negative):

                        Predicted yes   Predicted no   Total
Actual yes (positive)        6954             46        7000
Actual no (negative)          412           2588        3000
Total                        7366           2634       10000

• Accuracy = (TP + TN)/total = (6954 + 2588)/10000 = 0.9542 ≈ 0.95
• Error rate (misclassification rate) = 1 - accuracy, or
Error rate = (FP + FN)/total = (412 + 46)/10000 = 0.0458 ≈ 0.05
7. Precision and Recall
• Precision: the percentage of positive predictions that are actually positive, i.e., what percentage of tuples labeled as positive are truly positive.
Precision = TP / (TP + FP)
• Recall / Sensitivity / True Positive Rate (TPR): the percentage of actual positive instances that are labeled as positive.
Recall = TPR = Sensitivity = TP / (TP + FN)
o The perfect score is 1.0.
• Specificity: the percentage of actual negative instances that are labeled as negative.
Specificity = TN / (TN + FP)
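A minimal Python sketch of the three metrics, assuming the four confusion-matrix counts are already tallied:

```python
# Precision, recall (sensitivity/TPR), and specificity (TNR).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)
```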
8. Precision and Recall: Example
• Based on the previous confusion matrix:
o Precision = 6954/(6954 + 412) = 0.944 = 94.4%
o Recall = 6954/(6954 + 46) = 0.9934 = 99.34%
o Specificity = 2588/(2588 + 412) = 0.8627 = 86.27%
• A perfect precision score of 1.0 for a class C means that every tuple the classifier labeled as belonging to class C does indeed belong to class C. However, it does not tell us anything about the number of class C tuples that the classifier mislabeled.
• A perfect recall score of 1.0 for C means that every item from class C was labeled as such, but it does not tell us how many other tuples were incorrectly labeled as belonging to class C.
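A quick self-contained check of these figures in Python:

```python
# Check the worked example: TP = 6954, FP = 412, FN = 46, TN = 2588.
tp, fp, fn, tn = 6954, 412, 46, 2588
print(round(tp / (tp + fp), 4))  # precision   = 0.9441
print(round(tp / (tp + fn), 4))  # recall      = 0.9934
print(round(tn / (tn + fp), 4))  # specificity = 0.8627
```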
9. F Measure (F1 Score)
• F measure (F1 or F-score): the harmonic mean of precision and recall.
o It is an alternative way to use precision and recall, combining them into a single measure.
F1 = (2 × precision × recall) / (precision + recall)
• From the previous example: F1 = (2 × 0.944 × 0.9934)/(0.944 + 0.9934) = 1.8755/1.9374 ≈ 0.968
• The higher the F1 score, the better.
• Exercise. Suppose:
o Model 1: recall = 70% and precision = 60%
o Model 2: recall = 80% and precision = 50%
Which model is better? Use the F measure (worked in the sketch below).
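A quick Python check of the exercise; the harmonic mean rewards the model with the more balanced precision and recall:

```python
# F1 as the harmonic mean of precision p and recall r.
def f1(p, r):
    return 2 * p * r / (p + r)

print(round(f1(0.60, 0.70), 4))  # Model 1: 0.6462
print(round(f1(0.50, 0.80), 4))  # Model 2: 0.6154 -> Model 1 is better
```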
10. Class Imbalance Problem
• The class imbalance problem arises where the main class of interest is rare.
• The data set distribution reflects a significant majority of the negative class and a minority of the positive class.
• For example, in fraud detection applications, the class of interest (or positive class) is "fraud," which occurs much less frequently than the negative "nonfraudulent" class.
• The class imbalance problem is assessed using sensitivity and specificity:
o Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified).
11. Class Imbalance Problem
• Example: a confusion matrix for medical data where the class values are yes and no for the class label attribute cancer.
• Confusion matrix for the classes cancer = yes and cancer = no:

                      Predicted yes   Predicted no   Total
Actual yes (cancer)         90            210          300
Actual no (healthy)        140           9560         9700
Total                      230           9770        10000

o The sensitivity of the classifier is 90/300 = 30%.
o The specificity of the classifier is 9560/9700 = 98.56%.
o The overall accuracy is 9650/10000 = 96.50%.
• Thus, although the classifier has a high accuracy, its ability to correctly label the positive (rare) class is poor, given its low sensitivity.
• It has high specificity, meaning that it can accurately recognize negative tuples.
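The same numbers checked in Python, showing how accuracy hides the poor sensitivity:

```python
# Cancer example: high accuracy, yet the rare positive class is mostly missed.
tp, fn, fp, tn = 90, 210, 140, 9560
print((tp + tn) / (tp + fp + tn + fn))  # accuracy    = 0.965
print(tp / (tp + fn))                   # sensitivity = 0.30
print(tn / (tn + fp))                   # specificity ~= 0.9856
```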
12. ROC Curves
• A Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) vs. the false positive rate (FPR) as the threshold on the confidence of an instance being positive is varied.
13. ROC Curves
• ROC curves allow visual comparison of models.
• They originated from signal detection theory.
• They show the trade-off between the true positive rate and the false positive rate.
• The area under the ROC curve (AUC) is a measure of the accuracy of the model.
• Rank the test tuples in decreasing order of confidence: the one most likely to belong to the positive class appears at the top of the list.
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
o The vertical axis represents the true positive rate.
o The horizontal axis represents the false positive rate.
o The plot also shows a diagonal line.
o A model with perfect accuracy has an area of 1.0. (See the sketch below.)
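A minimal pure-Python sketch of this procedure (an illustration, not the slides' own code): rank the tuples by decreasing confidence, sweep the threshold, collect (FPR, TPR) points, and estimate the area under the curve with the trapezoid rule. For simplicity it assumes no tied scores:

```python
# Trace a ROC curve by sweeping the decision threshold over the
# classifier's confidence scores, then estimate AUC with the trapezoid rule.
def roc_points(y_true, scores):
    pos = sum(y_true)               # number of actual positives
    neg = len(y_true) - pos         # number of actual negatives
    ranked = sorted(zip(scores, y_true), reverse=True)  # most confident first
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:         # lower the threshold one tuple at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
print(round(auc(pts), 3))  # 0.889 -- well above 0.5, a useful ranking
```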
14. Predictor Error Measures
• Measuring predictor accuracy: measure how far off the predicted value is from the actual known value.
• Loss function: measures the error between the actual value yᵢ and the predicted value ŷᵢ.
o Absolute error: |yᵢ - ŷᵢ|
o Squared error: (yᵢ - ŷᵢ)²
• Test error: the average loss over the test set.
o Mean absolute error (MAE): MAE = (1/N) Σᵢ |yᵢ - ŷᵢ|
o Mean squared error (MSE): MSE = (1/N) Σᵢ (yᵢ - ŷᵢ)²
o Root mean squared error (RMSE): RMSE = √MSE = √[(1/N) Σᵢ (yᵢ - ŷᵢ)²]
where yᵢ is the actual value, ŷᵢ is the predicted value, N is the number of observations, and the sums run over i = 1, …, N.
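A minimal Python sketch of the three test-error measures (the sample values are illustrative only):

```python
import math

# Test-set error measures over actual values y and predictions y_hat.
def mae(y, y_hat):
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(mse(y, y_hat))

y, y_hat = [3.0, 5.0, 2.0], [2.5, 5.5, 3.0]
print(mae(y, y_hat), mse(y, y_hat), rmse(y, y_hat))  # 0.6667 0.5 0.7071
```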
15. Predictor Error Measures
o Relative absolute error (RAE): RAE = Σᵢ |yᵢ - ŷᵢ| / Σᵢ |yᵢ - ȳ|
o Relative squared error (RSE): RSE = Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)²
where ȳ is the mean of the actual values.
• The mean squared error exaggerates the presence of outliers.
• The (square) root of the mean squared error is popularly used; similarly, the root relative squared error.
• R squared (R²):
o Indicates how well the model's predictions approximate the true values.
o 1 indicates a perfect fit and 0 indicates poor performance.
R² = 1 - Σᵢ (yᵢ - ŷᵢ)² / Σᵢ (yᵢ - ȳ)² = 1 - RSE
(See the sketch below.)
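A minimal Python sketch of the relative errors and R², reusing illustrative values:

```python
# Relative absolute error, relative squared error, and R^2, all measured
# against the mean-of-actuals baseline y_bar (illustrative values only).
y     = [3.0, 5.0, 2.0]   # actual values
y_hat = [2.5, 5.5, 3.0]   # predicted values

y_bar = sum(y) / len(y)
rae = sum(abs(a - p) for a, p in zip(y, y_hat)) / sum(abs(a - y_bar) for a in y)
rse = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / sum((a - y_bar) ** 2 for a in y)
r2  = 1 - rse                                      # R^2 = 1 - RSE
print(round(rae, 4), round(rse, 4), round(r2, 4))  # 0.6 0.3214 0.6786
```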
16. Issues Affecting Model Selection
• Accuracy
o classifier accuracy: predicting class labels
• Speed
o time to construct the model (training time)
o time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
o understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
17. Improving Classification Accuracy of Class-Imbalanced Data
• General approaches for improving the classification accuracy of class-imbalanced data: oversampling and undersampling.
• Both oversampling and undersampling change the distribution of tuples in the training set so that the rare (positive) class is well represented.
18. Cont'd
• Oversampling works by resampling the positive tuples so that the resulting training set contains an equal number of positive and negative tuples.
• Undersampling works by decreasing the number of negative tuples. It randomly eliminates tuples from the majority (negative) class until there are equal numbers of positive and negative tuples.
Both are sketched below.
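A minimal Python sketch of both ideas; the tuple names are hypothetical placeholders:

```python
import random

# Labels: 1 = rare positive class, 0 = majority negative class.
minority = [("pos_tuple_%d" % i, 1) for i in range(3)]
majority = [("neg_tuple_%d" % i, 0) for i in range(9)]

def oversample(minority, majority):
    # Resample positive tuples (with replacement) up to the majority size.
    extra = random.choices(minority, k=len(majority) - len(minority))
    return minority + extra + majority

def undersample(minority, majority):
    # Randomly eliminate majority tuples until the classes are balanced.
    kept = random.sample(majority, k=len(minority))
    return minority + kept

print(len(oversample(minority, majority)))   # 18: 9 positive + 9 negative
print(len(undersample(minority, majority)))  # 6:  3 positive + 3 negative
```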