Classification assessment methods: a detailed tutorial
Alaa Tharwat
Email: engalaatharwat@hotmail.com
Alaa Tharwat 1 / 74
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 2 / 74
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 3 / 74
Introduction
Classification techniques have been applied to many applications in
various fields of science
In classification models, the training data are used for building a
classification model to predict the class label for a new sample
The outputs of classification models can be discrete, as in the
decision tree classifier, or continuous, as in the Naive Bayes classifier
The outputs of learning algorithms need to be assessed and analyzed
carefully and this analysis must be interpreted correctly, so as to
evaluate different learning algorithms.
The classification performance is represented by scalar values as in
different metrics such as accuracy, sensitivity, and specificity
Graphical assessment methods such as Receiver operating
characteristics (ROC) and Precision-Recall curves give different
interpretations of the classification performance
Some of the measures derived from the confusion matrix are also used
for evaluating diagnostic tests
Alaa Tharwat 4 / 74
Introduction
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 5 / 74
Classification Performance
According to the number of classes, there are two types of
classification problems, namely, binary classification where there are
only two classes, and multi-class classification where the number of
classes is higher than two
Assume we have two classes, i.e., binary classification, P for positive
class and N for negative class
An unknown sample is classified as P or N
The classification model that was trained in the training phase is
used to predict the true classes of unknown samples
This classification model produces continuous or discrete outputs.
The discrete output that is generated from a classification model
represents the predicted discrete class label of the unknown/test
sample, while continuous output represents the estimation of the
sample’s class membership probability
Alaa Tharwat 6 / 74
Classification Performance
There are four possible outputs which represent the elements of a
2 × 2 confusion matrix or a contingency table
If the sample is positive and it is classified as positive, i.e., correctly
classified positive sample, it is counted as a true positive (TP); if it is
classified as negative, it is considered as a false negative (FN)
If the sample is negative and it is classified as negative, it is considered
as a true negative (TN); if it is classified as positive, it is counted as a
false positive (FP), i.e., a false alarm
The confusion matrix is used to calculate many common classification
metrics
The green diagonal represents correct predictions and the pink
diagonal indicates the incorrect predictions
                      True/Actual Class
                      Positive (P)          Negative (N)
Predicted True (T)    True Positive (TP)    True Negative (TN)
Predicted False (F)   False Negative (FN)   False Positive (FP)
                      P = TP + FN           N = FP + TN
Figure: An illustrative example of the 2 × 2 confusion matrix. There are two
true classes P and N. The output of the predicted class is true or false.
Alaa Tharwat 7 / 74
Classification Performance confusion matrix for multi-class classification problems
The figure below shows the confusion matrix for a multi-class
classification problem with three classes (A, B, and C)
TPA is the number of true positive samples in class A, i.e., the
number of samples that are correctly classified from class A
EAB is the samples from class A that were incorrectly classified as
class B, i.e., misclassified samples. Thus, the false negative in the A
class (FNA) is the sum of EAB and EAC (FNA = EAB + EAC)
which indicates the sum of all class A samples that were incorrectly
classified as class B or C
Simply, FN of any class which is located in a column can be
calculated by adding the errors in that class/column
The false positive for any predicted class which is located in a row
represents the sum of all errors in that row
               True Class
               A      B      C
Predicted A    TPA    EBA    ECA
Predicted B    EAB    TPB    ECB
Predicted C    EAC    EBC    TPC
Alaa Tharwat 8 / 74
Classification Performance confusion matrix for multi-class classification problems
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 9 / 74
Classification metrics with imbalanced data
Different assessment methods are sensitive to the imbalanced data
when the samples of one class in a dataset outnumber the samples of
the other class(es)
The class distribution is the ratio between the positive and negative
samples (P/N); it represents the relationship between the left column
and the right column of the confusion matrix
Any assessment metric that uses values from both columns will be
sensitive to the imbalanced data
For example, some metrics such as accuracy and precision use values
from both columns of the confusion matrix; thus, as data distributions
change, these metrics will change as well, even if the classifier
performance does not
This fact is only partially true, because some metrics, such as the
Geometric Mean (GM) and Youden's index (YI), use values from
both columns and yet can be used with balanced and imbalanced data
In contrast, metrics that use values from only one column are not
affected by changes in the class distribution
Alaa Tharwat 10 / 74
Classification metrics with imbalanced data
However, some metrics which use values from both columns are not
sensitive to the imbalanced data because the changes in the class
distribution cancel each other
For example, the accuracy is defined as
Acc = (TP + TN)/(TP + TN + FP + FN) and the GM is defined as
GM = √(TPR × TNR) = √(TP/(TP + FN) × TN/(TN + FP)); thus, both metrics
use values from both columns of the confusion matrix
Changing the class distribution can be obtained by
increasing/decreasing the number of samples of the negative/positive class
With the same classification performance, assume that the negative
class samples are increased by α times; thus, the TN and FP values
become αTN and αFP, respectively, and the accuracy becomes
Acc = (TP + αTN)/(TP + αTN + αFP + FN) ≠ (TP + TN)/(TP + TN + FP + FN).
This means that the accuracy is affected by the changes in the class distribution
On the other hand, the GM metric becomes
GM = √(TP/(TP + FN) × αTN/(αTN + αFP)) = √(TP/(TP + FN) × TN/(TN + FP)),
and hence the changes in the negative class cancel each other. This is the reason
why the GM metric is suitable for imbalanced data
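A minimal sketch of this cancellation, using hypothetical confusion-matrix counts and a hypothetical scaling factor α (not taken from the slides): scaling the negative class changes the accuracy but leaves the geometric mean untouched.

```python
from math import sqrt, isclose

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def geometric_mean(tp, tn, fp, fn):
    tpr = tp / (tp + fn)   # sensitivity
    tnr = tn / (tn + fp)   # specificity
    return sqrt(tpr * tnr)

# Hypothetical confusion matrix; the negative class then grows 10x with the
# same classifier behaviour, so TN -> alpha*TN and FP -> alpha*FP.
tp, fn, tn, fp = 70, 30, 80, 20
alpha = 10

print(accuracy(tp, tn, fp, fn), accuracy(tp, alpha * tn, alpha * fp, fn))  # 0.75 vs ~0.79
assert isclose(geometric_mean(tp, tn, fp, fn),
               geometric_mean(tp, alpha * tn, alpha * fp, fn))             # GM unchanged
```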
Alaa Tharwat 11 / 74
Classification metrics with imbalanced data
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 12 / 74
Different Assessment methods Accuracy and error rate
Accuracy (Acc) is one of the most commonly used measures for the
classification performance
Acc = (TP + TN)/(TP + TN + FP + FN)
where P and N indicate the number of positive and negative
samples, respectively.
The complement of the accuracy metric is the Error rate (ERR) or
misclassification rate. This metric represents the proportion of
misclassified samples from both positive and negative classes, and it
is calculated as follows,
ERR = 1 − Acc = (FP + FN)/(TP + TN + FP + FN)
Both accuracy and error rate metrics are sensitive to the imbalanced
data
Another problem with the accuracy is that two classifiers can yield
the same accuracy but perform differently with respect to the types
of correct and incorrect decisions they provide
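A small illustration of the last point, with hypothetical confusion matrices: two classifiers can have identical accuracy while distributing their errors very differently.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion matrices for two classifiers on the same 200 samples.
clf1 = dict(tp=90, fn=10, tn=60, fp=40)   # misses few positives, raises many false alarms
clf2 = dict(tp=60, fn=40, tn=90, fp=10)   # the opposite trade-off

print(accuracy(**clf1), accuracy(**clf2))  # both 0.75
print("ERR:", 1 - accuracy(**clf1))        # error rate = 1 - Acc
```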
Alaa Tharwat 13 / 74
Different Assessment methods Sensitivity and specificity
Sensitivity, True positive rate (TPR), hit rate, or recall of a
classifier represents the ratio of correctly classified positive samples
to the total number of positive samples
Specificity, True negative rate (TNR), or inverse recall is expressed
as the ratio of correctly classified negative samples to the total
number of negative samples
TPR = TP/(TP + FN) = TP/P,  TNR = TN/(FP + TN) = TN/N
Generally, we can consider sensitivity and specificity as two kinds of
accuracy, where the first is for actual positive samples and the second
is for actual negative samples
Sensitivity depends on TP and FN which are in the same column of
the confusion matrix, and similarly, the specificity metric depends on
TN and FP which are in the same column; hence, both sensitivity
and specificity can be used for evaluating the classification
performance with imbalanced data
Alaa Tharwat 14 / 74
Different Assessment methods Sensitivity and specificity
The accuracy can also be defined in terms of sensitivity and
specificity
Acc = (TP + TN)/(TP + TN + FP + FN)
    = TPR × P/(P + N) + TNR × N/(P + N)
    = TP/(TP + FN) × P/(P + N) + TN/(TN + FP) × N/(P + N)
    = TP/(P + N) + TN/(P + N)
    = (TP + TN)/(TP + TN + FP + FN)
Alaa Tharwat 15 / 74
Different Assessment methods False positive and False negative rates
False positive rate (FPR) is also called false alarm rate (FAR), or
Fallout, and it represents the ratio between the incorrectly classified
negative samples to the total number of negative samples. It is the
proportion of the negative samples that were incorrectly classified.
Hence, it complements the specificity
FPR = 1 − TNR = FP/(FP + TN) = FP/N
The False negative rate (FNR) or miss rate is the proportion of
positive samples that were incorrectly classified. Thus, it
complements the sensitivity measure
FNR = 1 − TPR = FN/(FN + TP) = FN/P
Both FPR and FNR are not sensitive to changes in data distributions
and hence both metrics can be used with imbalanced data
Alaa Tharwat 16 / 74
Different Assessment methods Predictive values
Predictive values (positive and negative) reflect the performance of
the prediction
Positive prediction value (PPV ) or precision represents the
proportion of positive samples that were correctly classified to the
total number of positive predicted samples
PPV = Precision = TP/(FP + TP) = 1 − FDR
Negative predictive value (NPV ), inverse precision, or true
negative accuracy (TNA) measures the proportion of negative
samples that were correctly classified to the total number of negative
predicted samples
NPV = TN/(FN + TN) = 1 − FOR
These two measures are sensitive to the imbalanced data
The False discovery rate (FDR) and False omission rate (FOR) measures
complement the PPV and NPV, respectively
Alaa Tharwat 17 / 74
Different Assessment methods Predictive values
The accuracy can be defined in terms of precision and inverse
precision
Acc = (TP + FP)/(P + N) × PPV + (TN + FN)/(P + N) × NPV
    = (TP + FP)/(P + N) × TP/(TP + FP) + (TN + FN)/(P + N) × TN/(TN + FN)
    = (TP + TN)/(TP + TN + FP + FN)
Alaa Tharwat 18 / 74
Different Assessment methods Likelihood ratio
The likelihood ratio combines both sensitivity and specificity, and it
is used in diagnostic tests
In such tests, not all positive results are true positives, and the
same holds for negative results; hence, positive and negative results
change the probability/likelihood of disease. The likelihood ratio
measures the influence of a result on that probability
Positive likelihood (LR+) measures how much the odds of the
disease increases when a diagnostic test is positive. Similarly,
Negative likelihood (LR−) measures how much the odds of the
disease decreases when a diagnostic test is negative
LR+ = TPR/(1 − TNR) = TPR/FPR,  LR− = (1 − TPR)/TNR
Both measures depend on the sensitivity and specificity measures;
thus, they are suitable for balanced and imbalanced data
Alaa Tharwat 19 / 74
Different Assessment methods Likelihood ratio
Both LR+ and LR− can be combined into one measure which
summarizes the performance of the test; this measure is called the
Diagnostic odds ratio (DOR)
The DOR metric represents the ratio between the positive likelihood
ratio to the negative likelihood ratio
This measure is utilized for estimating the discriminative ability of
the test and also for comparing between two diagnostic tests
DOR = LR+/LR− = (TPR/(1 − TNR)) × (TNR/(1 − TPR)) = (TP × TN)/(FP × FN)
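A small sketch computing the likelihood ratios and the diagnostic odds ratio from sensitivity and specificity; the 0.7/0.8 values match the worked example later in the slides.

```python
def likelihood_ratios(tpr, tnr):
    """Return (LR+, LR-, DOR) from sensitivity (TPR) and specificity (TNR)."""
    lr_pos = tpr / (1 - tnr)        # LR+ = TPR / FPR
    lr_neg = (1 - tpr) / tnr        # LR-
    return lr_pos, lr_neg, lr_pos / lr_neg

print(likelihood_ratios(0.7, 0.8))  # -> (3.5, 0.375, 9.33...)
```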
Alaa Tharwat 20 / 74
Different Assessment methods Youden’s index
Youden's index (YI) or Bookmaker Informedness (BM) metric is
one of the well-known diagnostic measures
It evaluates the discriminative power of the test
The formula of Youden’s index combines the sensitivity and
specificity as in the DOR metric
YI = TPR + TNR − 1
The YI metric ranges from zero, when the test is poor, to one,
which represents a perfect diagnostic test. It is also suitable for
imbalanced data
One of the major disadvantages of this metric is that it does not
reflect how the sensitivity and specificity of the test differ from
each other
For example, given two tests, the sensitivity values for the first and
second tests are 0.7 and 0.9, respectively, and the specificity values for
the first and second tests are 0.8 and 0.6, respectively; the YI value
for both tests is 0.5
Alaa Tharwat 21 / 74
Different Assessment methods Other metrics
There are many different metrics that can be calculated from the previous
metrics
1- Matthews correlation coefficient (MCC): this metric
represents the correlation between the observed and predicted
classifications, and it is calculated directly from the confusion matrix
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN))
A coefficient of +1 indicates a perfect prediction, −1 represents total
disagreement between predictions and true values, and zero means that
the prediction is no better than random
This metric is sensitive to imbalanced data
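A minimal MCC computation from the four confusion-matrix counts, using the same hypothetical counts as the binary example later in the slides (TP = 70, TN = 80, FP = 20, FN = 30):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    num = tp * tn - fp * fn
    den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0   # common convention when a denominator factor is zero

print(mcc(tp=70, tn=80, fp=20, fn=30))  # about 0.50
```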
Alaa Tharwat 22 / 74
Different Assessment methods Other metrics
2-Discriminant power (DP): this measure depends on the
sensitivity and specificity
This metric evaluates how well the classification model distinguishes
between positive and negative samples
DP = (√3/π) (log(TPR/(1 − TNR)) + log(TNR/(1 − TPR)))
Since this metric depends on the sensitivity and specificity metrics, it
can be used with imbalanced data.
Alaa Tharwat 23 / 74
Different Assessment methods Other metrics
3-F-measure: this is also called F1-score, and it represents the
harmonic mean of precision and recall
F-measure = 2 × PPV × TPR/(PPV + TPR) = 2TP/(2TP + FP + FN)
The value of the F-measure ranges from zero to one, and high values
of the F-measure indicate high classification performance
Alaa Tharwat 24 / 74
Different Assessment methods Other metrics
F-measure has another variant which is called Fβ-measure. This
variant represents the weighted harmonic mean between precision
and recall
Fβ-measure = (1 + β²) × PPV × TPR/(β² × PPV + TPR) = (1 + β²)TP/((1 + β²)TP + β²FN + FP)
This metric is sensitive to changes in data distributions
Assume that the negative class samples are increased by α times; then
FP becomes αFP and the F-measure becomes 2TP/(2TP + αFP + FN),
and hence this metric is affected by changes in the class distribution
Alaa Tharwat 25 / 74
Different Assessment methods Other metrics
The F-measure uses only three of the four elements of the
confusion matrix, and hence two classifiers with different TNR values
may have the same F-score
Therefore, the Adjusted F-measure (AGF) metric is introduced to
use all elements of the confusion matrix and provide more weights to
samples which are correctly classified in the minority class
AGF = F2 · InvF0.5
where F2 is the F-measure where β = 2 and InvF0.5 is calculated by
building a new confusion matrix where the class label of each sample
is switched (i.e. positive samples become negative and vice versa)
Alaa Tharwat 26 / 74
Different Assessment methods Other metrics
4-Markedness(MK): this is defined based on PPV and NPV
metrics
MK = PPV + NPV − 1
This metric is sensitive to data changes and hence it is not suitable for
imbalanced data
This is because the Markedness metric depends on PPV and NPV
metrics and both PPV and NPV are sensitive to changes in data
distributions
Alaa Tharwat 27 / 74
Different Assessment methods Other metrics
5-Balanced classification rate or balanced accuracy (BCR): this
metric combines the sensitivity and specificity metrics
BCR = (1/2)(TPR + TNR) = (1/2)(TP/(TP + FN) + TN/(TN + FP))
Also, Balance error rate (BER) or Half total error rate (HTER)
represents 1 − BCR
Both BCR and BER metrics can be used with imbalanced datasets
Alaa Tharwat 28 / 74
Different Assessment methods Other metrics
6-Geometric Mean (GM): The main goal of all classifiers is to
improve the sensitivity, without sacrificing the specificity
The Geometric Mean (GM) metric aggregates both sensitivity and
specificity measures
GM = √(TPR × TNR) = √(TP/(TP + FN) × TN/(TN + FP))
Adjusted Geometric Mean (AGM) is proposed to obtain as much
information as possible about each class
AGM = (GM + TNR(FP + TN))/(1 + FP + TN)   if TPR > 0
AGM = 0                                   if TPR = 0
Alaa Tharwat 29 / 74
Different Assessment methods Other metrics
The GM metric can be used with imbalanced datasets
However, changing the distribution of the negative class has a small
influence on the AGM metric, and hence it is not suitable for
imbalanced data
For example, assume the negative class samples are increased by α
times. Then the AGM metric becomes
AGM = (GM + TNR(αFP + αTN))/(1 + αFP + αTN); as a consequence, the AGM metric is
slightly affected by the changes in the class distribution
Alaa Tharwat 30 / 74
Different Assessment methods Other metrics
7-Optimization precision (OP)
OP = Acc − |TPR − TNR|/(TPR + TNR)
where the second term |TPR − TNR|/(TPR + TNR) computes how balanced both
class accuracies are, and the metric represents the difference between
the global accuracy and that term
High OP value indicates high accuracy and well-balanced class
accuracies
Since the OP metric depends on the accuracy metric, it is not
suitable for imbalanced data
Alaa Tharwat 31 / 74
Different Assessment methods Other metrics
8-Jaccard: This metric is also called Tanimoto similarity coefficient
Jaccard metric explicitly ignores the correct classification of negative
samples as follows
Jaccard = TP/(TP + FP + FN)
Jaccard metric is sensitive to changes in data distributions
Alaa Tharwat 32 / 74
Different Assessment methods Other metrics
Visualization of different metrics and the relations between these
metrics. Given two classes, a red class and a blue class, the black
circle represents a classifier that classifies the samples inside the
circle as red samples (belonging to the red class) and the samples
outside the circle as blue samples (belonging to the blue class).
Green regions indicate the correctly classified regions and the red
regions indicate the misclassified regions.
Alaa Tharwat 33 / 74
Different Assessment methods Other metrics
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 34 / 74
Illustrative example
In this section, two examples are introduced. These examples explain how
to calculate classification metrics using two classes or multiple classes
Alaa Tharwat 35 / 74
Illustrative example
Binary classification example:
Assume we have two classes (A and B), i.e., binary classification, and
each class has 100 samples
The A class represents the positive class while the B class represents
the negative class. The number of correctly classified samples in
class A and B are 70 and 80, respectively. Hence, the values of TP,
TN, FP, and FN are 70, 80, 20, and 30, respectively
The values of the different classification metrics are as follows:
Acc = (70 + 80)/(70 + 80 + 20 + 30) = 0.75, TPR = 70/(70 + 30) = 0.7,
TNR = 80/(80 + 20) = 0.8, PPV = 70/(70 + 20) ≈ 0.78, NPV = 80/(80 + 30) ≈ 0.73,
Err = 1 − Acc = 0.25, BCR = (0.7 + 0.8)/2 = 0.75,
FPR = 1 − 0.8 = 0.2, FNR = 1 − 0.7 = 0.3,
F-measure = 2 × 0.78 × 0.7/(0.78 + 0.7) ≈ 0.74,
OP = Acc − |TPR − TNR|/(TPR + TNR) = 0.75 − |0.7 − 0.8|/(0.7 + 0.8) ≈ 0.683,
LR+ = 0.7/(1 − 0.8) = 3.5, LR− = (1 − 0.7)/0.8 = 0.375, DOR = 3.5/0.375 ≈ 9.33,
YI = 0.7 + 0.8 − 1 = 0.5, and Jaccard = 70/(70 + 20 + 30) ≈ 0.583
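The arithmetic above can be reproduced with a short script; this is only a check of the worked example, using the same TP = 70, TN = 80, FP = 20, FN = 30.

```python
tp, tn, fp, fn = 70, 80, 20, 30

acc = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)                    # sensitivity / recall
tnr = tn / (tn + fp)                    # specificity
ppv = tp / (tp + fp)                    # precision
npv = tn / (tn + fn)
f1 = 2 * ppv * tpr / (ppv + tpr)        # harmonic mean of precision and recall
op = acc - abs(tpr - tnr) / (tpr + tnr)
lr_pos, lr_neg = tpr / (1 - tnr), (1 - tpr) / tnr
dor = lr_pos / lr_neg
yi = tpr + tnr - 1
jaccard = tp / (tp + fp + fn)

print(acc, tpr, tnr, round(ppv, 2), round(npv, 2))
print(round(f1, 2), round(op, 3), lr_pos, lr_neg, round(dor, 2), yi, round(jaccard, 3))
```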
Alaa Tharwat 36 / 74
Illustrative example
Imbalanced data:
We increased the number of samples of the B class to 1000 to show
how the classification metrics change when using imbalanced data;
800 samples from class B were correctly classified
The values of TP, TN, FP, and FN are 70, 800, 200, and 30,
respectively
Only the values of accuracy, precision/PPV, NPV, error rate,
optimization precision, F-measure, and Jaccard change, as follows:
Acc = (70 + 800)/(70 + 800 + 200 + 30) ≈ 0.79, PPV = 70/(70 + 200) ≈ 0.26,
NPV = 800/(800 + 30) ≈ 0.96, Err = 1 − Acc = 0.21,
OP = 0.79 − |0.7 − 0.8|/(0.7 + 0.8) ≈ 0.723,
F-measure = 2 × 0.26 × 0.7/(0.26 + 0.7) ≈ 0.38,
and Jaccard = 70/(70 + 200 + 30) ≈ 0.233
Thus, the accuracy, precision, NPV, F-measure, and Jaccard metrics
are sensitive to imbalanced data
Alaa Tharwat 37 / 74
Illustrative example
Multi-classification example:
               True Class
               A      B      C
Predicted A    80     15     0
Predicted B    15     70     10
Predicted C    5      15     90
Figure: Results of a multi-class classification test (our example).
In this example, there are three classes A, B, and C; the results of a
classification test are shown in the figure
The values of TPA, TPB, and TPC are 80, 70, and 90, respectively,
which represent the diagonal
The values of false negative for each class (true class) are calculated
as mentioned before by adding all errors in the column of that class.
Alaa Tharwat 38 / 74
Illustrative example
For example, FNA = EAB + EAC = 15 + 5 = 20, and similarly
FNB = EBA + EBC = 15 + 15 = 30 and
FNC = ECA + ECB = 0 + 10 = 10. The values of false positive for
each class (predicted class) are calculated as mentioned before by
adding all errors in the row of that class
For example, FPA = EBA + ECA = 15 + 0 = 15, and similarly
FPB = EAB + ECB = 15 + 10 = 25 and
FPC = EAC + EBC = 5 + 15 = 20
The value of true negative for class A (TNA) can be calculated
by adding all cells excluding the row and column of
class A; this is similar to the TN in the 2 × 2 confusion matrix
The value of TNA is calculated as follows,
TNA = 70 + 90 + 10 + 15 = 185, and similarly
TNB = 80 + 0 + 5 + 90 = 175 and
TNC = 80 + 15 + 15 + 70 = 180. Since each class has 100 samples,
TN + FP = 200 for every class, which checks these values. Using TP, TN,
FP, and FN we can calculate all classification measures
Alaa Tharwat 39 / 74
Illustrative example
The accuracy is (80 + 70 + 90)/(100 + 100 + 100) = 0.8
The sensitivity and specificity are calculated for each class. For
example, the sensitivity of A is TPA/(TPA + FNA) = 80/(80 + 15 + 5) = 0.8, and
similarly the sensitivity of the B and C classes are 70/(70 + 15 + 15) = 0.7 and
90/(90 + 0 + 10) = 0.9, respectively, and the specificity values of A, B, and C
are 185/(185 + 15) ≈ 0.93, 175/(175 + 25) = 0.875, and 180/(180 + 20) = 0.9, respectively
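A sketch that derives the per-class TP, FN, FP, and TN counts (and sensitivity/specificity) directly from the 3 × 3 confusion matrix of the example; the matrix rows are predicted classes and the columns are true classes, as in the figure above.

```python
# Rows = predicted class, columns = true class, in the order A, B, C.
cm = [
    [80, 15, 0],   # predicted A
    [15, 70, 10],  # predicted B
    [5, 15, 90],   # predicted C
]
classes = ["A", "B", "C"]
total = sum(sum(row) for row in cm)

for k, name in enumerate(classes):
    tp = cm[k][k]
    fn = sum(cm[r][k] for r in range(3)) - tp   # errors in column k (true class k)
    fp = sum(cm[k]) - tp                        # errors in row k (predicted class k)
    tn = total - tp - fn - fp                   # everything outside row k and column k
    print(name, tp, fn, fp, tn,
          "TPR=%.2f" % (tp / (tp + fn)),
          "TNR=%.3f" % (tn / (tn + fp)))
```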
Alaa Tharwat 40 / 74
Illustrative example
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 41 / 74
Receiver Operating Characteristics (ROC)
The receiver operating characteristics (ROC) curve is a
two-dimensional graph in which the TPR represents the y-axis and
FPR is the x-axis
Any classifier that has discrete outputs such as decision trees is
designed to produce only a class decision, i.e., a decision for each
testing sample, and hence it generates only one confusion matrix
which in turn corresponds to one point into the ROC space
For continuous-output classifiers such as the Naive Bayes classifier,
the output is a numeric value, i.e., a score, which represents
the degree to which a sample belongs to a specific class
Alaa Tharwat 42 / 74
Receiver Operating Characteristics (ROC)
The ROC curve is generated by changing the threshold on the
confidence score; hence, each threshold generates only one point in
the ROC curve
Figure: A basic ROC curve (FPR on the x-axis, TPR on the y-axis) showing the
important points A, B, C, and D, the region of better/worse classifiers, and the
optimistic, pessimistic, and expected ROC segments for equally scored samples.
Alaa Tharwat 43 / 74
Receiver Operating Characteristics (ROC)
There are four important points in the ROC curve
The point A, in the lower left corner (0, 0) represents a classifier where
there is no positive classification, while all negative samples are
correctly classified and hence TPR = 0 and FPR = 0
The point C, in the top right corner (1,1), represents a classifier where
all positive samples are correctly classified, while the negative samples
are misclassified
The point D in the lower right corner (1, 0) represents a classifier
where all positive and negative samples are misclassified
The point B in the upper left corner (0, 1) represents a classifier where
all positive and negative samples are correctly classified; thus, this
point represents the perfect classification or the Ideal operating point.
Alaa Tharwat 44 / 74
Receiver Operating Characteristics (ROC)
The perfect classification performance is the green curve which rises
vertically from (0,0) to (0,1) and then horizontally to (1,1). This
curve reflects that the classifier perfectly ranked the positive samples
relative to the negative samples
A point in the ROC space is better than all other points that are in
the southeast, i.e., the points that have lower TPR, higher FPR, or
both
Alaa Tharwat 45 / 74
Receiver Operating Characteristics (ROC)
In the example below, a test set consists of 20 samples from two
classes; each class has ten samples, i.e., ten positive and ten negative
samples
Figure: The 20 test samples of the example with their class labels and classifier
scores, shown before and after sorting by score. After sorting, the (class, score)
pairs in descending order are:
P 0.82, P 0.80, N 0.75, P 0.70, P 0.62, P 0.60, N 0.54, N 0.50, P 0.49, N 0.45,
P 0.40, N 0.39, P 0.37, N 0.32, N 0.30, N 0.26, P 0.23, N 0.21, P 0.19, N 0.10
The thresholds are t1 = ∞, t2 = 0.82, t3 = 0.80, ..., t21 = 0.10, i.e., one
threshold above all scores and one at each score value; the TP, FN, TN, and FP
counts at each threshold are listed in the Table below.
Alaa Tharwat 46 / 74
Receiver Operating Characteristics (ROC)
The initial step to plot the ROC curve is to sort the samples
according to their scores
Next, the threshold value is changed from maximum to minimum to
plot the ROC curve
To scan all samples, the threshold is ranged from ∞ to −∞
The samples are classified into the positive class if their scores are
higher than or equal to the threshold; otherwise, they are classified as
negative
Alaa Tharwat 47 / 74
Receiver Operating Characteristics (ROC)
Changing the threshold value changes the TPR and FPR
At the beginning, the threshold value is set at maximum (t1 = ∞);
hence, all samples are classified as negative samples and the values of
FPR and TPR are zeros and the position of t1 is in the lower left
corner (the point (0,0))
Figure: An illustrative example of the ROC curve (FPR on the x-axis, TPR on the
y-axis), with one point per threshold t1, ..., t21. The values of TPR and FPR of
each point/threshold are calculated in the next Table.
Alaa Tharwat 48 / 74
Receiver Operating Characteristics (ROC)
Next, the threshold value is decreased to 0.82, and the first sample is
classified correctly as a positive sample, and the TPR increased to
0.1, while the FPR remains zero
Increasing the TPR moves the ROC curve up while increasing the
FPR moves the ROC curve to the right as in t4
The ROC curve must pass through the point (0,0) where the
threshold value is ∞ (in which all samples are classified as negative
samples) and the point (1,1) where the threshold is −∞ (in which all
samples are classified as positive samples)
Alaa Tharwat 49 / 74
Receiver Operating Characteristics (ROC)
Threshold TP FN TN FP TPR FPR FNR PPV Acc (%)
t1 = ∞ 0 10 10 0 0 0 1 - 50
t2 = 0.82 1 9 10 0 0.1 0 0.9 1.0 55
t3 = 0.80 2 8 10 0 0.2 0 0.8 1.0 60
t4 = 0.75 2 8 9 1 0.2 0.1 0.8 0.67 55
t5 = 0.70 3 7 9 1 0.3 0.1 0.7 0.75 60
t6 = 0.62 4 6 9 1 0.4 0.1 0.6 0.80 65
t7 = 0.60 5 5 9 1 0.5 0.1 0.5 0.83 70
t8 = 0.54 5 5 8 2 0.5 0.2 0.5 0.71 65
t9 = 0.50 5 5 7 3 0.5 0.3 0.5 0.63 60
t10 = 0.49 6 4 7 3 0.6 0.3 0.4 0.67 65
t11 = 0.45 6 4 6 4 0.6 0.4 0.4 0.60 60
t12 = 0.40 7 3 6 4 0.7 0.4 0.3 0.64 65
t13 = 0.39 7 3 5 5 0.7 0.5 0.3 0.58 60
t14 = 0.37 8 2 5 5 0.8 0.5 0.2 0.62 65
t15 = 0.32 8 2 4 6 0.8 0.6 0.2 0.57 60
t16 = 0.30 8 2 3 7 0.8 0.7 0.2 0.53 55
t17 = 0.26 8 2 2 8 0.8 0.8 0.2 0.50 50
t18 = 0.23 9 1 2 8 0.9 0.8 0.1 0.53 55
t19 = 0.21 9 1 1 9 0.9 0.9 0.1 0.50 50
t20 = 0.19 10 0 1 9 1.0 0.9 0 0.53 55
t21 = 0.10 10 0 0 10 1.0 1.0 0 0.50 50
Alaa Tharwat 50 / 74
Receiver Operating Characteristics (ROC)
It is clear that the ROC curve is a step function. This is because we
only used 20 samples (a finite set of samples) in our example; a
smoother curve is obtained as the number of samples increases
The figure also shows that the best accuracy (70%, see the Table)
is obtained at (0.1, 0.5) with a threshold value of ≥ 0.6, rather
than at ≥ 0.5 as we might expect with balanced data. This means
that the given learning model identifies positive samples better than
negative samples.
Alaa Tharwat 51 / 74
Receiver Operating Characteristics (ROC)
Figure: A visualization of how changing the threshold changes the TP, TN,
FP, and FN values. Each panel shows the score distributions of the positive
and negative classes with one threshold:
(a) t1:  TP = 0,  FN = 10, TN = 10, FP = 0
(b) t3:  TP = 2,  FN = 8,  TN = 10, FP = 0
(c) t8:  TP = 5,  FN = 5,  TN = 8,  FP = 2
(d) t11: TP = 6,  FN = 4,  TN = 6,  FP = 4
(e) t14: TP = 8,  FN = 2,  TN = 5,  FP = 5
(f) t20: TP = 10, FN = 0,  TN = 1,  FP = 9
Alaa Tharwat 52 / 74
Receiver Operating Characteristics (ROC)
t1: The value of this threshold was ∞ and hence all samples are
classified as negative samples. This means that (1) all positive
samples are incorrectly classified; hence, the value of TP is zero, (2)
all negative samples are correctly classified and hence there is no FP
(Panel t1: TP = 0, FN = 10, TN = 10, FP = 0.)
Alaa Tharwat 53 / 74
Receiver Operating Characteristics (ROC)
t3: The threshold value is decreased and, as shown, two
positive samples are correctly classified. Therefore, with respect to the
positive class, only the positive samples whose scores are greater than
or equal to this threshold (t3) are correctly classified, i.e., TP, while
the other positive samples are incorrectly classified, i.e., FN. At this
threshold, all negative samples are still correctly classified; thus, the
value of FP is still zero
(Panel t3: TP = 2, FN = 8, TN = 10, FP = 0.)
Alaa Tharwat 54 / 74
Receiver Operating Characteristics (ROC)
t8: As the threshold is further decreased to 0.54, the threshold line
moves to the left. This means that more positive samples have the
chance to be correctly classified; on the other hand, some negative
samples are misclassified. As a consequence, the values of TP and
FP increase and the values of TN and FN decrease
(Panel t8: TP = 5, FN = 5, TN = 8, FP = 2.)
Alaa Tharwat 55 / 74
Receiver Operating Characteristics (ROC)
t11: This is an important threshold value where the numbers of
errors from both positive and negative classes are equal (i.e.,
TP = TN = 6 and FP = FN = 4)
(Panel t11: TP = 6, FN = 4, TN = 6, FP = 4.)
Alaa Tharwat 56 / 74
Receiver Operating Characteristics (ROC)
t14: Reducing the threshold to 0.37 results in more
correctly classified positive samples, which increases TP and
reduces FN. On the contrary, more negative samples are
misclassified, which increases FP and reduces TN
(Panel t14: TP = 8, FN = 2, TN = 5, FP = 5.)
Alaa Tharwat 57 / 74
Receiver Operating Characteristics (ROC)
t20: Decreasing the threshold value hides the FN area. This is
because all positive samples are correctly classified. Also, from the
figure, it is clear that the FP area is much larger than the area of
TN. This is because 90% of the negative samples are incorrectly
classified, and only 10% of negative samples are correctly classified
(Panel t20: TP = 10, FN = 0, TN = 1, FP = 9.)
Alaa Tharwat 58 / 74
Receiver Operating Characteristics (ROC)
1: Given a set of test samples Stest = {s1, s2, . . . , sn}, where n is the
total number of test samples, and P and N represent the total number of
positive and negative samples, respectively.
2: Sort the samples according to their scores; Ssorted is the sorted list.
3: FP ← 0, TP ← 0, fprev ← −∞, and ROC = [ ].
4: for i = 1 to |Ssorted| do
5:   if f(i) ≠ fprev then
6:     append (FP/N, TP/P) to ROC, fprev ← f(i)
7:   end if
8:   if Ssorted(i) is a positive sample then
9:     TP ← TP + 1.
10:  else
11:    FP ← FP + 1.
12:  end if
13: end for
14: append (FP/N, TP/P) to ROC.
Alaa Tharwat 59 / 74
Receiver Operating Characteristics (ROC)
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 60 / 74
Area under the ROC curve (AUC)
Comparing different classifiers using the ROC curve is not easy, because
there is no single scalar value that represents the expected performance
Therefore, the Area under the ROC curve (AUC) metric is used to
summarize the ROC curve in one value
The AUC score is always bounded between zero and one, and no
realistic classifier should have an AUC lower than 0.5
Figure: An illustrative example of the AUC metric: the ROC curves of two
classifiers, A and B (FPR on the x-axis, TPR on the y-axis).
Alaa Tharwat 61 / 74
Area under the ROC curve (AUC)
The figure below shows the AUC values of two classifiers, A and B;
the AUC of classifier B is greater than that of A, hence it achieves
better overall performance
The gray shaded area is common to both classifiers, while the red
shaded area represents the region where classifier B outperforms
classifier A
It is possible for a classifier with a lower AUC to outperform a
higher-AUC classifier in a specific region. For example, classifier B
outperforms A except at FPR > 0.6, where A has a slight advantage
(blue shaded area)
Alaa Tharwat 62 / 74
Area under the ROC curve (AUC)
However, two classifiers with two different ROC curves may have the
same AUC score
The steps of the AUC algorithm are a slight modification of the
ROC algorithm: instead of only generating ROC points, it
accumulates the areas of successive trapezoids1 under the ROC curve
The AUC score is obtained by summing the areas of these trapezoids
The AUC can be also calculated under the Precision-Recall curve
using the trapezoidal rule as in the ROC curve, and the AUC score of
the perfect classifier in PR curves is one as in ROC curves.
1
A trapezoid is a 4-sided shape with two parallel sides.
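A sketch of the trapezoidal-rule AUC computation over a list of curve points sorted by their x-value; the points below are hypothetical.

```python
def auc_trapezoidal(points):
    """points: list of (x, y) pairs sorted by x. Returns the area under the curve."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0   # area of one trapezoid
    return area

# Works for ROC points (FPR, TPR) and, in the same way, for PR points (recall, precision).
print(auc_trapezoidal([(0.0, 0.0), (0.1, 0.5), (0.4, 0.7), (1.0, 1.0)]))  # hypothetical points
```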
Alaa Tharwat 63 / 74
Area under the ROC curve (AUC)
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 64 / 74
Precision-Recall (PR) curve
Precision and recall metrics are widely used for evaluating the
classification performance. The Precision-Recall (PR) curve has the
same concept of the ROC curve, and it can be generated by
changing the threshold as in ROC
However, the ROC curve shows the relation between sensitivity/recall
(TPR) and 1-specificity (FPR) while the PR curve shows the
relationship between recall and precision
Thus, in the PR curve, the x-axis is the recall and the y-axis is the
precision, i.e., the y-axis of the ROC curve is the x-axis of the PR curve.
Hence, in the PR curve, there is no need for the TN value
Figure: The PR curve of the example (recall on the x-axis, precision on the
y-axis), with one point per threshold t1, ..., t21 and the optimal-classifier
curve shown in green.
Alaa Tharwat 65 / 74
Precision-Recall (PR) curve
In the PR curve, the precision value for the first point is undefined
because the number of positive predictions is zero, i.e., TP = 0 and
FP = 0. This problem can be solved by estimating the first point in
the PR curve from the second point. There are two cases for
estimating the first point depending on the value of TP of the
second point:
1 The number of true positives of the second point is zero: In this case,
since the second point is (0,0), the first point is also (0,0)
2 The number of true positives of the second point is not zero: this is
similar to our example where the second point is (0.1, 1.0). The first
point can be estimated by drawing a horizontal line from the second
point to the y-axis. Thus, the first point is estimated as (0.0, 1.0)
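A sketch (a hypothetical helper, reusing the (score, is_positive) representation from the ROC sketch) that generates PR points by sweeping the sorted samples and applies the first-point rule just described.

```python
def pr_points(samples):
    """samples: list of (score, is_positive). Returns a list of (recall, precision) points."""
    pos = sum(1 for _, y in samples if y)
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)

    pts, tp, fp = [], 0, 0
    for score, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        pts.append((tp / pos, tp / (tp + fp)))   # (recall, precision) after this sample

    # First point: precision is undefined at TP = FP = 0, so estimate it from the
    # next computed point: either (0, 0) or a horizontal extension to the y-axis.
    first = (0.0, 0.0) if pts[0][0] == 0 else (0.0, pts[0][1])
    return [first] + pts
```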
Alaa Tharwat 66 / 74
Precision-Recall (PR) curve
The PR curve is often a zigzag curve; hence, PR curves tend to cross
each other much more frequently than ROC curves
In the PR curve, a curve above another one has better classification
performance
The perfect classification performance in the PR curve is represented
by the green curve, which starts from (0,1), runs horizontally to
(1,1), and then vertically to (1,0), where (0,1) represents a classifier
that achieves 100% precision and 0% recall, (1,1) represents a
classifier that obtains 100% precision and sensitivity (this is the
ideal point in the PR curve), and (1,0) indicates a classifier that
obtains 100% sensitivity and 0% precision
The closer the PR curve is to the upper right corner, the better the
classification performance is. Since the PR curve depends only on
the precision and recall measures, it ignores the performance of
correctly handling negative samples (TN)
Alaa Tharwat 67 / 74
Precision-Recall (PR) curve
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 68 / 74
Biometrics measures
Biometrics matching is slightly different than the other classification
problems and hence it is sometimes called two-instance problem
In this problem, instead of classifying one sample into one of c
groups or classes, the biometric system determines whether two samples
belong to the same group. This can be achieved by identifying an unknown
sample by matching it against all the other known samples
The model assigns the unknown sample to the person with the
most similar score. If this level of similarity is not reached, the
sample is rejected
Theoretically, scores of clients (persons known by the biometric
system) should always be higher than the scores of imposters
(persons who are not known by the system)
In biometric systems, a single threshold separates the two groups of
scores, i.e., clients and imposters
In real applications, imposter samples sometimes generate scores
higher than the scores of some client samples. Accordingly, however
carefully the classification threshold is chosen, some classification
errors will occur
Alaa Tharwat 69 / 74
Biometrics measures
Two of the most commonly used measures in biometrics are the False
acceptance rate (FAR) and False rejection/recognition rate (FRR)
The FAR is also called the false match rate (FMR), and it is the ratio
of the number of false acceptances to the total number of
imposter attempts
Hence, to prevent imposter samples from being falsely accepted
by the model, the similarity score has to exceed a certain
level
Figure: Error rate (%) as a function of the decision threshold: the false
rejection rate (FRR) and false acceptance rate (FAR) curves intersect at the
equal error rate (EER). The score distributions of genuine and imposter
attempts, with one possible threshold value separating match from non-match
decisions, are also shown.
Alaa Tharwat 70 / 74
Biometrics measures
The FRR or false non-match rate (FNMR) measures the likelihood
that the biometric model will incorrectly reject a client, and it
represents the ratio of the number of false rejections to the
total number of client attempts
The Equal error rate (EER) partially solves the problem of selecting a
threshold value; it represents the error rate at the operating point
where FMR and FNMR are equal
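A sketch, using hypothetical similarity-score lists, that sweeps a threshold over genuine and imposter scores, computes FAR and FRR at each threshold, and picks the threshold where they are closest, i.e. an approximation of the EER.

```python
def far_frr(genuine, imposter, threshold):
    """Scores >= threshold are accepted (matched)."""
    far = sum(s >= threshold for s in imposter) / len(imposter)   # false acceptance rate
    frr = sum(s < threshold for s in genuine) / len(genuine)      # false rejection rate
    return far, frr

# Hypothetical similarity scores (not from the slides).
genuine = [0.9, 0.85, 0.8, 0.75, 0.6, 0.55]
imposter = [0.7, 0.5, 0.45, 0.4, 0.3, 0.2]

thresholds = sorted(set(genuine + imposter))
best = min(thresholds, key=lambda t: abs(far_frr(genuine, imposter, t)[0]
                                         - far_frr(genuine, imposter, t)[1]))
print(best, far_frr(genuine, imposter, best))   # threshold closest to the EER point
```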
Alaa Tharwat 71 / 74
Biometrics measures
Detection Error Trade-off (DET) curve is used for evaluating
biometric models
In this curve, as in the ROC and PR curves, the threshold value is
changed and the values of FAR and FRR are calculated at each
threshold
This curve shows the relation between FAR and FRR
The ideal point in this curve is the origin point where the values of
both FRR and FAR are zeros and hence the perfect classification
performance in the DET curve is represented by a green curve
Figure: The DET curve of the example (FRR on the x-axis, FAR on the y-axis),
with one point per threshold t1, ..., t21; the optimal classifier lies at the
origin.
Alaa Tharwat 72 / 74
Biometrics measures
Introduction
Classification Performance
Classification metrics with imbalanced data
Different Assessment methods
Illustrative example
Receiver Operating Characteristics (ROC)
Area under the ROC curve (AUC)
Precision-Recall (PR) curve
Biometrics measures
Conclusions
Alaa Tharwat 73 / 74
Conclusions
For more details, download this paper ”Alaa Tharwat Classification
Assessment Methods, Applied Computing and Informatics, 2018”.
Link to the paper
https://www.sciencedirect.com/science/article/pii/S2210832718301546
Alaa Tharwat 74 / 74
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxNikitaBankoti2
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar TrainingKylaCullinane
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...Sheetaleventcompany
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubssamaasim06
 

Recently uploaded (20)

Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
Air breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animalsAir breathing and respiratory adaptations in diver animals
Air breathing and respiratory adaptations in diver animals
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AI
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptx
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 97 Noida Escorts >༒8448380779 Escort Service
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 

Classification assessment methods

  • 1. See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/327403649 Classification assessment methods: a detailed tutorial Presentation · September 2018 CITATIONS 0 READS 5,817 1 author: Some of the authors of this publication are also working on these related projects: Enzyme Classification and Prediction View project Statistical tests View project Alaa Tharwat Frankfurt University of Applied Sciences 107 PUBLICATIONS   1,031 CITATIONS    SEE PROFILE All content following this page was uploaded by Alaa Tharwat on 03 September 2018. The user has requested enhancement of the downloaded file.
  • 2. Classification assessment method: a detailed tutorial Alaa Tharwat Email: engalaatharwat@hotmail.com Alaa Tharwat 1 / 74
  • 3. Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 2 / 74
  • 4. Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 3 / 74
  • 5. Introduction Classification techniques have been applied to many applications in various fields of sciences In classification models, the training data are used for building a classification model to predict the class label for a new sample The outputs of classification models can be discrete as in the decision tree classifier or continuous as the Naive Bayes classifier The outputs of learning algorithms need to be assessed and analyzed carefully and this analysis must be interpreted correctly, so as to evaluate different learning algorithms. The classification performance is represented by scalar values as in different metrics such as accuracy, sensitivity, and specificity Graphical assessment methods such as Receiver operating characteristics (ROC) and Precision-Recall curves give different interpretations of the classification performance Some of the measures which are derived from the confusion matrix for evaluating a diagnostic test Alaa Tharwat 4 / 74
  • 6. Introduction Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 5 / 74
  • 7. Classification Performance According to the number of classes, there are two types of classification problems, namely, binary classification where there are only two classes, and multi-class classification where the number of classes is higher than two Assume we have two classes, i.e., binary classification, P for positive class and N for negative class An unknown sample is classified to P or N The classification model that was trained in the training phase is used to predict the true classes of unknown samples This classification model produces continuous or discrete outputs. The discrete output that is generated from a classification model represents the predicted discrete class label of the unknown/test sample, while continuous output represents the estimation of the sample’s class membership probability Alaa Tharwat 6 / 74
  • 8. Classification Performance There are four possible outputs which represent the elements of a 2 × 2 confusion matrix or a contingency table. If the sample is positive and it is classified as positive, i.e., a correctly classified positive sample, it is counted as a true positive (TP); if it is classified as negative, it is counted as a false negative (FN). If the sample is negative and it is classified as negative, it is counted as a true negative (TN); if it is classified as positive, it is counted as a false positive (FP), a false alarm. The confusion matrix is used to calculate many common classification metrics. The green diagonal represents correct predictions and the pink diagonal indicates the incorrect predictions. [Figure: An illustrative example of the 2 × 2 confusion matrix, with P = TP + FN and N = FP + TN. There are two true classes P and N. The output of the predicted class is true or false.] Alaa Tharwat 7 / 74
  • 9. Classification Performance confusion matrix for multi-class classification problems The figure below shows the confusion matrix for a multi-class classification problem with three classes (A, B, and C). TPA is the number of true positive samples in class A, i.e., the number of samples that are correctly classified from class A. EAB denotes the samples from class A that were incorrectly classified as class B, i.e., misclassified samples. Thus, the false negative in the A class (FNA) is the sum of EAB and EAC (FNA = EAB + EAC), which indicates the sum of all class A samples that were incorrectly classified as class B or C. Simply, the FN of any class, which is located in a column, can be calculated by adding the errors in that class/column. The false positive for any predicted class, which is located in a row, represents the sum of all errors in that row. [Figure: A 3 × 3 confusion matrix for classes A, B, and C; the diagonal holds TPA, TPB, and TPC, and the off-diagonal entry EXY holds the samples from class X that were classified as class Y.] Alaa Tharwat 8 / 74
  • 10. Classification Performance confusion matrix for multi-class classification problems Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 9 / 74
  • 11. Classification metrics with imbalanced data Different assessment methods are sensitive to imbalanced data, i.e., when the samples of one class in a dataset outnumber the samples of the other class(es). The class distribution, the ratio between the positive and negative samples (P/N), represents the relationship between the left column and the right column of the confusion matrix. Any assessment metric that uses values from both columns will be sensitive to imbalanced data. For example, metrics such as accuracy and precision use values from both columns of the confusion matrix; thus, as the data distribution changes, these metrics change as well, even if the classifier performance does not. This fact is only partially true, because some metrics such as the Geometric Mean (GM) and Youden’s index (YI) also use values from both columns and yet can be used with balanced and imbalanced data. Metrics which use values from only one column, on the other hand, are unaffected by changes in the class distribution. Alaa Tharwat 10 / 74
  • 12. Classification metrics with imbalanced data However, some metrics which use values from both columns are not sensitive to imbalanced data because the changes in the class distribution cancel each other. For example, the accuracy is defined as Acc = (TP + TN)/(TP + TN + FP + FN) and the GM is defined as GM = √(TPR × TNR) = √(TP/(TP + FN) × TN/(TN + FP)); thus, both metrics use values from both columns of the confusion matrix. Changing the class distribution can be obtained by increasing/decreasing the number of samples of the negative/positive class. With the same classification performance, assume that the negative class samples are increased by α times; thus, the TN and FP values become αTN and αFP, respectively, and the accuracy becomes Acc = (TP + αTN)/(TP + αTN + αFP + FN) ≠ (TP + TN)/(TP + TN + FP + FN). This means that the accuracy is affected by the changes in the class distribution. On the other hand, the GM metric becomes GM = √(TP/(TP + FN) × αTN/(αTN + αFP)) = √(TP/(TP + FN) × TN/(TN + FP)), and hence the changes in the negative class cancel each other. This is the reason why the GM metric is suitable for imbalanced data. Alaa Tharwat 11 / 74
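The cancellation argument above can be checked numerically. The following minimal Python sketch (the counts and the scaling factor alpha are illustrative assumptions, not values from the slides) scales the negative-class counts and recomputes both metrics:

    import math

    def acc(tp, tn, fp, fn):
        # accuracy uses values from both columns of the confusion matrix
        return (tp + tn) / (tp + tn + fp + fn)

    def gm(tp, tn, fp, fn):
        # the Geometric Mean combines TPR and TNR, each computed within one column
        tpr = tp / (tp + fn)
        tnr = tn / (tn + fp)
        return math.sqrt(tpr * tnr)

    tp, fn, tn, fp = 70, 30, 80, 20          # hypothetical balanced counts
    alpha = 10                               # scale the negative class by alpha
    print(acc(tp, tn, fp, fn), acc(tp, alpha * tn, alpha * fp, fn))  # 0.75 vs ~0.79
    print(gm(tp, tn, fp, fn), gm(tp, alpha * tn, alpha * fp, fn))    # ~0.748 both times

As expected, accuracy changes with the class distribution while GM does not.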
  • 13. Classification metrics with imbalanced data Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 12 / 74
  • 14. Different Assessment methods Accuracy and error rate Accuracy (Acc) is one of the most commonly used measures of classification performance: Acc = (TP + TN)/(TP + TN + FP + FN), where P and N indicate the number of positive and negative samples, respectively. The complement of the accuracy metric is the Error rate (ERR) or misclassification rate. This metric represents the number of misclassified samples from both positive and negative classes, and it is calculated as follows: ERR = 1 − Acc = (FP + FN)/(TP + TN + FP + FN). Both accuracy and error rate metrics are sensitive to imbalanced data. Another problem with the accuracy is that two classifiers can yield the same accuracy but perform differently with respect to the types of correct and incorrect decisions they provide. Alaa Tharwat 13 / 74
  • 15. Different Assessment methods Sensitivity and specificity Sensitivity, True positive rate (TPR), hit rate, or recall of a classifier represents the ratio of correctly classified positive samples to the total number of positive samples, while specificity, True negative rate (TNR), or inverse recall is the ratio of correctly classified negative samples to the total number of negative samples: TPR = TP/(TP + FN) = TP/P, TNR = TN/(FP + TN) = TN/N. Generally, we can consider sensitivity and specificity as two kinds of accuracy, the first for actual positive samples and the second for actual negative samples. Sensitivity depends on TP and FN, which are in the same column of the confusion matrix, and similarly the specificity metric depends on TN and FP, which are in the same column; hence, both sensitivity and specificity can be used for evaluating the classification performance with imbalanced data. Alaa Tharwat 14 / 74
  • 16. Different Assessment methods Sensitivity and specificity The accuracy can also be defined in terms of sensitivity and specificity: Acc = (TP + TN)/(TP + TN + FP + FN) = TPR × P/(P + N) + TNR × N/(P + N) = TP/(TP + FN) × P/(P + N) + TN/(TN + FP) × N/(P + N) = TP/(P + N) + TN/(P + N) = (TP + TN)/(TP + TN + FP + FN). Alaa Tharwat 15 / 74
  • 17. Different Assessment methods False positive and False negative rates False positive rate (FPR), also called false alarm rate (FAR) or Fallout, represents the ratio between the incorrectly classified negative samples and the total number of negative samples; it is the proportion of negative samples that were incorrectly classified, and hence it complements the specificity: FPR = 1 − TNR = FP/(FP + TN) = FP/N. The False negative rate (FNR) or miss rate is the proportion of positive samples that were incorrectly classified; thus, it complements the sensitivity measure: FNR = 1 − TPR = FN/(FN + TP) = FN/P. Both FPR and FNR are not sensitive to changes in the data distribution, and hence both metrics can be used with imbalanced data. Alaa Tharwat 16 / 74
  • 18. Different Assessment methods Predictive values Predictive values (positive and negative) reflect the performance of the prediction. The Positive predictive value (PPV) or precision represents the proportion of correctly classified positive samples to the total number of samples predicted as positive: PPV = Precision = TP/(FP + TP) = 1 − FDR. The Negative predictive value (NPV), inverse precision, or true negative accuracy (TNA) measures the proportion of correctly classified negative samples to the total number of samples predicted as negative: NPV = TN/(FN + TN) = 1 − FOR. These two measures are sensitive to imbalanced data. The False discovery rate (FDR) and False omission rate (FOR) measures complement the PPV and NPV, respectively. Alaa Tharwat 17 / 74
  • 19. Different Assessment methods Predictive values The accuracy can be defined in terms of precision and inverse precision: Acc = (TP + FP)/(P + N) × PPV + (TN + FN)/(P + N) × NPV = (TP + FP)/(P + N) × TP/(TP + FP) + (TN + FN)/(P + N) × TN/(TN + FN) = (TP + TN)/(TP + TN + FP + FN). Alaa Tharwat 18 / 74
  • 20. Different Assessment methods Likelihood ratio The likelihood ratio combines both sensitivity and specificity, and it is used in diagnostic tests. In such tests, not all positive results are true positives, and the same holds for negative results; hence, positive and negative results change the probability/likelihood of disease, and the likelihood ratio measures the influence of a result on that probability. The Positive likelihood ratio (LR+) measures how much the odds of the disease increase when a diagnostic test is positive, and the Negative likelihood ratio (LR−) measures how much the odds of the disease decrease when a diagnostic test is negative: LR+ = TPR/(1 − TNR) = TPR/FPR, LR− = (1 − TPR)/TNR. Both measures depend on the sensitivity and specificity measures; thus, they are suitable for balanced and imbalanced data. Alaa Tharwat 19 / 74
  • 21. Different Assessment methods Likelihood ratio Both LR+ and LR− are combined into one measure which summarizes the performance of the test; this measure is called the Diagnostic odds ratio (DOR). The DOR metric represents the ratio between the positive likelihood ratio and the negative likelihood ratio. This measure is utilized for estimating the discriminative ability of the test and also for comparing two diagnostic tests: DOR = LR+/LR− = TPR/(1 − TNR) × TNR/(1 − TPR) = (TP × TN)/(FP × FN). Alaa Tharwat 20 / 74
  • 22. Different Assessment methods Youden’s index Youden’s index (YI) or Bookmaker Informedness (BM) is one of the well-known diagnostic tests. It evaluates the discriminative power of the test. The formula of Youden’s index combines the sensitivity and specificity, as the DOR metric does: YI = TPR + TNR − 1. The YI metric ranges from zero, when the test is poor, to one, which represents a perfect diagnostic test. It is also suitable for imbalanced data. One of the major disadvantages of this test is that it is not sensitive to differences between the sensitivity and specificity of the test. For example, given two tests, where the sensitivity values of the first and second tests are 0.7 and 0.9, respectively, and the specificity values of the first and second tests are 0.8 and 0.6, respectively, the YI value for both tests is 0.5. Alaa Tharwat 21 / 74
  • 23. Different Assessment methods Another metrics There are many different metrics that can be calculated from the previous metrics. 1- Matthews correlation coefficient (MCC): this metric represents the correlation between the observed and predicted classifications, and it is calculated directly from the confusion matrix: MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) = (TP/N − TPR × PPV)/√(PPV × TPR(1 − TPR)(1 − PPV)). A coefficient of +1 indicates a perfect prediction, −1 represents total disagreement between prediction and true values, and zero means no better than a random prediction. This metric is sensitive to imbalanced data. Alaa Tharwat 22 / 74
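As a quick sanity check of the MCC bounds, the following short Python sketch (using hypothetical counts, not values from the slides) evaluates the first formula above for a perfect prediction, a total disagreement, and an intermediate case:

    import math

    def mcc(tp, tn, fp, fn):
        # Matthews correlation coefficient computed directly from the confusion matrix
        num = tp * tn - fp * fn
        den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return num / den if den else 0.0

    print(mcc(50, 50, 0, 0))     # perfect prediction  -> 1.0
    print(mcc(0, 0, 50, 50))     # total disagreement  -> -1.0
    print(mcc(70, 80, 20, 30))   # intermediate case   -> ~0.50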
  • 24. Different Assessment methods Another metrics 2- Discriminant power (DP): this measure depends on the sensitivity and specificity. It evaluates how well the classification model distinguishes between positive and negative samples: DP = (√3/π)(log(TPR/(1 − TNR)) + log(TNR/(1 − TPR))). Since this metric depends on the sensitivity and specificity metrics, it can be used with imbalanced data. Alaa Tharwat 23 / 74
  • 25. Different Assessment methods Another metrics 3- F-measure: this is also called the F1-score, and it represents the harmonic mean of precision and recall: F-measure = 2 × PPV × TPR/(PPV + TPR) = 2TP/(2TP + FP + FN). The value of the F-measure ranges from zero to one, and high values of the F-measure indicate high classification performance. Alaa Tharwat 24 / 74
  • 26. Different Assessment methods Another metrics The F-measure has another variant which is called the Fβ-measure. This variant represents the weighted harmonic mean of precision and recall: Fβ-measure = (1 + β²) × PPV × TPR/(β²PPV + TPR) = (1 + β²)TP/((1 + β²)TP + β²FN + FP). This metric is sensitive to changes in the data distribution. Assume that the negative class samples are increased by α times; the F-measure then becomes F-measure = 2TP/(2TP + αFP + FN), and hence this metric is affected by the changes in the class distribution. Alaa Tharwat 25 / 74
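The two equivalent forms of the Fβ-measure can be compared directly; the minimal sketch below (with arbitrary example counts, not taken from the slides) computes the measure once from the counts and once from precision and recall:

    def f_beta(tp, fp, fn, beta=1.0):
        # weighted harmonic mean of precision and recall, written in terms of counts
        b2 = beta ** 2
        return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

    def f_beta_pr(ppv, tpr, beta=1.0):
        # the same measure expressed with precision (PPV) and recall (TPR)
        b2 = beta ** 2
        return (1 + b2) * ppv * tpr / (b2 * ppv + tpr)

    tp, fp, fn = 70, 20, 30                      # hypothetical counts
    ppv, tpr = tp / (tp + fp), tp / (tp + fn)
    print(f_beta(tp, fp, fn, beta=1))            # F1 ~ 0.737
    print(f_beta_pr(ppv, tpr, beta=1))           # same value from PPV and TPR
    print(f_beta(tp, fp, fn, beta=2))            # F2 weights recall more heavily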
  • 27. Different Assessment methods Another metrics The F-measure uses only three of the four elements of the confusion matrix, and hence two classifiers with different TNR values may have the same F-score. Therefore, the Adjusted F-measure (AGF) metric was introduced to use all elements of the confusion matrix and to give more weight to samples which are correctly classified in the minority class: AGF = √(F2 × InvF0.5), where F2 is the F-measure with β = 2 and InvF0.5 is the Fβ-measure with β = 0.5 calculated on a new confusion matrix in which the class label of each sample is switched (i.e., positive samples become negative and vice versa). Alaa Tharwat 26 / 74
  • 28. Different Assessment methods Another metrics 4- Markedness (MK): this is defined based on the PPV and NPV metrics: MK = PPV + NPV − 1. This metric is sensitive to data changes and hence it is not suitable for imbalanced data. This is because the Markedness metric depends on the PPV and NPV metrics, and both PPV and NPV are sensitive to changes in the data distribution. Alaa Tharwat 27 / 74
  • 29. Different Assessment methods Another metrics 5- Balanced classification rate or balanced accuracy (BCR): this metric combines the sensitivity and specificity metrics: BCR = (1/2)(TPR + TNR) = (1/2)(TP/(TP + FN) + TN/(TN + FP)). Also, the Balanced error rate (BER) or Half total error rate (HTER) is defined as 1 − BCR. Both BCR and BER metrics can be used with imbalanced datasets. Alaa Tharwat 28 / 74
  • 30. Different Assessment methods Another metrics 6- Geometric Mean (GM): the main goal of all classifiers is to improve the sensitivity without sacrificing the specificity. The Geometric Mean (GM) metric aggregates both sensitivity and specificity measures: GM = √(TPR × TNR) = √(TP/(TP + FN) × TN/(TN + FP)). The Adjusted Geometric Mean (AGM) was proposed to obtain as much information as possible about each class: AGM = (GM + TNR(FP + TN))/(1 + FP + TN) if TPR > 0, and AGM = 0 if TPR = 0. Alaa Tharwat 29 / 74
  • 31. Different Assessment methods Another metrics The GM metric can be used with imbalanced datasets. However, changing the distribution of the negative class has a small influence on the AGM metric, and hence the AGM is not suitable for imbalanced data. For example, assume the negative class samples are increased by α times; the AGM metric then becomes AGM = (GM + TNR(αFP + αTN))/(1 + αFP + αTN); as a consequence, the AGM metric is slightly affected by the changes in the class distribution. Alaa Tharwat 30 / 74
  • 32. Different Assessment methods Another metrics 7- Optimization precision (OP): OP = Acc − |TPR − TNR|/(TPR + TNR), where the second term |TPR − TNR|/(TPR + TNR) computes how balanced the two class accuracies are, and the metric represents the difference between the global accuracy and that term. A high OP value indicates high accuracy and well-balanced class accuracies. Since the OP metric depends on the accuracy metric, it is not suitable for imbalanced data. Alaa Tharwat 31 / 74
  • 33. Different Assessment methods Another metrics 8- Jaccard: this metric is also called the Tanimoto similarity coefficient. The Jaccard metric explicitly ignores the correct classification of negative samples: Jaccard = TP/(TP + FP + FN). The Jaccard metric is sensitive to changes in the data distribution. Alaa Tharwat 32 / 74
  • 34. Different Assessment methods Another metrics Visualization of different metrics and the relations between these metrics. Given two classes, a red class and a blue class, the black circle represents a classifier that classifies the samples inside the circle as red samples (belonging to the red class) and the samples outside the circle as blue samples (belonging to the blue class). Green regions indicate the correctly classified regions and the red regions indicate the misclassified regions. Alaa Tharwat 33 / 74
  • 35. Different Assessment methods Another metrics Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 34 / 74
  • 36. Illustrative example In this section, two examples are introduced. These examples explain how to calculate classification metrics using two classes or multiple classes Alaa Tharwat 35 / 74
  • 37. Illustrative example Binary classification example: Assume we have two classes (A and B), i.e., binary classification, and each class has 100 samples. The A class represents the positive class while the B class represents the negative class. The numbers of correctly classified samples in classes A and B are 70 and 80, respectively. Hence, the values of TP, TN, FP, and FN are 70, 80, 20, and 30, respectively. The values of the different classification metrics are as follows: Acc = (70+80)/(70+80+20+30) = 0.75, TPR = 70/(70+30) = 0.7, TNR = 80/(80+20) = 0.8, PPV = 70/(70+20) ≈ 0.78, NPV = 80/(80+30) ≈ 0.73, Err = 1 − Acc = 0.25, BCR = (1/2)(0.7 + 0.8) = 0.75, FPR = 1 − 0.8 = 0.2, FNR = 1 − 0.7 = 0.3, F-measure = 2 × 0.78 × 0.7/(0.78 + 0.7) ≈ 0.74, OP = Acc − |TPR − TNR|/(TPR + TNR) = 0.75 − |0.7 − 0.8|/(0.7 + 0.8) ≈ 0.683, LR+ = 0.7/(1 − 0.8) = 3.5, LR− = (1 − 0.7)/0.8 = 0.375, DOR = 3.5/0.375 ≈ 9.33, YI = 0.7 + 0.8 − 1 = 0.5, and Jaccard = 70/(70+20+30) ≈ 0.583. Alaa Tharwat 36 / 74
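The numbers of this binary example can be reproduced with a few lines of Python; the sketch below uses only the four counts given above:

    tp, tn, fp, fn = 70, 80, 20, 30   # counts of the binary example

    tpr = tp / (tp + fn)                               # sensitivity / recall = 0.7
    tnr = tn / (tn + fp)                               # specificity = 0.8
    ppv = tp / (tp + fp)                               # precision ~ 0.78
    npv = tn / (tn + fn)                               # ~ 0.73
    acc = (tp + tn) / (tp + tn + fp + fn)              # 0.75
    bcr = 0.5 * (tpr + tnr)                            # 0.75
    f1 = 2 * ppv * tpr / (ppv + tpr)                   # ~ 0.74
    op = acc - abs(tpr - tnr) / (tpr + tnr)            # ~ 0.683
    lr_pos, lr_neg = tpr / (1 - tnr), (1 - tpr) / tnr  # 3.5 and 0.375
    dor = lr_pos / lr_neg                              # ~ 9.33
    yi = tpr + tnr - 1                                 # 0.5
    jaccard = tp / (tp + fp + fn)                      # ~ 0.583
    print(acc, tpr, tnr, ppv, npv, f1, op, dor, yi, jaccard)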
  • 38. Illustrative example Imbalanced data: We increased the number of samples of the B class to 1000 to show how the classification metrics change when using imbalanced data, and 800 samples from class B were correctly classified. The values of TP, TN, FP, and FN are 70, 800, 200, and 30, respectively. Only the values of accuracy, precision/PPV, NPV, error rate, optimization precision, F-measure, and Jaccard change, as follows: Acc = (70+800)/(70+800+200+30) ≈ 0.79, PPV = 70/(70+200) ≈ 0.26, NPV = 800/(800+30) ≈ 0.96, Err = 1 − Acc = 0.21, OP = 0.79 − |0.7 − 0.8|/(0.7 + 0.8) ≈ 0.723, F-measure = 2 × 0.26 × 0.7/(0.26 + 0.7) ≈ 0.38, and Jaccard = 70/(70+200+30) ≈ 0.233. Thus, the accuracy, precision, NPV, F-measure, and Jaccard metrics are sensitive to imbalanced data. Alaa Tharwat 37 / 74
  • 39. Illustrative example Multi-classification example: in this example, there are three classes A, B, and C, and the results of a classification test are shown in the confusion matrix below (rows are the predicted classes and columns are the true classes):
               True A | True B | True C
  Predicted A:    80  |   15   |    0
  Predicted B:    15  |   70   |   10
  Predicted C:     5  |   15   |   90
  Figure: Results of a multi-class classification test (our example). The values of TPA, TPB, and TPC are 80, 70, and 90, respectively, which form the diagonal. The values of the false negatives for each class (true class) are calculated, as mentioned before, by adding all errors in the column of that class. Alaa Tharwat 38 / 74
  • 40. Illustrative example For example, FNA = EAB + EAC = 15 + 5 = 20, and similarly FNB = EBA + EBC = 15 + 15 = 30 and FNC = ECA + ECB = 0 + 10 = 10. The values of the false positives for each class (predicted class) are calculated, as mentioned before, by adding all errors in the row of that class. For example, FPA = EBA + ECA = 15 + 0 = 15, and similarly FPB = EAB + ECB = 15 + 10 = 25 and FPC = EAC + EBC = 5 + 15 = 20. The value of the true negatives for class A (TNA) can be calculated by adding all cells excluding the row and column of class A; this is similar to the TN in the 2 × 2 confusion matrix. The value of TNA is calculated as follows: TNA = 70 + 90 + 10 + 15 = 185, and similarly TNB = 80 + 0 + 5 + 90 = 175 and TNC = 80 + 15 + 15 + 70 = 180. (Equivalently, the TN of each class equals the total number of samples minus the TP, FN, and FP of that class.) Using TP, TN, FP, and FN we can calculate all classification measures. Alaa Tharwat 39 / 74
  • 41. Illustrative example The accuracy is (80 + 70 + 90)/(100 + 100 + 100) = 0.8. The sensitivity and specificity are calculated for each class. For example, the sensitivity of A is TPA/(TPA + FNA) = 80/(80 + 15 + 5) = 0.8, and similarly the sensitivity of the B and C classes is 70/(70 + 15 + 15) = 0.7 and 90/(90 + 0 + 10) = 0.9, respectively, and the specificity values of A, B, and C are 185/(185 + 15) ≈ 0.93, 175/(175 + 25) ≈ 0.88, and 180/(180 + 20) = 0.9, respectively. Alaa Tharwat 40 / 74
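The per-class values of this multi-class example can be extracted programmatically. The sketch below is a minimal illustration, assuming rows are predicted classes and columns are true classes as in the figure; it derives TP, FN, FP, and TN for each class and the corresponding sensitivity and specificity:

    import numpy as np

    labels = ["A", "B", "C"]
    cm = np.array([[80, 15,  0],    # predicted A
                   [15, 70, 10],    # predicted B
                   [ 5, 15, 90]])   # predicted C  (columns: true A, B, C)

    total = cm.sum()
    for i, name in enumerate(labels):
        tp = cm[i, i]
        fn = cm[:, i].sum() - tp     # errors in the true-class column
        fp = cm[i, :].sum() - tp     # errors in the predicted-class row
        tn = total - tp - fn - fp    # everything outside row i and column i
        tpr = tp / (tp + fn)         # sensitivity of this class
        tnr = tn / (tn + fp)         # specificity of this class
        print(name, tp, fn, fp, tn, round(tpr, 3), round(tnr, 3))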
  • 42. Illustrative example Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 41 / 74
  • 43. Receiver Operating Characteristics (ROC) The receiver operating characteristics (ROC) curve is a two-dimensional graph in which the TPR represents the y-axis and the FPR is the x-axis. Any classifier that has discrete outputs, such as decision trees, is designed to produce only a class decision, i.e., a decision for each testing sample, and hence it generates only one confusion matrix, which in turn corresponds to one point in the ROC space. For continuous-output classifiers such as the Naive Bayes classifier, the output is a numeric value, i.e., a score, which represents the degree to which a sample belongs to a specific class. Alaa Tharwat 42 / 74
  • 44. Receiver Operating Characteristics (ROC) The ROC curve is generated by changing the threshold on the confidence score; hence, each threshold generates only one point in the ROC curve. [Figure: A basic ROC curve showing important points, and the optimistic, pessimistic, and expected ROC segments for equally scored samples.] Alaa Tharwat 43 / 74
  • 45. Receiver Operating Characteristics (ROC) There are four important points in the ROC curve. The point A, in the lower left corner (0, 0), represents a classifier where there is no positive classification, while all negative samples are correctly classified and hence TPR = 0 and FPR = 0. The point C, in the top right corner (1, 1), represents a classifier where all positive samples are correctly classified, while the negative samples are misclassified. The point D, in the lower right corner (1, 0), represents a classifier where all positive and negative samples are misclassified. The point B, in the upper left corner (0, 1), represents a classifier where all positive and negative samples are correctly classified; thus, this point represents the perfect classification or the ideal operating point. Alaa Tharwat 44 / 74
  • 46. Receiver Operating Characteristics (ROC) The perfect classification performance is the green curve which rises vertically from (0,0) to (0,1) and then horizontally to (1,1). This curve reflects that the classifier perfectly ranked the positive samples relative to the negative samples A point in the ROC space is better than all other points that are in the southeast, i.e., the points that have lower TPR, higher FPR, or both Alaa Tharwat 45 / 74
  • 47. Receiver Operating Characteristics (ROC) In the example below, a test set consists of 20 samples from two classes; each class has ten samples, i.e., ten positive and ten negative samples. Sorted by decreasing score, the (class, score) pairs are:
  P 0.82, P 0.80, N 0.75, P 0.70, P 0.62, P 0.60, N 0.54, N 0.50, P 0.49, N 0.45,
  P 0.40, N 0.39, P 0.37, N 0.32, N 0.30, N 0.26, P 0.23, N 0.21, P 0.19, N 0.10
  The thresholds t1 = ∞, t2 = 0.82, t3 = 0.80, ..., t20 = 0.19, t21 = 0.1 are placed at ∞ and at each distinct score. Alaa Tharwat 46 / 74
  • 48. Receiver Operating Characteristics (ROC) The initial step in plotting the ROC curve is to sort the samples according to their scores. Next, the threshold value is changed from maximum to minimum to plot the ROC curve. To scan all samples, the threshold ranges from ∞ to −∞. A sample is classified into the positive class if its score is higher than or equal to the threshold; otherwise, it is classified as negative (see the sorted table on the previous slide). Alaa Tharwat 47 / 74
  • 49. Receiver Operating Characteristics (ROC) Changing the threshold value changes the TPR and FPR. At the beginning, the threshold value is set to the maximum (t1 = ∞); hence, all samples are classified as negative, the values of FPR and TPR are zero, and the position of t1 is in the lower left corner (the point (0, 0)). [Figure: An illustrative example of the ROC curve; the values of TPR and FPR of each point/threshold are calculated in the next table.] Alaa Tharwat 48 / 74
  • 50. Receiver Operating Characteristics (ROC) Next, the threshold value is decreased to 0.82; the first sample is classified correctly as a positive sample, and the TPR increases to 0.1 while the FPR remains zero. Increasing the TPR moves the ROC curve up, while increasing the FPR moves the ROC curve to the right, as at t4. The ROC curve must pass through the point (0, 0), where the threshold value is ∞ (all samples are classified as negative), and the point (1, 1), where the threshold is −∞ (all samples are classified as positive). Alaa Tharwat 49 / 74
  • 51. Receiver Operating Characteristics (ROC)
  Threshold    TP  FN  TN  FP  TPR  FPR  FNR  PPV   Acc(%)
  t1  = ∞      0   10  10  0   0    0    1    -     50
  t2  = 0.82   1   9   10  0   0.1  0    0.9  1.0   55
  t3  = 0.80   2   8   10  0   0.2  0    0.8  1.0   60
  t4  = 0.75   2   8   9   1   0.2  0.1  0.8  0.67  55
  t5  = 0.70   3   7   9   1   0.3  0.1  0.7  0.75  60
  t6  = 0.62   4   6   9   1   0.4  0.1  0.6  0.80  65
  t7  = 0.60   5   5   9   1   0.5  0.1  0.5  0.83  70
  t8  = 0.54   5   5   8   2   0.5  0.2  0.5  0.71  65
  t9  = 0.50   5   5   7   3   0.5  0.3  0.5  0.63  60
  t10 = 0.49   6   4   7   3   0.6  0.3  0.4  0.67  65
  t11 = 0.45   6   4   6   4   0.6  0.4  0.4  0.60  60
  t12 = 0.40   7   3   6   4   0.7  0.4  0.3  0.64  65
  t13 = 0.39   7   3   5   5   0.7  0.5  0.3  0.58  60
  t14 = 0.37   8   2   5   5   0.8  0.5  0.2  0.62  65
  t15 = 0.32   8   2   4   6   0.8  0.6  0.2  0.57  60
  t16 = 0.30   8   2   3   7   0.8  0.7  0.2  0.53  55
  t17 = 0.26   8   2   2   8   0.8  0.8  0.2  0.50  50
  t18 = 0.23   9   1   2   8   0.9  0.8  0.1  0.53  55
  t19 = 0.21   9   1   1   9   0.9  0.9  0.1  0.50  50
  t20 = 0.19   10  0   1   9   1.0  0.9  0    0.53  55
  t21 = 0.10   10  0   0   10  1.0  1.0  0    0.50  50
  Alaa Tharwat 50 / 74
  • 52. Receiver Operating Characteristics (ROC) It is clear that the ROC curve is a step function. This is because we only used 20 samples (a finite set of samples) in our example; a smoother curve is obtained as the number of samples increases. The figure also shows that the best accuracy (70%, see the table) is obtained at (0.1, 0.5), when the threshold value was ≥ 0.6, rather than at ≥ 0.5 as we might expect with balanced data. This means that the given learning model identifies positive samples better than negative samples. Alaa Tharwat 51 / 74
  • 53. Receiver Operating Characteristics (ROC) [Figure: A visualization of how changing the threshold changes the TP, TN, FP, and FN values, shown for six thresholds over the positive- and negative-class score distributions: (a) t1: TP = 0, FN = 10, TN = 10, FP = 0; (b) t3: TP = 2, FN = 8, TN = 10, FP = 0; (c) t8: TP = 5, FN = 5, TN = 8, FP = 2; (d) t11: TP = 6, FN = 4, TN = 6, FP = 4; (e) t14: TP = 8, FN = 2, TN = 5, FP = 5; (f) t20: TP = 10, FN = 0, TN = 1, FP = 9.] Alaa Tharwat 52 / 74
  • 54. Receiver Operating Characteristics (ROC) t1: The value of this threshold is ∞ and hence all samples are classified as negative. This means that (1) all positive samples are incorrectly classified, so the value of TP is zero, and (2) all negative samples are correctly classified, so there is no FP. Alaa Tharwat 53 / 74
  • 55. Receiver Operating Characteristics (ROC) t3: The threshold value is decreased and, as shown, two positive samples are now correctly classified. Therefore, for the positive class, only the positive samples whose scores are greater than or equal to this threshold (t3) are correctly classified, i.e., TP, while the other positive samples are incorrectly classified, i.e., FN. At this threshold, all negative samples are still correctly classified; thus, the value of FP is still zero. Alaa Tharwat 54 / 74
  • 56. Receiver Operating Characteristics (ROC) t8: As the threshold is further decreased to 0.54, the threshold line moves to the left. This means that more positive samples have the chance to be correctly classified; on the other hand, some negative samples are misclassified. As a consequence, the values of TP and FP increase and the values of TN and FN decrease. Alaa Tharwat 55 / 74
  • 57. Receiver Operating Characteristics (ROC) t11: This is an important threshold value where the numbers of correctly classified samples from the positive and negative classes are equal (TP = TN = 6) and the numbers of errors from both classes are equal (FP = FN = 4). Alaa Tharwat 56 / 74
  • 58. Receiver Operating Characteristics (ROC) t14: Reducing the value of the threshold to 0.37 results in more correctly classified positive samples, which increases TP and reduces FN. On the contrary, more negative samples are misclassified, which increases FP and reduces TN. Alaa Tharwat 57 / 74
  • 59. Receiver Operating Characteristics (ROC) t20: Decreasing the threshold to this value removes the FN area, because all positive samples are now correctly classified. Also, from the figure it is clear that the FP area is much larger than the TN area: 90% of the negative samples are incorrectly classified, and only 10% of the negative samples are correctly classified. Alaa Tharwat 58 / 74
  • 60. Receiver Operating Characteristics (ROC) 1: Given a set of test samples (Stest = {s1, s2, . . . , sN}), where N is the total number of test samples, and P and N represent the total number of positive and negative samples, respectively. 2: Sort the samples according to their scores; Ssorted is the sorted samples. 3: FP ← 0, TP ← 0, fprev ← −∞, and ROC = [ ]. 4: for i = 1 to |Ssorted| do 5: if f(i) ≠ fprev then 6: ROC(i) ← (FP/N, TP/P), fprev ← f(i) 7: end if 8: if Ssorted(i) is a positive sample then 9: TP ← TP + 1. 10: else 11: FP ← FP + 1. 12: end if 13: end for 14: ROC(i) ← (FP/N, TP/P). Alaa Tharwat 59 / 74
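A direct Python translation of the algorithm above might look as follows (a sketch, not the author's code); it is applied here to the class labels and scores of the 20-sample example:

    def roc_points(scores, labels):
        # labels: True for positive samples, False for negative samples
        pos = sum(labels)
        neg = len(labels) - pos
        # sorting by decreasing score is equivalent to sweeping the threshold
        # from +infinity down to -infinity
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        tp = fp = 0
        prev_score = None
        points = []
        for i in order:
            if scores[i] != prev_score:          # emit a point before each new score value
                points.append((fp / neg, tp / pos))
                prev_score = scores[i]
            if labels[i]:
                tp += 1
            else:
                fp += 1
        points.append((fp / neg, tp / pos))      # final point is (1, 1)
        return points

    labels = [c == "P" for c in "PPNPPPNNPNPNPNNNPNPN"]   # classes of the 20 samples
    scores = [0.82, 0.80, 0.75, 0.70, 0.62, 0.60, 0.54, 0.50, 0.49, 0.45,
              0.40, 0.39, 0.37, 0.32, 0.30, 0.26, 0.23, 0.21, 0.19, 0.10]
    print(roc_points(scores, labels))            # 21 (FPR, TPR) points, from (0,0) to (1,1)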
  • 61. Receiver Operating Characteristics (ROC) Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 60 / 74
  • 62. Area under the ROC curve (AUC) Comparing different classifiers using the ROC curve is not easy, because there is no scalar value that represents the expected performance. Therefore, the Area under the ROC curve (AUC) metric is used to summarize the ROC curve in a single value. The AUC score is always bounded between zero and one, and no realistic classifier has an AUC lower than 0.5. [Figure: An illustrative example of the AUC metric.] Alaa Tharwat 61 / 74
  • 63. Area under the ROC curve (AUC) The figure shows the AUC values of two classifiers, A and B; the AUC of classifier B is greater than that of A, and hence B achieves the better overall performance. The gray shaded area is common to both classifiers, while the red shaded area represents the region where classifier B outperforms classifier A. It is possible for a lower-AUC classifier to outperform a higher-AUC classifier in a specific region. For example, classifier B outperforms A except at FPR > 0.6, where A has a slight advantage (the blue shaded area). Alaa Tharwat 62 / 74
  • 64. Area under the ROC curve (AUC) However, two classifiers with two different ROC curves may have the same AUC score. The steps of the AUC algorithm are a slight modification of the ROC algorithm: instead of only generating ROC points, the AUC algorithm accumulates the areas of the trapezoids1 under the ROC curve. The AUC score can thus be calculated by adding the areas of these trapezoids. The AUC can also be calculated under the Precision-Recall curve using the trapezoidal rule, as for the ROC curve, and the AUC score of the perfect classifier in PR curves is one, as in ROC curves. 1 A trapezoid is a 4-sided shape with two parallel sides. Alaa Tharwat 63 / 74
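A minimal sketch of the trapezoidal rule over a list of ROC points is shown below; the example points are assumptions used only to check the two boundary cases:

    def auc_trapezoidal(points):
        # points: (FPR, TPR) pairs sorted by increasing FPR
        area = 0.0
        for (x1, y1), (x2, y2) in zip(points, points[1:]):
            area += (x2 - x1) * (y1 + y2) / 2.0   # area of one trapezoid
        return area

    # a perfect classifier's ROC passes through (0,0), (0,1), (1,1): AUC = 1
    print(auc_trapezoidal([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))
    # a random classifier lies on the diagonal: AUC = 0.5
    print(auc_trapezoidal([(0.0, 0.0), (1.0, 1.0)]))

Feeding the 21 ROC points of the 20-sample example (the roc_points sketch above) into this function gives an AUC of about 0.68.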
  • 65. Area under the ROC curve (AUC) Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 64 / 74
  • 66. Precision-Recall (PR) curve Precision and recall metrics are widely used for evaluating the classification performance. The Precision-Recall (PR) curve has the same concept as the ROC curve, and it can be generated by changing the threshold as in the ROC curve. However, the ROC curve shows the relation between sensitivity/recall (TPR) and 1 − specificity (FPR), while the PR curve shows the relationship between recall and precision. Thus, in the PR curve, the x-axis is the recall and the y-axis is the precision, i.e., the y-axis of the ROC curve is the x-axis of the PR curve. Hence, in the PR curve, there is no need for the TN value. [Figure: The PR curve of the 20-sample example, with the thresholds t1 to t21 and the optimal classifier curve.] Alaa Tharwat 65 / 74
  • 67. Precision-Recall (PR) curve In the PR curve, the precision value of the first point is undefined because the number of positive predictions is zero, i.e., TP = 0 and FP = 0. This problem can be solved by estimating the first point in the PR curve from the second point. There are two cases, depending on the value of TP of the second point: (1) the number of true positives of the second point is zero: in this case, since the second point is (0, 0), the first point is also (0, 0); (2) the number of true positives of the second point is not zero: this is similar to our example, where the second point is (0.1, 1.0). The first point can be estimated by drawing a horizontal line from the second point to the y-axis; thus, the first point is estimated as (0.0, 1.0). Alaa Tharwat 66 / 74
  • 68. Precision-Recall (PR) curve The PR curve is often a zigzag curve; hence, PR curves tend to cross each other much more frequently than ROC curves. In the PR curve, a curve above another has a better classification performance. The perfect classification performance in the PR curve is represented by a green curve which starts from (0, 1), runs horizontally to (1, 1), and then drops vertically to (1, 0), where (0, 1) represents a classifier that achieves 100% precision and 0% recall, (1, 1) represents a classifier that obtains 100% precision and sensitivity (the ideal point in the PR curve), and (1, 0) indicates a classifier that obtains 100% sensitivity and 0% precision. The closer the PR curve is to the upper right corner, the better the classification performance. Since the PR curve depends only on the precision and recall measures, it ignores the performance of correctly handling negative samples (TN). Alaa Tharwat 67 / 74
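The PR curve of a scored test set can be generated with the same threshold sweep used for the ROC curve; the sketch below (illustrative only, with a toy four-sample input) also applies the first-point estimation rule described above:

    def pr_points(scores, labels):
        # sweep the threshold from high to low, recording (recall, precision) pairs
        pos = sum(labels)
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        tp = fp = 0
        points = []
        for i in order:
            if labels[i]:
                tp += 1
            else:
                fp += 1
            points.append((tp / pos, tp / (tp + fp)))
        # estimate the first point by copying the precision of the first defined point
        points.insert(0, (0.0, points[0][1]))
        return points

    labels = [True, True, False, True]           # toy example
    scores = [0.9, 0.8, 0.7, 0.6]
    print(pr_points(scores, labels))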
  • 69. Precision-Recall (PR) curve Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 68 / 74
  • 70. Biometrics measures Biometric matching is slightly different from other classification problems and hence it is sometimes called a two-instance problem. In this problem, instead of classifying one sample into one of c groups or classes, the biometric system determines whether two samples belong to the same group. This can be achieved by identifying an unknown sample by matching it against all the other known samples. The model assigns the unknown sample to the person with the most similar score; if this level of similarity is not reached, the sample is rejected. Theoretically, the scores of clients (persons known by the biometric system) should always be higher than the scores of imposters (persons who are not known by the system). In biometric systems, a single threshold separates the two groups of scores, i.e., clients and imposters. In real applications, imposter samples sometimes generate scores higher than the scores of some client samples. Accordingly, however perfectly the classification threshold is chosen, some classification errors occur. Alaa Tharwat 69 / 74
  • 71. Biometrics measures Two of the most commonly used measures in biometrics are the False acceptance rate (FAR) and the False rejection/recognition rate (FRR). The FAR is also called the false match rate (FMR), and it is the ratio between the number of false acceptances and the total number of imposter attempts. Hence, to prevent imposter samples from being easily accepted by the model, the similarity score has to exceed a certain level. [Figure: FAR and FRR as functions of the decision threshold over the genuine and imposter score distributions; the two error-rate curves cross at the Equal error rate (EER).] Alaa Tharwat 70 / 74
  • 72. Biometrics measures The FRR or false non-match rate (FNMR) measures the likelihood that the biometric model incorrectly rejects a client, and it represents the ratio between the number of false rejections and the total number of client attempts. The Equal error rate (EER) measure partially solves the problem of selecting a threshold value, and it represents the error rate at the threshold where the values of FMR and FNMR are equal. [Figure: FAR and FRR versus the decision threshold; the EER is the point where the two curves meet.] Alaa Tharwat 71 / 74
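FAR, FRR, and the EER can be estimated from two sets of matching scores. The following sketch is only an illustration under assumed Gaussian score distributions; real systems would use their measured genuine and imposter scores:

    import numpy as np

    def far_frr_eer(genuine, impostor):
        # sweep the decision threshold over all observed scores and record FAR and FRR
        genuine, impostor = np.asarray(genuine), np.asarray(impostor)
        thresholds = np.sort(np.concatenate([genuine, impostor]))
        far = np.array([(impostor >= t).mean() for t in thresholds])  # false matches
        frr = np.array([(genuine < t).mean() for t in thresholds])    # false non-matches
        i = np.argmin(np.abs(far - frr))       # threshold where FAR and FRR are closest
        return far, frr, thresholds[i], (far[i] + frr[i]) / 2.0

    rng = np.random.default_rng(0)
    genuine = rng.normal(0.7, 0.1, 1000)       # hypothetical client scores
    impostor = rng.normal(0.4, 0.1, 1000)      # hypothetical imposter scores
    far, frr, threshold, eer = far_frr_eer(genuine, impostor)
    print(threshold, eer)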
  • 73. Biometrics measures The Detection Error Trade-off (DET) curve is used for evaluating biometric models. For this curve, as for the ROC and PR curves, the threshold value is changed and the values of FAR and FRR are calculated at each threshold. The curve shows the relation between FAR and FRR. The ideal point in this curve is the origin, where the values of both FRR and FAR are zero, and hence the perfect classification performance in the DET curve is represented by a green curve near the origin. [Figure: The DET curve (FAR versus FRR) of the 20-sample example, with the thresholds t1 to t21 and the optimal classifier curve.] Alaa Tharwat 72 / 74
  • 74. Biometrics measures Introduction Classification Performance Classification metrics with imbalanced data Different Assessment methods Illustrative example Receiver Operating Characteristics (ROC) Area under the ROC curve (AUC) Precision-Recall (PR) curve Biometrics measures Conclusions Alaa Tharwat 73 / 74
  • 75. Conclusions For more details, download the paper: Alaa Tharwat, "Classification Assessment Methods", Applied Computing and Informatics, 2018. Link to the paper: https://www.sciencedirect.com/science/article/pii/S2210832718301546 Alaa Tharwat 74 / 74