Classifier performance evaluation
Václav Hlaváč
Czech Technical University in Prague, Faculty of Electrical Engineering
Department of Cybernetics, Center for Machine Perception
121 35 Praha 2, Karlovo náměstí 13, Czech Republic
http://cmp.felk.cvut.cz/˜hlavac, hlavac@fel.cvut.cz
LECTURE PLAN

Classifier performance as the statistical
hypothesis testing.

Criterion functions.

Confusion matrix, characteristics.

Receiver operating characteristic (ROC) curve.
APPLICATION DOMAINS

Classifiers (our main concern in this
lecture), 1940s – radars, 1970s – medical
diagnostics, 1990s – data mining.

Regression.

Estimation of probability densities.
2/44
Classifier experimental evaluation
To what degree should we believe the learned classifier?

Classifiers (both supervised and unsupervised) are learned (trained) on a
finite training multiset (named simply training set in the sequel for
simplicity).

A learned classifier has to be tested on a different test set experimentally.

In the run mode, the classifier performs on data different from those on which it was learned.

The experimental performance on the test data is a proxy for the
performance on unseen data. It checks the classifier’s generalization ability.

There is a need for a criterion function assessing the classifier performance
experimentally, e.g., its error rate, accuracy, or expected Bayesian risk (to be
discussed later). There is also a need to compare classifiers experimentally.
3/44
Evaluation is an instance of hypothesis testing

Evaluation has to be treated as hypothesis testing in statistics.

The value of the population parameter has to be statistically inferred based
on the sample statistics (i.e., a training set in pattern recognition).
[Diagram: statistical sampling draws a sample from the population; the sample statistic (e.g., accuracy = 97%) is used in statistical inference to estimate the population parameter (e.g., accuracy = 95%).]
4/44
Danger of overfitting

Learning the training data too precisely usually leads to poor classification
results on new data.

Classifier has to have the ability to generalize.
[Figure: underfit, fit, overfit.]
5/44
Training vs. test data
Problem: Finite data are available only and have to be used both for training and
testing.

More training data gives better generalization.

More test data gives better estimate for the classification error probability.

Never evaluate performance on training data. The conclusion would be
optimistically biased.
Partitioning of available finite set of data to training / test sets.

Hold out.

Cross validation.

Bootstrap.
Once evaluation is finished, all the available data can be used to train the final
classifier.
6/44
Hold out method

Given data is randomly partitioned into two independent sets.
• Training multi-set (e.g., 2/3 of data) for the statistical model
construction, i.e. learning the classifier.
• Test set (e.g., 1/3 of data) is held out for estimating the accuracy of
the classifier.

Random sampling is a variation of the hold-out method:
repeat the hold out k times; the accuracy is estimated as the average of the
accuracies obtained.
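A minimal sketch of the hold-out split and its random-sampling variant in Python/NumPy; the 2/3–1/3 proportions are the slide's example values, and the user-supplied `train_and_score` callable is an assumption of this sketch.

```python
import numpy as np

def hold_out_split(X, y, train_fraction=2/3, seed=0):
    """Randomly partition (X, y) into independent training and held-out test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]

def random_sampling_accuracy(X, y, train_and_score, k=10):
    """Repeat the hold out k times; estimate accuracy as the average of the k runs."""
    accs = [train_and_score(*hold_out_split(X, y, seed=i)) for i in range(k)]
    return float(np.mean(accs))
```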
7/44
K-fold cross validation

The training set is randomly divided
into K disjoint sets of equal size
where each part has roughly the same
class distribution.

The classifier is trained K times, each
time with a different set held out as a
test set.

The estimated error is the mean of
these K errors.

Ron Kohavi, A Study of
Cross-Validation and Bootstrap for
Accuracy Estimation and Model
Selection, IJCAI 1995.
Example, courtesy TU Graz
Courtesy: the presentation of Jin Tian, Iowa State Univ..
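A sketch of K-fold cross-validation in plain NumPy, keeping roughly the same class distribution in each fold by splitting every class separately; `train_and_error` is an assumed user-supplied callable that trains a classifier and returns its test error.

```python
import numpy as np

def k_fold_error(X, y, train_and_error, K=10, seed=0):
    """Train K times, each time holding one fold out as the test set;
    the estimated error is the mean of the K test errors."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(K)]
    for label in np.unique(y):                              # split each class separately so
        idx = rng.permutation(np.flatnonzero(y == label))   # fold class distributions stay similar
        for k, chunk in enumerate(np.array_split(idx, K)):
            folds[k].extend(chunk.tolist())
    errors = []
    for k in range(K):
        test = np.array(folds[k])
        train = np.hstack([folds[j] for j in range(K) if j != k])
        errors.append(train_and_error(X[train], y[train], X[test], y[test]))
    return float(np.mean(errors))
```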
8/44
Leave-one-out

A special case of K-fold cross validation with K = n, where n is the total
number of samples in the training multiset.

n experiments are performed using n − 1 samples for training and the
remaining sample for testing.

It is rather computationally expensive.

Leave-one-out cross-validation does not guarantee the same class
distribution in training and test data!
The extreme case:
50% class A, 50% class B. Predict majority class label in the training data.
True error 50%; Leave-one-out error estimate 100%!
Courtesy: The presentation of Jin Tian, Iowa State Univ..
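The extreme case above can be verified in a few lines; the majority-class predictor plays the role of the 'classifier' in this sketch.

```python
import numpy as np

y = np.array([0] * 10 + [1] * 10)             # 50% class A, 50% class B
errors = 0
for i in range(len(y)):
    train = np.delete(y, i)                   # leave sample i out
    majority = np.bincount(train).argmax()    # majority label of the remaining samples
    errors += int(majority != y[i])           # the held-out sample is always in the
                                              # class that became the minority
print(errors / len(y))                        # 1.0, i.e. a 100% leave-one-out error
```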
9/44
Bootstrap aggregating, called also bagging

The bootstrap uses sampling with replacement to form the training set.

Given: the training set T consisting of n entries.

Bootstrap generates m new datasets Ti, each of size n0 ≤ n, by sampling T
uniformly with replacement. The consequence is that some entries can be
repeated in Ti.

In a special case (called the .632 bootstrap), when n0 = n and n is large, Ti is
expected to contain 1 − 1/e ≈ 63.2% unique samples. The rest are duplicates.

The m statistical models (e.g., classifiers, regressors) are learned using the
above m bootstrap samples.

The statistical models are combined, e.g. by averaging the output (for
regression) or by voting (for classification).

Proposed in: Breiman, Leo (1996). Bagging predictors. Machine Learning 24
(2): 123–140.
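A sketch of bootstrap sampling and of bagging by voting; it illustrates the ≈ 63.2% unique-sample fraction for n0 = n. The `models` objects with a `predict` method returning integer class labels are an assumed ingredient, not something defined on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
T = np.arange(n)                                   # indices of the n training entries

Ti = rng.choice(T, size=n, replace=True)           # one bootstrap sample, n0 = n
print(len(np.unique(Ti)) / n)                      # ≈ 0.632 = 1 - 1/e for large n

def bagging_predict(models, X):
    """Combine m classifiers (each learned on its own bootstrap sample) by voting.
    Assumes non-negative integer class labels."""
    votes = np.stack([m.predict(X) for m in models])            # shape (m, len(X))
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```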
10/44
Recommended experimental validation procedure

Use K-fold cross-validation (K = 5 or K = 10) for estimating performance
measures (accuracy, etc.).

Compute the mean value of the performance estimate, its standard deviation,
and confidence intervals.

Report mean values of performance estimates and their standard deviations
or 95% confidence intervals around the mean.
Courtesy: the presentation of Jin Tian, Iowa State Univ..
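A sketch of the reporting step: the mean, the sample standard deviation, and a normal-approximation 95% confidence interval over the K fold accuracies. The z-value 1.96 is an assumption of this sketch (a t-quantile is more exact for small K), and the accuracies in the example call are made up.

```python
import numpy as np

def summarize(fold_accuracies):
    a = np.asarray(fold_accuracies, dtype=float)
    mean = a.mean()
    std = a.std(ddof=1)                          # sample standard deviation
    half = 1.96 * std / np.sqrt(len(a))          # ~95% CI half-width (normal approx.)
    return mean, std, (mean - half, mean + half)

print(summarize([0.91, 0.88, 0.93, 0.90, 0.89]))   # accuracies from K = 5 folds
```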
11/44
Criterion function to assess classifier performance

Accuracy, error rate.
• Accuracy is the percent of correct classifications.
• Error rate is the percent of incorrect classifications.
• Accuracy = 1 - Error rate.
• Problems with the accuracy:
Assumes equal costs for misclassification.
Assumes a relatively uniform class distribution (cf. 0.5% of patients with a
certain disease in the population).

Other characteristics derived from the confusion matrix (to be explained
later).

Expected Bayesian risk (to be explained later).
12/44
Confusion matrix, two classes only
The confusion matrix is also called
the contingency table.
                    predicted negative                predicted positive
actual negative     a: TN - True Negative             b: FP - False Positive
                    (correct rejections)              (false alarms, type I error)
actual positive     c: FN - False Negative            d: TP - True Positive
                    (misses, type II error,           (hits)
                     overlooked danger)
R. Kohavi, F. Provost: Glossary of
terms, Machine Learning, Vol. 30,
No. 2/3, 1998, pp. 271-274.
Performance measures calculated from the
confusion matrix entries:

Accuracy = (a + d)/(a + b + c + d) =
(TN + TP)/total

True positive rate, recall, sensitivity =
d/(c + d) = TP/actual positive

Specificity, true negative rate = a/(a + b) =
TN/actual negative

Precision, positive predictive value =
d/(b + d) = TP/predicted positive

False positive rate, false alarm = b/(a + b)
= FP/actual negative = 1 - specificity

False negative rate = c/(c + d) =
FN/actual positive
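The measures above, computed from the four counts; the argument names follow the slide's convention a = TN, b = FP, c = FN, d = TP, and the numbers in the example call are made up.

```python
def confusion_measures(TN, FP, FN, TP):
    total = TN + FP + FN + TP
    return {
        "accuracy":            (TN + TP) / total,
        "true positive rate":  TP / (FN + TP),    # recall, sensitivity
        "true negative rate":  TN / (TN + FP),    # specificity
        "precision":           TP / (FP + TP),    # positive predictive value
        "false positive rate": FP / (TN + FP),    # false alarm = 1 - specificity
        "false negative rate": FN / (FN + TP),
    }

print(confusion_measures(TN=90, FP=10, FN=5, TP=95))
```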
13/44
Confusion matrix, # of classes > 2

The toy example (courtesy Stockman) shows predicted and true class labels
of optical character recognition for numerals 0-9. There were 100 examples
of each number class available for the evaluation. Empirical performance is
given in percent.

The classifier allows the reject option, class label R.

Notice, e.g., unavoidable confusion between 4 and 9.
class j predicted by a classifier
true class i ‘0’ ‘1’ ‘2’ ‘3’ ‘4’ ‘5’ ‘6’ ‘7’ ‘8’ ‘9’ ‘R’
‘0’ 97 0 0 0 0 0 1 0 0 1 1
‘1’ 0 98 0 0 1 0 0 1 0 0 0
‘2’ 0 0 96 1 0 1 0 1 0 0 1
‘3’ 0 0 2 95 0 1 0 0 1 0 1
‘4’ 0 0 0 0 98 0 0 0 0 2 0
‘5’ 0 0 0 1 0 97 0 0 0 0 2
‘6’ 1 0 0 0 0 1 98 0 0 0 0
‘7’ 0 0 1 0 0 0 0 98 0 0 1
‘8’ 0 0 0 1 0 0 1 0 96 1 1
‘9’ 1 0 0 0 3 1 0 0 0 95 0
14/44
Unequal costs of decisions

Examples:
Medical diagnosis: The cost of falsely indicated breast cancer in population
screening is smaller than the cost of missing a true disease.
Defense against ballistic missiles: The cost of missing a real attack is much
higher than the cost of false alarm.

Bayesian risk is able to represent unequal costs of decisions.

We will show that there is a tradeoff between the a priori probability of the class
and the induced cost.
15/44
Criterion, Bayesian risk
For a set of observations X, a set of hidden states Y, and a set of decisions D, a statistical
model given by the joint probability pXY : X × Y → R, a penalty function
W : Y × D → R, and a decision strategy Q : X → D, the Bayesian risk is given as
R(Q) = Σ_{x ∈ X} Σ_{y ∈ Y} pXY(x, y) W(y, Q(x)) .

In many practical tasks, it is difficult to fulfil the assumption that the statistical
model pXY is known.

If the Bayesian risk were calculated on the finite (training) set, it
would be too optimistic.

Two substitutes for the Bayesian risk are used in practical tasks:
• Expected risk.
• Structural risk.
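A sketch of the risk formula on a toy discrete model; the particular joint probability p_XY and penalty W below are made-up illustrative numbers, not part of the lecture.

```python
import numpy as np

p_xy = np.array([[0.20, 0.05],        # p_XY(x, y): rows are observations x,
                 [0.15, 0.15],        # columns are hidden states y; sums to 1
                 [0.05, 0.40]])
W = np.array([[0.0, 5.0],             # W[y, d]: penalty for deciding d
              [1.0, 0.0]])            # when the true state is y

def bayes_risk(Q):
    """R(Q) = sum over x and y of p_XY(x, y) * W(y, Q(x)) for a strategy Q: x -> d."""
    return sum(p_xy[x, y] * W[y, Q[x]]
               for x in range(p_xy.shape[0])
               for y in range(p_xy.shape[1]))

print(bayes_risk([0, 0, 1]), bayes_risk([0, 1, 1]))   # risks of two different strategies
```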
16/44
Two paradigms for learning classifiers

Choose a class Q of decision functions (classifiers) q: X → Y .

Find q∗ ∈ Q by minimizing some criterion function on the training set that
approximates the risk R(q) (which cannot be computed).

Learning paradigm is defined by the criterion function:
1. Expected risk minimization in which the true risk is approximated by
the error rate on the training set,
Remp(q(x, Θ)) = (1/L) Σ_{i=1}^{L} W(yi, q(xi, Θ)) ,   Θ∗ = argmin_Θ Remp(q(x, Θ)) .
Examples: Perceptron, Neural nets (Back-propagation), etc.
2. Structural risk minimization, which introduces the guaranteed
risk J(Q), R(Q) ≤ J(Q).
Example: SVM (Support Vector Machines).
17/44
Problem of unknown class distribution and costs
We already know: The class distribution and the costs of each error determine
the goodness of classifiers.
Additional problem:

In many circumstances, until the application time, we do not know the
class distribution and/or it is difficult to estimate the cost matrix. E.g., an
email spam filter.

Statistical models have to be learned beforehand.
Possible solution: Incremental learning.
18/44
Unbalanced problems and data

Classes often have unequal frequencies.
• Medical diagnosis: 95 % healthy, 5% disease.
• e-Commerce: 99 % do not buy, 1 % buy.
• Security: 99.999 % of citizens are not terrorists.

Similar situation for multiclass classifiers.

A majority-class classifier can be 99 % correct but useless.
Example: OCR, 99 % correct, an error on every second line. This is why
OCR is not widely used.

How should we train classifiers and evaluate them for unbalanced problems?
19/44
Balancing unbalanced data
Two class problems:

Build a balanced training set, use it for classifier training.
• Randomly select desired number of minority class instances.
• Add equal number of randomly selected majority class instances.

Build a balanced test set (different from training set, of course) and
test the classifier using it.
Multiclass problems:

Generalize ‘balancing’ to multiple classes.

Ensure that each class is represented with approximately equal
proportions in training and test datasets.
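A sketch of the two-class balancing procedure above (undersampling the majority class); the 0/1 labels and the choice to keep all minority instances are assumptions of this sketch.

```python
import numpy as np

def balanced_subset(X, y, minority_label=1, seed=0):
    """All minority instances plus an equal number of randomly selected
    majority instances, returned in shuffled order."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    chosen = rng.choice(majority, size=len(minority), replace=False)
    idx = rng.permutation(np.concatenate([minority, chosen]))
    return X[idx], y[idx]
```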
20/44
Thoughts about balancing
Natural hesitation: Balancing the training set changes the underlying statistical
problem. Can we change it?
Hesitation is justified. The statistical problem is indeed changed.
Good news:

Balancing can be seen as the change of the penalty function. In the
balanced training set, the majority class patterns occur less often which
means that the penalty assigned to them is lowered proportionally to their
relative frequency.
R(q∗) = min_{q ∈ D} Σ_{x ∈ X} Σ_{y ∈ Y} pXY(x, y) W(y, q(x))
R(q∗) = min_{q(x) ∈ D} Σ_{x ∈ X} Σ_{y ∈ Y} p(x) pY|X(y|x) W(y, q(x))

This modified problem is what the end user usually wants as she/he is
interested in a good performance for the minority class.
21/44
Scalar characteristics are not good
for evaluating performance

Scalar characteristics such as the accuracy, expected cost, or area under the ROC curve
(AUC, to be explained soon) do not provide enough information.

We are interested in:
• How are errors distributed across the classes?
• How will each classifier perform in different testing conditions (costs or
class ratios other than those measured in the experiment)?

Two numbers – the true positive rate and the false positive rate – are much more
informative than a single number.

These two numbers are better visualized by a curve, e.g., by a Receiver
Operating Characteristic (ROC), which informs about:
• Performance for all possible misclassification costs.
• Performance for all possible class ratios.
• Under what conditions does the classifier c1 outperform the classifier c2?
22/44
ROC – Receiver Operating Characteristic

Also often called the ROC curve.

Originates in WWII processing of radar
signals.

Useful for evaluating the performance of dichotomic classifiers.

Characterizes degree of overlap of classes
for a single feature.

The decision is based on a single threshold Θ
(also called the operating point).

Generally, false alarms go up with attempts
to detect higher percentages of true
objects.

A graphical plot showing (hit rate, false
alarm rate) pairs.

Different ROC curves correspond to
different classifiers. The single curve is the
result of changing threshold Θ.
[Figure: a ROC curve in the (false positive rate, true positive rate) plane; both axes range from 0 to 100 %.]
23/44
A model problem
Receiver of weak radar signals

Suppose a receiver detecting a single weak pulse (e.g., a radar reflection
from a plane, a dim flash of light).

A dichotomic decision, two hidden states y1, y2:
• y1 – a plane is not present (true negative) or
• y2 – a plane is present (true positive).

Assume a simple statistical model – two Gaussians.

Internal signal of the receiver, voltage x with the mean
µ1 when the plane (external signal) is not present and
µ2 when the plane is present.

Random variables due to random noise in the receiver and outside of it.
p(x|yi) = N(µi, σi²), i = 1, 2.
24/44
Four probabilities involved

Let us suppose that the involved probability distributions are Gaussians and that
the correct decisions of the receiver are known.

The mean values µ1, µ2, the standard deviations σ1, σ2, and the threshold Θ
are known too.

There are four conditional probabilities involved:
• Hit (true positive) p(x > Θ | x ∈ y2).
• False alarm, type I error (false positive) p(x > Θ | x ∈ y1).
• Miss, overlooked danger, type II error (false negative) p(x < Θ | x ∈ y2).
• Correct rejection (true negative) p(x < Θ | x ∈ y1).
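Under the two-Gaussian model these four probabilities are tail areas of normal distributions; a sketch using scipy.stats.norm, where the particular values of µ, σ, and Θ are illustrative assumptions rather than values from the lecture.

```python
from scipy.stats import norm

mu1, sigma1 = 4.0, 1.0          # y1: plane not present
mu2, sigma2 = 6.0, 1.0          # y2: plane present
theta = 5.0                     # decision threshold on the voltage x

hit         = 1 - norm.cdf(theta, mu2, sigma2)   # p(x > Θ | y2), true positive
false_alarm = 1 - norm.cdf(theta, mu1, sigma1)   # p(x > Θ | y1), false positive
miss        = norm.cdf(theta, mu2, sigma2)       # p(x < Θ | y2), false negative
correct_rej = norm.cdf(theta, mu1, sigma1)       # p(x < Θ | y1), true negative
print(hit, false_alarm, miss, correct_rej)
```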
25/44
Radar receiver example, graphically
[Figure: the two class-conditional densities p(x|y1) and p(x|y2) over the voltage x, with the decision threshold Θ marked; the areas above Θ mark the hits and the false alarms.]
Any decision threshold Θ on the voltage x, x > Θ, determines:
Hits – their probability is the hatched area under the curve p(x|y2).
False alarms – their probability is the red filled area under the curve p(x|y1).
Courtesy to Silvio Maffei for pointing to the mistake in the picture.
26/44
ROC – Example A
Two Gaussians with equal variances

Two Gaussians, µ1 = 4.0, µ2 = 6.0, σ1 = σ2 = 1.0

Less overlap, better discriminability.

In this special case, ROC is convex.
[Figure: the two class-conditional densities p(x|y) and the corresponding convex ROC curve in the (false positive rate, true positive rate) plane.]
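A sketch that traces the ROC of Example A by sweeping the threshold Θ and collecting (false positive rate, true positive rate) pairs; matplotlib is assumed available for the plot.

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

mu1, mu2, sigma = 4.0, 6.0, 1.0                 # Example A parameters
thetas = np.linspace(0.0, 10.0, 500)            # sweep the operating point Θ
fpr = 1 - norm.cdf(thetas, mu1, sigma)          # false alarms: p(x > Θ | y1)
tpr = 1 - norm.cdf(thetas, mu2, sigma)          # hits:         p(x > Θ | y2)

plt.plot(fpr, tpr)
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.title("ROC for two equal-variance Gaussians")
plt.show()
```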
27/44
ROC – Example B
Two Gaussians with equal variances

Two Gaussians, µ1 = 4.0, µ2 = 5.0, σ1 = σ2 = 1.0

More overlap, worse discriminability.

In this special case, ROC is convex.
[Figure: the two class-conditional densities p(x|y) and the corresponding convex ROC curve.]
28/44
ROC – Example C
Two Gaussians with equal variances

Two Gaussians, µ1 = 4.0, µ2 = 4.1, σ1 = σ2 = 1.0

Almost total overlap, almost no discriminability.

In this special case, ROC is convex.
[Figure: the two almost completely overlapping densities p(x|y).]
29/44
ROC and the likelihood ratio

Under the assumption of a Gaussian signal corrupted by Gaussian noise,
the slope of the ROC curve equals the likelihood ratio
L(x) = p(x|noise) / p(x|signal) .

In the even more special case, when the standard deviations σ1 = σ2:
• L(x) increases monotonically with x. Consequently, the ROC becomes a convex curve.
• The optimal threshold Θ becomes Θ = p(noise)/p(signal).
30/44
ROC – Example D
Two Gaussians with different variances

Two Gaussians, µ1 = 4.0, µ2 = 6.0, σ1 = 1.0, σ2 = 2.0

Less overlap, better discriminability.

In general, ROC is not convex.
[Figure: the two class-conditional densities p(x|y) and the corresponding ROC curve, which is not convex in general.]
31/44
ROC – Example E
Two Gaussians with different variances

Two Gaussians, µ1 = 4.0, µ2 = 4.5, σ1 = 1.0, σ2 = 2.0

More overlap, worse discriminability.

In general, ROC is not convex.
[Figure: the two class-conditional densities p(x|y) and the corresponding ROC curve, which is not convex in general.]
32/44
ROC – Example F
Two Gaussians with different variances

Two Gaussians, µ1 = 4.0, µ2 = 4.0, σ1 = 1.0, σ2 = 2.0

Maximal overlap, the worst discriminability.

In general, ROC is not convex.
[Figure: the two densities p(x|y) with equal means and different variances.]
33/44
Properties of the ROC space
[Figure: the ROC space with the false positive rate on the horizontal axis and the true positive rate on the vertical axis. The diagonal corresponds to the random classifier (p = 0.5); the top-left corner is the ideal case and the bottom-right corner the worst case; the bottom-left corner is the 'always negative' classifier and the top-right corner the 'always positive' classifier. Classifiers above the diagonal are better than the random classifier; classifiers below it are worse than random but can be improved by inverting their predictions.]
34/44
‘Continuity’ of the ROC
Given two classifiers c1 and c2.

Classifiers c1 and c2 can be linearly weighted to create ‘intermediate’ classifiers.

In such a way, a continuum of
classifiers can be imagined
which is denoted by a red line
in the figure.
[Figure: ROC space with classifiers c1 and c2 joined by a red line of intermediate classifiers.]
35/44
More classifiers, ROC construction

The convex hull covering
classifiers in the ROC space is
constructed.

Classifiers on the convex hull always achieve the best performance for some class
probability distribution (i.e., some ratio of positive and negative examples).

Classifiers inside the convex hull
perform worse and can be
discarded.

No penalties have been
involved so far. To come . . .
[Figure: ROC space with classifiers c1–c8 and their convex hull.]
36/44
ROC convex hull and iso-accuracy

Each line segment on the convex hull is an iso-accuracy line for a particular
class distribution.
• Under that distribution, the two classifiers on the end-points achieve
the same accuracy.
• For distributions skewed towards negatives (slope steeper than 45°),
the left classifier is better.
• For distributions skewed towards positives (slope flatter than 45°), the
right classifier is better.

Each classifier on convex hull is optimal for a specific range of class
distributions.
37/44
Accuracy expressed in the ROC space
Accuracy a
a = pos · TPR + neg · (1 − FPR).
Express the accuracy in the ROC space
as TPR = f(FPR):
TPR = (a − neg)/pos + (neg/pos) · FPR.
Iso-accuracy lines are straight lines with the slope neg/pos, i.e., given by the
degree of balance in the test set. The balanced set ⇔ 45°.
The diagonal ∼ TPR = FPR = a.
The blue line is an iso-accuracy line for a general classifier c2.
[Figure: ROC space with iso-accuracy lines for a = 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, and the iso-accuracy line a = 0.73 passing through classifier c2.]
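A sketch of the two relations on this slide, with illustrative numbers; `pos` and `neg` denote the class proportions in the test set.

```python
def accuracy_in_roc(tpr, fpr, pos, neg):
    """a = pos * TPR + neg * (1 - FPR)."""
    return pos * tpr + neg * (1 - fpr)

def iso_accuracy_tpr(fpr, a, pos, neg):
    """The iso-accuracy line TPR = (a - neg)/pos + (neg/pos) * FPR."""
    return (a - neg) / pos + (neg / pos) * fpr

# Balanced test set (pos = neg = 0.5): slope neg/pos = 1, i.e. 45-degree lines.
print(accuracy_in_roc(tpr=0.8, fpr=0.2, pos=0.5, neg=0.5))   # 0.8
print(iso_accuracy_tpr(fpr=0.2, a=0.8, pos=0.5, neg=0.5))    # 0.8
```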
38/44
Selecting the optimal classifier (1)
[Figure: ROC space with classifiers c1, c2, c4, c5, c7 on the convex hull; the iso-accuracy line for a balanced set touches c1 at a true positive rate of about 0.78.]
For a balanced training set (i.e., as many +ves as -ves), see the solid blue line.
Classifier c1 achieves ≈ 78 % true positive rate.
39/44
Selecting the optimal classifier (2)
[Figure: ROC space with classifiers c1, c2, c4, c5, c7; the iso-accuracy line touches c1 at a true positive rate of about 0.82.]
For a training set with half as many +ves as -ves, see the solid blue line.
Classifier c1 achieves ≈ 82 % true positive rate.
40/44
Selecting the optimal classifier (3)
[Figure: ROC space with classifiers c1, c2, c4, c5, c7; the marked value is 0.11.]
For less than 11 % +ves, the ‘always negative’ classifier is the best.
41/44
Selecting the optimal classifier (4)
[Figure: ROC space with classifiers c1, c2, c4, c5, c7; the marked value is 0.8.]
For more than 80 % +ves, the ‘always positive’ classifier is the best.
42/44
Benefit of the ROC approach

It is possible to distinguish operationally between:
Discriminability – an inherent property of the detection system.
Decision bias – implied by the loss function, changeable by the user.

The discriminability can be determined from the ROC curve.

If the Gaussian assumption holds then the Bayesian error rate can be
calculated.
43/44
ROC – generalization to arbitrary
multidimensional distribution

The approach can be used for any two multidimensional distributions
p(x|y1), p(x|y2) provided they overlap and thus have nonzero Bayes
classification error.

Unlike in the one-dimensional case, there may be many decision boundaries
corresponding to a particular true positive rate, each with a different false
positive rate.

Consequently, we cannot determine the discriminability from the true positive and
false positive rates without knowing more about the underlying decision rule.
44/44
Optimal true/false positive rates?
Unfortunately not

This is a rarely attainable ideal for a multidimensional case.

We would have to find, among all the decision rules giving the measured true
positive rate, the rule which has the minimum false positive rate.

This would need huge computational resources (like Monte Carlo).

In practice, we forgo optimality.