The document discusses methods for communicating classification uncertainty to end-users. It presents the misclassification method and reclassification method for estimating classification errors. The misclassification method bases error rates on true class sizes and is robust to changes in class proportions, while the reclassification method can be biased by such changes. Additional methods are proposed to estimate variance and handle multiclass problems and shifting feature distributions. The goal is to help non-experts intuitively understand uncertainty in automated classification.
15. How to communicate the uncertainty?
Here the Octopus appeared.
How precise is this?
May 2018
16. Communication Problems
Why should we communicate the uncertainty?
Make informed decisions when choosing and tuning classifiers
Estimate noise and biases in classification results
23. Issues with Classifier Evaluation
It's very tedious
I'm not confident in my decisions
The terminology confuses me
I don't understand the impact on end-results
I often confuse FP and FN
39. Communication Problems
Why should we communicate the uncertainty?
Make informed decisions when choosing and tuning classifiers
Estimate noise and biases in classification results
46. Issues with Estimating Classification Errors
[Chart: Count of Items per Class over time (Number of Items vs. Time, Classes A and B)]
Class A increases a lot
Minority Class B increases too
Class A items misclassified as Class B increase too
Does Class B increase only because of errors from Class A?
Within the items classified as Class B, how many truly belong to Class A?
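The drift scenario above can be made concrete with a toy calculation (all numbers and rates hypothetical): a fixed rate of Class A items mislabeled as Class B inflates the observed Class B count as Class A grows.

```python
# Sketch (hypothetical numbers): a fixed misclassification rate from
# Class A into Class B inflates the observed count of Class B as A grows.

def observed_counts(true_a, true_b, err_a_to_b=0.05, err_b_to_a=0.0):
    """Observed counts when a fraction of true A items is labeled B."""
    obs_a = true_a * (1 - err_a_to_b) + true_b * err_b_to_a
    obs_b = true_b * (1 - err_b_to_a) + true_a * err_a_to_b
    return obs_a, obs_b

# Class A grows tenfold while true Class B stays constant.
_, obs_b_before = observed_counts(1_000, 100)
_, obs_b_after = observed_counts(10_000, 100)
print(round(obs_b_before))  # 150: a third of the items labeled B are truly A
print(round(obs_b_after))   # 600: the apparent growth of B is driven by A's errors
```

Even though true Class B never changed, its observed count grew fourfold, which is exactly the ambiguity the slide's question points at.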
52. Reclassification Method
Number of items truly belonging to Class X and classified as Class Y
Error rates based on output class size (e.g., Precision)
Total number of items classified as Class Y (output class size)
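The two estimation directions can be sketched from a labeled test set's confusion matrix (hypothetical counts): the Reclassification method conditions error rates on the output class size (precision-style), while the Misclassification method conditions them on the true class size (recall-style).

```python
# Sketch with a hypothetical 2x2 confusion matrix from a labeled test set.
# n[x][y] = number of items truly in class x that were classified as class y.
n = [[90, 10],   # true A: 90 classified A, 10 classified B
     [5, 45]]    # true B:  5 classified A, 45 classified B

classes = range(2)

# Reclassification method: rates conditioned on the OUTPUT class size
# (column totals), i.e., precision-style: P(truly X | classified Y).
out_size = [sum(n[x][y] for x in classes) for y in classes]
reclass = [[n[x][y] / out_size[y] for y in classes] for x in classes]

# Misclassification method: rates conditioned on the TRUE class size
# (row totals), i.e., recall-style: P(classified Y | truly X).
true_size = [sum(n[x]) for x in classes]
misclass = [[n[x][y] / true_size[x] for y in classes] for x in classes]

print(round(reclass[0][1], 3))  # 0.182: share of items labeled B that are truly A
print(misclass[0][1])           # 0.1: share of true A items that get labeled B
```

The misclassification rates stay valid when class proportions shift in the target set, whereas the reclassification rates are tied to the test set's class mix.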
61. Results with UCI datasets
Class size estimates for 100 random splits in training, test and target sets
(Naïve Bayes classifier with 10-fold cross-validation)
71. Variance Estimation
Methods exist to estimate the variance of the Reclassification & Misclassification methods
They are applicable for test sets randomly sampled within the target sets
They are not applicable for disjoint test and target sets
[Diagram: a test set sampled within the target set vs. disjoint test and target sets]
73. Sample-to-Sample Method
The Sample-to-Sample method addresses disjoint test and target sets
randomly sampled within the same population
[Diagram: disjoint test and target sets sampled from the same population, with actual Class X]
79. Sample-to-Sample Method
Distribution of target set's error rates
estimated from test set's error rates
Normal distribution
explainable with the
Central Limit Theorem
83. Sample-to-Sample Method
Distribution of target set's error rates
estimated from test set's error rates
We use the class size estimates
from the Misclassification method
Variance w.r.t. target set
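A minimal simulation (hypothetical population and error rate) illustrates why the error rates estimated from random test sets cluster around the population rate in an approximately normal fashion, as the Central Limit Theorem predicts:

```python
# Sketch: repeatedly draw random test sets from a population with a known
# (hypothetical) error rate; the estimated rates cluster around the true
# rate, approximately normally, per the Central Limit Theorem.
import random

random.seed(0)
TRUE_ERROR_RATE = 0.2
population = [1] * 2_000 + [0] * 8_000   # 1 = misclassified item

estimates = []
for _ in range(1_000):
    test_set = random.sample(population, 500)   # a random test set
    estimates.append(sum(test_set) / len(test_set))

mean = sum(estimates) / len(estimates)
print(round(mean, 3))  # close to 0.2; spread shrinks as test sets grow
```

The spread of these estimates is the variance the Sample-to-Sample method aims to quantify for the target set.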
90. Application of Sample-to-Sample Method
Let's start with binary problems, whose solutions can be expressed in a simpler form
These are ratios of random variables…
(Cauchy distribution)
…but they are correlated
(Fieller's theorem)
91. Application of Sample-to-Sample Method
Fieller's Theorem estimates confidence intervals' limits
for ratios of correlated random variables
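A sketch of a Fieller-style interval computation, assuming the two estimates are approximately bivariate normal; the variances, covariance, and counts below are hypothetical, not values from the slides.

```python
# Sketch of a Fieller-style confidence interval for a ratio a/b of two
# correlated, approximately normal estimates. var_a, var_b, cov_ab are the
# variances and covariance of the estimates; z is the normal quantile.
import math

def fieller_interval(a, b, var_a, var_b, cov_ab, z=1.96):
    """CI limits for a/b, assuming (a, b) is approximately bivariate normal."""
    # Roots of: (b^2 - z^2 var_b) t^2 - 2 (a b - z^2 cov_ab) t + (a^2 - z^2 var_a) = 0
    if (z ** 2) * var_b / (b ** 2) >= 1:
        raise ValueError("b not significantly nonzero; interval is unbounded")
    mid = a * b - z ** 2 * cov_ab
    disc = mid ** 2 - (b ** 2 - z ** 2 * var_b) * (a ** 2 - z ** 2 * var_a)
    low = (mid - math.sqrt(disc)) / (b ** 2 - z ** 2 * var_b)
    high = (mid + math.sqrt(disc)) / (b ** 2 - z ** 2 * var_b)
    return low, high

# Hypothetical estimates: a = 30 misclassified items, b = 200 output class size.
low, high = fieller_interval(30.0, 200.0, var_a=25.0, var_b=150.0, cov_ab=10.0)
print(low < 30 / 200 < high)  # True: the point estimate lies inside the interval
```

Unlike a naive normal interval on the ratio, the Fieller construction stays valid when the denominator's uncertainty matters and the two estimates are correlated.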
95. Application of Sample-to-Sample Method
Evaluation of Sample-to-Sample applied with Fieller's theorem
using estimated to derive
We achieve accurate confidence intervals for class size estimates
…but intervals can be very large for small class sizes
…and inaccurate for very small class sizes or error rates
96. Application of Sample-to-Sample Method
Multiclass problems are difficult to express as a ratio
compatible with Fiellerās theorem
…but bootstrapping and simulations can address multiclass problems
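A bootstrapping sketch for a multiclass case (the labeled test set below is hypothetical): resampling the (true, predicted) pairs with replacement yields a confidence interval without needing a closed-form ratio compatible with Fieller's theorem.

```python
# Sketch: bootstrap confidence interval for a multiclass quantity, as an
# alternative when Fieller's theorem does not apply. Resamples a hypothetical
# labeled test set of (true_label, predicted_label) pairs with replacement.
import random

random.seed(0)

test_pairs = ([("A", "A")] * 80 + [("A", "B")] * 15 + [("A", "C")] * 5 +
              [("B", "B")] * 40 + [("B", "A")] * 10 +
              [("C", "C")] * 45 + [("C", "B")] * 5)

def share_truly_a_among_predicted_b(pairs):
    """Share of items classified as B that truly belong to A."""
    predicted_b = [p for p in pairs if p[1] == "B"]
    return sum(1 for p in predicted_b if p[0] == "A") / len(predicted_b)

boot = []
for _ in range(2_000):
    resample = random.choices(test_pairs, k=len(test_pairs))
    boot.append(share_truly_a_among_predicted_b(resample))

boot.sort()
low, high = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(low <= share_truly_a_among_predicted_b(test_pairs) <= high)  # True
```

The same resampling loop extends to any number of classes and any derived statistic, at the cost of simulation time rather than analytic effort.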
103. Future Work
Variance estimation for multiclass problems
Fully-specified guidelines for choosing between the Reclassification and Misclassification methods, or neither
(depending on the number of classes, class sizes in test and target sets, error rate magnitude, and shifts of feature distributions)
Handle shifts of error rates and feature distributions
(domain adaptation, e.g., with Bayesian classifiers)
Predict variance magnitude without knowledge of the target sets
(Maximum Determinant method)
Guidelines for balancing the sizes of test and target sets
(smaller training sets but larger test sets may improve error estimation)
106. Future Work
Visualizations of variance estimates
e.g., for potential target sets
Uncertainty propagation in pipelines of classifiers
e.g., with different test sets
Identify individual misclassifications
110. October 2017 IEEE Conference on Data Science and Advanced Analytics (DSAA)
Varying Feature Distributions
112. Varying Feature Distributions
If the feature distributions vary between test and target sets, classifiers may behave differently
The error rates may systematically differ between test and target sets, and the Misclassification and Reclassification Methods can greatly worsen the classification biases
116. Varying Feature Distributions
Regressions can be fit to infer error rates from feature values, but this approach is more complex with the Misclassification Method
Future work is required to handle varying feature distributions
…but the Misclassification Method can be used to refine priors in Bayesian classifiers
(i.e., the unconditional class probabilities)
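A minimal sketch of prior refinement for a binary problem, assuming misclassification (true-class-conditioned) rates estimated from a test set; all rates and proportions below are hypothetical:

```python
# Sketch: refine the unconditional class probabilities (priors) using
# misclassification rates. For two classes, the observed proportion relates
# to the true proportion by a linear equation that can be inverted directly.
p_b_given_a = 0.05   # P(classified B | truly A), hypothetical test-set rate
p_a_given_b = 0.10   # P(classified A | truly B), hypothetical test-set rate

observed_a = 0.70    # proportion of target items classified as A

# observed_a = (1 - p_b_given_a) * true_a + p_a_given_b * (1 - true_a)
true_a = (observed_a - p_a_given_b) / (1 - p_b_given_a - p_a_given_b)
priors = {"A": true_a, "B": 1 - true_a}
print(round(priors["A"], 4))  # 0.7059: refined prior, vs. the naive 0.70
```

The refined priors can then replace the raw classified proportions as the unconditional class probabilities of a Bayesian classifier.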
125. Maximum Determinant Method
When starting an application, several classifiers may be available with no knowledge of the potential target sets
To choose a classifier, the Maximum Determinant Method aims at predicting which classifier yields the smallest variance when applying the Misclassification Method
128. Maximum Determinant Method
Hypothesis: The higher the determinant of the error rate matrix,
the lower the results variance.
Inspired by
Cramer's rule
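The hypothesis can be illustrated with two hypothetical 2x2 error rate matrices (rows: true class, columns: predicted class, rows summing to 1):

```python
# Sketch of the hypothesis: compare two hypothetical classifiers by the
# determinant of their misclassification (error rate) matrices; the larger
# determinant is predicted to yield lower-variance class size estimates.

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Rows: true class; columns: predicted class; each row sums to 1.
classifier_1 = [[0.95, 0.05],
                [0.10, 0.90]]
classifier_2 = [[0.80, 0.20],
                [0.25, 0.75]]

d1, d2 = det2(classifier_1), det2(classifier_2)
print(d1 > d2)  # True: classifier_1 is predicted to be the more stable choice
```

The intuition via Cramer's rule: recovering true class sizes from observed ones divides by this determinant, so a determinant near zero amplifies the noise in the estimates.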
134. Maximum Determinant Method
Initial results are promising but the theory must be established
Problems for which Misclassification or Ratio-to-TP error rates provide better predictors?
Binary problems for which the method is irrelevant?
What are the parameters of the relationship between the determinant and the variance of misclassification results?
(number of classes, class sizes in test and target sets, error rate magnitude)
157. Interactions of Uncertainty Factors
[Diagram of interacting uncertainty factors: Image Quality, Ground-Truth Quality, Classification Errors, Sampling Coverage, Duplicated Individuals, Field of View, and Biases & Noise in Specific Output]
Poor images yield more errors
Typhoons yield poor images? (bias)
What confidence intervals? (noise)
Missing videos?
Some species often move in & out of the field of view
Fields of view target specific habitats and shift over time
161. Lessons Learned
Uncertainty factors arise from the system and its deployment conditions
Investigations should include domain experts and technical experts
…and non-experts!
People need to feel comfortable to engage in criticism
179. Issues Tackled
Some metrics conceal uncertainty
The metrics omit which species are confused with another
Using one single type of curve can hide differences
…and omit species proportions
Editor's Notes
The central limit theorem (CLT) states that, given a sufficiently large sample size drawn from a population with finite variance, the mean of the sample means will be approximately equal to the population mean, and the sample means will follow an approximately normal distribution.