Imprecision in (statistical) learning: an incomplete
overview
Sébastien Destercke
Heudiasyc, CNRS Compiegne, France
UTC Data science seminar
Plan
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
The basic (supervised) setting
You consider a parametrized set Θ of possible models
You observe a bunch of input/output pairs (xi, yi) over X × Y:
X: input space
Y: output space
From them, learn a predictive model with parameters θ̂ ∈ Θ
A model θ takes as input x, and can typically output:
A probability p(y|x) over Y (e.g., logistic regression)
A real-valued score s(y|x) over Y (e.g., SVM)
One of the elements of Y (e.g., nearest neighbour)
Classification: Y finite set
X = R2
Y = {●, ○} (two classes, shown as two point colours in the figure)
Θ = {θ1, θ2}
[Figure: labelled points (x, y) in the (X1, X2) plane; θ1 classifies by thresholding X2 < b, θ2 by thresholding X1 > a]
Regression: Y continuous
X = R
Y = R
Θ = {(a, b) ∈ R2} → θ(x) = a · x + b
[Figure: scatter of points (x, y) with the fitted line θ]
The classical scheme
Precise
data (xi, yi)
Precise
model θ
Precise
prediction
θ(x) = y
Induction
principle
Inference/Decision
rule
→ this talk: what if one of these steps becomes imprecise/partial (by
constraint or by design)?
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
Loss and selection
ℓ(ŷ, y): loss incurred by predicting ŷ if y is observed.
A model θ will produce predictions θ(x), and its global loss on
observed training data (xi, yi) will be evaluated as1
Remp(θ) = ∑_{i=1}^{N} ℓ(θ(xi), yi)
possibly with regularization to avoid overfitting (not this talk's topic)
The optimal model is
θ∗ = arg min_{θ∈Θ} Remp(θ),
the one with the lowest possible average loss
1 Used as an approximation of R(θ) = ∫_{X×Y} ℓ(θ(x), y) dP(x, y).
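To make the induction principle concrete, here is a minimal sketch (not from the talk; the data, thresholds and class names are made up) of empirical risk minimisation over a finite model set, in Python:

# Empirical risk minimisation over a finite model set (illustrative sketch).
def zero_one_loss(y_pred, y_true):
    return 0 if y_pred == y_true else 1

def empirical_risk(model, data, loss):
    # average loss of the model on the observed (x, y) pairs
    return sum(loss(model(x), y) for x, y in data) / len(data)

# Two hypothetical threshold classifiers, echoing the toy figure:
theta_1 = lambda x: 'blue' if x[1] < 0.5 else 'red'   # thresholds X2
theta_2 = lambda x: 'blue' if x[0] > 0.3 else 'red'   # thresholds X1

data = [((0.1, 0.2), 'red'), ((0.6, 0.8), 'blue'), ((0.4, 0.1), 'blue')]
theta_star = min([theta_1, theta_2],
                 key=lambda m: empirical_risk(m, data, zero_one_loss))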
Classification: Y finite set
ℓ0/1(ŷ, y) = 1 if ŷ ≠ y, 0 if ŷ = y
Remp(θ1) = 2/13, Remp(θ2) = 1/13 → θ∗ = θ2
[Figure: the 13-point toy dataset in the (X1, X2) plane; θ1 (X2 < b) makes two errors, θ2 (X1 > a) makes one]
Illustrations
Regression
L(y, ŷ) = (y − ŷ)²
[Figure: data points and the fitted curve hθ∗]
Classification (binary log reg)
L(y, p) = −log(p) if y = 1, −log(1 − p) if y = 0
[Figure: binary data and the fitted model hθ∗]
Some additional notes
The function Remp induces a complete order ⪯ between all
models → up to indifference, the best model is unambiguously defined2
The likelihood function
L(θ|(x·, y·)) = ∏_{i=1}^{n} p(xi, yi|θ)
or the Bayesian posterior
P(θ|(x·, y·)) ∝ L(θ|(x·, y·)) · P(θ)
also induces numerical scores that completely order models θ.
P(θ|(x·, y·)) also provides (“meaningful”) probabilistic weights.
2 Convexifying, a common ML game, ensures computability and the absence of indifference.
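As an illustration (a sketch with hypothetical Bernoulli models, priors and data, not taken from the talk), both scores can be computed and used to rank a finite set of models:

import numpy as np

# Likelihood and posterior both induce a complete order over a finite model set.
thetas = np.array([0.2, 0.5, 0.8])   # three candidate Bernoulli models
prior = np.array([0.5, 0.3, 0.2])    # hypothetical prior weights P(θ)
sample = [1, 1, 0, 1]                # hypothetical observations

# L(θ | data) = Π_i p(y_i | θ)
lik = np.array([np.prod([t if y else 1 - t for y in sample]) for t in thetas])
posterior = lik * prior / np.sum(lik * prior)   # P(θ | data) ∝ L(θ | data) · P(θ)
ranking = thetas[np.argsort(-posterior)]        # best model first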
Plan
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
Induction with imprecise data
We observe possibly imprecise inputs/outputs (X, Y) containing the
truth (some (x, y) ∈ (X, Y) is the true, unobserved value)
Losses3 become set-valued [2]:
ℓ(θ(X), Y) = {ℓ(θ(x), y) | y ∈ Y, x ∈ X}
Previous induction principles are no longer well-defined
What if we still want to get one model?
3 And likelihoods/posteriors alike
The imprecise setting illustrated
Regression [figure: interval-valued observations]
Classification (binary log reg) [figure: partially/imprecisely labelled points]
How to define hθ∗ ?
Illustration on toy example
ℓ0/1(ŷ, y)
R̲(θ) = ∑_i min_{(xi,yi)∈(Xi,Yi)} ℓ(θ(xi), yi) → best-case scenario
R̄(θ) = ∑_i max_{(xi,yi)∈(Xi,Yi)} ℓ(θ(xi), yi) → worst-case scenario
[Figure: the toy data in the (X1, X2) plane, with five imprecise (set-valued) observations numbered 1–5 and the classifiers θ1, θ2]
[R̲(θ1), R̄(θ1)] = [0, 5/13]
[R̲(θ2), R̄(θ2)] = [1/13, 3/13]
Going back to a precise model
If we know the “imprecisiation” process Pobs((X, Y)|(x, y)), no
theoretical problem → “merely” a computational one
If not, common approaches are to redefine a precise criterion:
Optimistic (Maximax/Minimin) approach [8, 1]:
ℓopt(θ(x), Y) = min{ℓ(θ(x), y) | y ∈ Y}
Pessimistic (Maximin/Minimax) approach [6]:
ℓpes(θ(x), Y) = max{ℓ(θ(x), y) | y ∈ Y}
EM-like or averaging/weighting approaches4:
ℓw(θ(x), Y) = ∑_{y∈Y} wy ℓ(θ(x), y)
4 With likelihood ∼ Lav(θ|(x, Y)) = P((x, Y)|θ) [4]
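As a minimal sketch (assuming a 0/1 loss and set-valued labels; the class names are illustrative), the optimistic and pessimistic criteria simply replace the set-valued loss by its lower or upper bound:

# Optimistic vs pessimistic reduction of a set-valued 0/1 loss (sketch).
def loss_opt(y_pred, Y_set):
    # best case: the most favourable replacement y in the imprecise label
    return min(0 if y_pred == y else 1 for y in Y_set)

def loss_pes(y_pred, Y_set):
    # worst case: the least favourable replacement y in the imprecise label
    return max(0 if y_pred == y else 1 for y in Y_set)

loss_opt('a', {'a', 'b'})   # 0: one replacement agrees with the prediction
loss_pes('a', {'a', 'b'})   # 1: another replacement disagrees
loss_pes('a', {'a'})        # 0: precise labels give back the usual loss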
Not a trivial choice: regression example
Pessimistic tries to be good for every replacement
Optimistic tries to be the best for one replacement
A logistic regression example
[Figure: logistic regression fits on imprecisely labelled data under the optimistic (OPT) and pessimistic (PESS) criteria]
Which one should I be?
Optimist . . .
or. . .
Pessimist?
→ pretty much depends on the context!
Some elements of answer
When to be optimist?
reasonably sure the model space Θ can capture the best model / a good predictor, and is not too flexible (overfitting!)
the “imprecisiation” process is random / not designed to make you fail
Optimism ≈ semi-sup. learning if imprecision = missingness.
When to be pessimist?
you want to obtain guarantees in all possible scenarios (≈ distributional robustness)
you are facing an “adversarial” process
partial data = set of situations for which you want to perform reasonably well
Beyond imprecise data: soft data
Assume two classes {a, b}. We can put different uncertainty models
over them
Certain label: all mass on a
Imprecise label: the set {a, b}, nothing more
Probabilistic label: α on a, 1 − α on b
Possibilistic label: graded possibility over {a, b}
[Figure: the four label types as bar plots over {a, b}]
→ is it useful to consider such more complex models/cases?
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
Models and ordering
In the classical scheme, models are completely ranked
[Diagram: models θ1, θ2, θ3, . . . , θn, each with a precise risk R(θi)]
Models and ordering
And we pick the top one
[Diagram: models reordered as θ(1), . . . , θ(n) by increasing risk; θ(1) is kept, all others are crossed out]
Model weighting
Ensembles, Bayes posteriors, etc → weights over models
[Diagram: the ranked models θ(1), . . . , θ(n) now carry weights P(θ1|x, y), . . . , P(θn|x, y)]
But still no imprecision
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecise data
Other ways to get imprecise models
Why look for an imprecise model?
Imprecision in predictions
How to get an imprecise prediction?
How to evaluate an imprecise prediction?
Back to the toy example
ℓ0/1(ŷ, y)
Unless we commit to a behaviour, models θ1, θ2 are incomparable
[Figure: the same toy data with its five imprecise observations and the classifiers θ1, θ2]
[R̲(θ1), R̄(θ1)] = [0, 5/13]
[R̲(θ2), R̄(θ2)] = [1/13, 3/13]
Induced partial order
Each model is now set- or interval-valued
[Diagram: models θ1, . . . , θn with interval-valued risks [R̲(θi), R̄(θi)]]
θi is surely preferred to θj if R̄(θi) < R̲(θj)
this is known as an interval order
(very) safe bet: take all maximal models θ, i.e., those such that no θ′ is surely preferred to θ
works also if [R̲(θi), R̄(θi)] is a statistical confidence interval
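A minimal sketch of this decision rule (the third model and its interval are made up to show a dominated case; the first two intervals follow the toy example):

# Maximal models under interval dominance (sketch).
risks = {'theta1': (0.0, 5/13),    # [R_low, R_up] from the toy example
         'theta2': (1/13, 3/13),
         'theta3': (0.4, 0.5)}     # hypothetical extra model

def surely_better(a, b):
    # a surely preferred to b: a's upper risk is below b's lower risk
    return risks[a][1] < risks[b][0]

maximal = [m for m in risks
           if not any(surely_better(o, m) for o in risks if o != m)]
# -> ['theta1', 'theta2']; theta3 is dominated by both other models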
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecise data
Other ways to get imprecise models
Why look for an imprecise model?
Imprecision in predictions
How to get an imprecise prediction?
How to evaluate an imprecise prediction?
Sets of best models
Not taking the best, but the k-best
[Diagram: the ranked models θ(1), . . . , θ(k) are kept, θ(k+1) and beyond are discarded]
One common way to do it [5] (dates back to Birnbaum, at least):
Normalize the likelihood by computing
L∗(θ|(x·, y·)) = L(θ|(x·, y·)) / sup_θ L(θ|(x·, y·))
Take as set estimate the cut of level α:
Θα = {θ : L∗(θ|(x·, y·)) ≥ α}
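A sketch of this α-cut on a finite grid (the Bernoulli likelihood and the sample are made up for illustration):

import numpy as np

# α-cut of the normalized likelihood over a parameter grid (sketch).
thetas = np.linspace(0.001, 0.999, 999)   # grid of Bernoulli models
sample = [1, 0, 1, 1]                     # hypothetical observations
log_lik = sum(np.log(thetas) if y else np.log(1 - thetas) for y in sample)

rel_lik = np.exp(log_lik - log_lik.max())   # L*(θ) = L(θ) / sup_θ L(θ)
alpha = 0.5
Theta_alpha = thetas[rel_lik >= alpha]      # the set estimate Θα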
Robust Bayes and imprecise probabilities [12, 14]
Consider a set of priors, and its corresponding set of posteriors
[Diagram: models θ1, . . . , θn, each now with an interval-valued weight [P̲(θi), P̄(θi)]]
≠ from the pure Bayesian approach, as priors are not weighted5
5 Anarchy!
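A sketch with a finite set of hypothetical priors over three candidate models: each prior gives one posterior, and taking bounds per model yields the interval-valued weights above.

import numpy as np

# Robust Bayes: a set of priors yields interval-valued posterior weights (sketch).
thetas = np.array([0.2, 0.5, 0.8])          # candidate Bernoulli models
priors = [np.array([0.6, 0.3, 0.1]),        # hypothetical prior set
          np.array([1/3, 1/3, 1/3]),
          np.array([0.1, 0.3, 0.6])]
sample = [1, 1, 0, 1]

lik = np.array([np.prod([t if y else 1 - t for y in sample]) for t in thetas])
posteriors = np.array([lik * p / np.sum(lik * p) for p in priors])
p_low, p_up = posteriors.min(axis=0), posteriors.max(axis=0)   # [P_low(θi), P_up(θi)]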
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecise data
Other ways to get imprecise models
Why look for an imprecise model?
Imprecision in predictions
How to get an imprecise prediction?
How to evaluate an imprecise prediction?
Yes, why?
Because...
Other reasons
You want to make some robustness analysis around your top
models or your weighting scheme (because of limited data, or because
they are not the theoretically optimal ones, . . . );
You suspect the observed data will be different6 from the training
ones (transfer learning, distributional robustness [9]);
You want a rich uncertainty quantification where there is a clear
distinction between aleatory uncertainty (irreducible, due to fixed
learning setting) and epistemic uncertainty (reducible by collecting
information). This can be used to:
produce cautious predictions (see next slides)
perform active learning [10]
explain uncertainty sources (largely unexplored topic)
6 In practice, drawn from a distribution Ptest(X, Y) ≠ Ptrain(X, Y)
Two kinds of uncertainties
Aleatory uncertainty: classes are really mixed → irreducible with
more data (but possibly by adding features)
Epistemic uncertainty: lack of information → reducible
[Figure, left: around x, several a and b observations thoroughly mixed]
Aleatory uncertainty: P(a) ∈ [0.45, 0.55]
[Figure, right: around x, only two observations, one a and one b]
Epistemic uncertainty: P(a) ∈ [0.2, 0.8]
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecision in predictions
Imprecise decisions: illustration
Predicting over Y = {a, b, c}
[Figure: precise predictions pick a single class everywhere; imprecise predictions return a set of classes in ambiguous regions]
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecise data
Other ways to get imprecise models
Why look for an imprecise model?
Imprecision in predictions
How to get an imprecise prediction?
How to evaluate an imprecise prediction?
Some context
You allow your model θ to output more than one class/value7:
θ(x) ⊂ Y
Some questions:
How to ensure a trade-off b/w information (θ(x) small) and accuracy (ytrue ∈ θ(x))?
How to evaluate the quality of θ(x)?
Given confidence α, how to ensure global coverage P(ytrue ∈ θ(x)) ≥ α?
Given confidence α and x, how to ensure local coveragea P(ytrue ∈ θ(x)|x) ≥ α?
a Much, much more difficult.
7 Yes, this is classical in regression, less so in other frameworks.
Probabilistic partial reject [3, 7]
Assume we have p(y|x) as training output
Fix a confidence value α ∈ [0, 1]
Consider the permutation (·) on Y such that p(y(1)|x) ≥ p(y(2)|x) ≥ . . . ≥ p(y(K)|x)
Take classes in this order until the cumulated probability is above α:
θ(x) = {y(1), . . . , y(j) : ∑_{i=1}^{j−1} p(y(i)|x) ≤ α, ∑_{i=1}^{j} p(y(i)|x) ≥ α}
Example
α = 0.9
P(a|x) = 0.7 ≥ P(c|x) = 0.25 ≥ P(b|x) = 0.05
θ(x) = {a, c}
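A minimal sketch of this rule (the class names and probabilities follow the example above):

# Smallest class set whose cumulated probability reaches α (sketch).
def partial_reject(probs, alpha):
    ranked = sorted(probs, key=probs.get, reverse=True)  # y(1), y(2), ...
    pred, cumul = [], 0.0
    for y in ranked:
        pred.append(y)
        cumul += probs[y]
        if cumul >= alpha:          # stop once the cumulated mass is above α
            break
    return set(pred)

partial_reject({'a': 0.7, 'b': 0.05, 'c': 0.25}, alpha=0.9)   # {'a', 'c'}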
On probabilistic reject
The pros:
Rather straightforward to implement
Approximate coverage ensured if P calibrated8
The cons:
Difficult to differentiate ambiguity vs lack of knowledge
Badly estimated probabilities can lead to misleading conclusions
8 ∃ techniques for global calibration, less for local.
(inductive) Conformal prediction [11]
Take a validation set9 of I instances (xi, yi)
To each yi associate a score αi = max_{y≠yi} (p(y|xi) − p(yi|xi))
Given a new instance xI+1, define the p-value of each prediction yj as
pv(yj) = |{i = 1, . . . , I, I + 1 : αi ≥ αyj}| / (I + 1)
with αyj the score of (xI+1, yj)
Fix a confidence value α ∈ [0, 1]
Get as prediction
θ(x) = {yj : pv(yj) ≥ 1 − α}
9 ≠ training and test sets
Conformal prediction: example
α = 0.9
Assume 10 validation data with scores
−0.1; 0.3; −0.4; 0.1; 0; −0.6; −0.2; 0.2; 0.3; −0.1
We observe an instance x and test each candidate label:
if (x, a) has score 0.5, pv(a) = 0/11 < 0.1 → a rejected
if (x, b) has score −0.2, pv(b) = 8/11 ≥ 0.1 → b kept
if (x, c) has score 0, pv(c) = 5/11 ≥ 0.1 → c kept
θ(x) = {b, c}
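A sketch reproducing the computation above (the candidate scores for (x, a), (x, b), (x, c) are those of the example):

# Inductive conformal prediction step (sketch, following the example above).
val_scores = [-0.1, 0.3, -0.4, 0.1, 0, -0.6, -0.2, 0.2, 0.3, -0.1]

def p_value(score, val_scores):
    # validation scores at least as nonconforming, over I + 1 (as in the example)
    return sum(s >= score for s in val_scores) / (len(val_scores) + 1)

alpha = 0.9
candidates = {'a': 0.5, 'b': -0.2, 'c': 0.0}   # scores of (x, a), (x, b), (x, c)
theta_x = {y for y, s in candidates.items()
           if p_value(s, val_scores) >= 1 - alpha}
# -> {'b', 'c'}: pv(a) = 0/11, pv(b) = 8/11, pv(c) = 5/11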
On conformal prediction
The pros:
Provide global coverage guarantee10
Works on any score-based model (including deep ones), with weak
theoretical requirements (exchangeability)
The cons:
Needs a validation set
May give fairly imprecise outputs if bad model/small validation set
No clear distinction between aleatoric/epistemic aspects
10 Some works on conditional coverage exist.
Working with a set of models
Output is a set M of (probabilistic) models
For any m ∈ M, call m(x) its optimal prediction for x
Take as θ(x) all possibly optimal predictions
θ(x) = {m(x) : m ∈ M}
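A minimal sketch (the two threshold models echo the toy example; the thresholds themselves are made up):

# Prediction set induced by a set of models (sketch).
models = [
    lambda x: 'a' if x[0] > 0.3 else 'b',   # θ2-like: thresholds X1
    lambda x: 'a' if x[1] < 0.5 else 'b',   # θ1-like: thresholds X2
]

def prediction_set(x, models):
    # gather all possibly optimal predictions m(x), m ∈ M
    return {m(x) for m in models}

prediction_set((0.6, 0.8), models)   # {'a', 'b'}: the models disagree here
prediction_set((0.6, 0.2), models)   # {'a'}: all models agree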
Example
[Figure: the (X1, X2) plane split by θ1 (threshold b on X2) and θ2 (threshold a on X1); where the two models agree, θ(x) is a single class, and where they disagree, θ(x) contains both classes]
On sets of models
The pros:
Approximate global coverage can be obtained11
Better control of imprecision
If m probabilistic, easier to distinguish aleatoric/epistemic
uncertainty
The cons:
The learning method has to be adapted → more or less painful
Decision rule can lead to complex optimisation
11 Conditional coverage is harder; not much on that for now.
Outline
1 Basic setting
Setting the learning framework
Model selection (by loss minimisation)
2 Imprecision in learning
Imprecise data (and precise models)
Imprecision in models
Imprecise data
Other ways to get imprecise models
Why look for an imprecise model?
Imprecision in predictions
How to get an imprecise prediction?
How to evaluate an imprecise prediction?
The two doctors story
In a hospital, doctors get 1$ each time their diagnosis is right.
Two doctors are pretty sure that patients have either Pneumonia (P) or Bronchitis (B)
Doctor 1
Flips a coin each time
Diagnoses the result
Gets 0.5$ on average
Doctor 2
Tells you he does not know b/w P and B
Should his reward be 0.5$, the same as doctor 1? Higher? Lower?
Main solution so far for 0/1 loss
u(Ŷ, y) = 0 if y ∉ Ŷ, and α/|Ŷ| + (1 − α)/|Ŷ|² otherwise,
with u(Ŷ, y) = 1 if |Ŷ| = 1 and Ŷ = {y}
Discounted accuracy: α = 1
u(Ŷ, y) = 1/|Ŷ|
→ no reward to cautiousness (cautiousness ≡ randomness)
u65: α = 1.6, moderate reward to cautiousness
u80: α = 2.2, big reward to cautiousness [15]
The higher α, the higher the reward
Solutions exist for generic losses too [13].
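A sketch of this utility (the set and class names are illustrative):

# Discounted utility u(Ŷ, y) for set-valued predictions under 0/1 loss (sketch).
def u(pred_set, y, alpha):
    if y not in pred_set:
        return 0.0
    k = len(pred_set)
    return alpha / k + (1 - alpha) / k**2   # equals 1 whenever k == 1

u({'P'}, 'P', alpha=1.6)        # 1.0 : precise and correct
u({'P', 'B'}, 'P', alpha=1.0)   # 0.50: discounted accuracy
u({'P', 'B'}, 'P', alpha=1.6)   # 0.65: u65
u({'P', 'B'}, 'P', alpha=2.2)   # 0.80: u80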
Boldness averseness illustrated
[Figure: utility of a cautious-but-correct prediction as a function of 1/|Ŷ|; when 2 classes are predicted and the good one is among them, u50 (discounted accuracy) gives 0.5, u65 gives 0.65, u80 gives 0.8]
References I
[1] Timothee Cour, Ben Sapp, and Ben Taskar.
Learning from partial labels.
Journal of Machine Learning Research, 12(May):1501–1536, 2011.
[2] Inés Couso and Luciano Sánchez.
Machine learning models, epistemic set-valued data and generalized loss functions: An encompassing approach.
Information Sciences, 358:129–150, 2016.
[3] Juan José del Coz, Jorge Díez, and Antonio Bahamonde.
Learning nondeterministic classifiers.
Journal of Machine Learning Research, 10(Oct):2273–2293, 2009.
[4] Thierry Denoeux.
Maximum likelihood estimation from uncertain data in the belief function framework.
IEEE Transactions on Knowledge and Data Engineering, 25(1):119–130, 2013.
[5] D. Dubois, S. Moral, and H. Prade.
A semantics for possibility theory based on likelihoods.
Journal of Mathematical Analysis and Applications, 205(2):359–380, 1997.
[6] Romain Guillaume, Inés Couso, and Didier Dubois.
Maximum likelihood with coarse data based on robust optimisation.
In Proceedings of the Tenth International Symposium on Imprecise Probability: Theories and Applications, pages 169–180,
2017.
[7] Thien M Ha.
The optimum class-selective rejection rule.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):608–615, 1997.
References II
[8] Eyke Hüllermeier.
Learning from imprecise and fuzzy observations: Data disambiguation through generalized loss minimization.
International Journal of Approximate Reasoning, 55(7):1519–1534, 2014.
[9] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh.
Wasserstein distributionally robust optimization: Theory and applications in machine learning.
In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS, 2019.
[10] Vu-Linh Nguyen, Sébastien Destercke, and Eyke Hüllermeier.
Epistemic uncertainty sampling.
In International Conference on Discovery Science, pages 72–86. Springer, 2019.
[11] Harris Papadopoulos.
Inductive conformal prediction: Theory and application to neural networks.
In Tools in artificial intelligence. Citeseer, 2008.
[12] P. Walley.
Statistical reasoning with imprecise probabilities.
Chapman and Hall, New York, 1991.
[13] Gen Yang, Sébastien Destercke, and Marie-Hélène Masson.
The costs of indeterminacy: How to determine them?
IEEE Transactions on Cybernetics, 47(12):4316–4327, 2017.
[14] M. Zaffalon.
The naive credal classifier.
Journal of Statistical Planning and Inference, 105(1):5–21, 2002.
[15] Marco Zaffalon, Giorgio Corani, and Denis Mauá.
Evaluating credal classifiers by utility-discounted predictive accuracy.
International Journal of Approximate Reasoning, 53(8):1282–1301, 2012.