Py data19 final

How good is your prediction?
Quantifying uncertainty in Machine Learning predictions
PyData London 2019 (12th-14th July)
Maria Navarro

Outline
Motivating example
Introduction to conformal predictions
Conformal predictions in classification
Conformal predictions in regression
Application
Summary and conclusions
References

Motivating example
How good is your prediction?
PROBLEM: To find out whether a car is a total loss or not
To do it we have:
1. A set of historical observations $(x_1, y_1), \dots, (x_N, y_N)$, where:
• $x_i$ describes the accident by age of the driver, model of the car, etc.
• $y_i$ is a label which identifies whether the car is repairable or not
2. A machine learning algorithm, $h(x) = y$

Motivating example
How good is your prediction, REALLY?
A new accident, $x_{N+1}$, occurs. We run our model, and we obtain the following results:
1. The car is classified as a total loss
2. The probability of total loss according to our model is 0.85
3. The model is roughly 91% accurate on the training, test and validation sets, so we expect the same behaviour on production data
4. The model has an AUC of 0.88 in training, so again that is what we expect on production data
What do these measurements mean?
Do we have any guarantee about accident $x_{N+1}$?
Are we confident about the prediction?

Why Conformal Predictions (CP)?
1. There are several ad hoc ways to obtain some confidence around your predictions (resampling methods, assuming normality, etc.).
2. Conformal prediction assumes very little about the outcome you are trying to predict. It only assumes exchangeability.
3. It can be used with any machine learning algorithm.
4. It provides error bounds at a confidence level that we can select.
5. Probabilities are well-calibrated.
6. It is easy to implement.
7. The framework has been proven:
V. Vovk, A. Gammerman, G. Shafer, Algorithmic Learning in a Random World, Springer, 2005.

General idea
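• Let $Z$ be a probability distribution.
• Let $f: Z \to \mathbb{R}$ be some function.
• We draw 5 samples from the distribution $Z$ and apply $f(z)$:
  o $f(z_i) = \alpha_i$, with $i = 1, \dots, 5$
  o For simplicity, we assume $\alpha_1 \le \alpha_2 \le \alpha_3 \le \alpha_4 \le \alpha_5$
• We estimate the cumulative distribution function (CDF) for the scores:
[Figure: estimated CDF of the scores, with ticks 0, 0.2, 0.4, 0.6, 0.8, 1 on the probability axis and $\alpha_1, \dots, \alpha_5$ on the score axis]
• We draw a new sample $z \in Z$. We assume exchangeability and compute $f(z) = \alpha$.
• We can estimate its probability: $P(\alpha \le \alpha_4) = 0.6$ and $P(\alpha \le \alpha_2) = 0.2$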
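The estimate above can be sketched in a few lines of Python. This is only an illustration: it borrows the five score values 0.10, 0.13, 0.28, 0.30 and 0.38 from the regression example of this talk, and the counting rule (fraction of sampled scores strictly below the threshold) is an assumption inferred from the numbers on the slide. The function name `estimated_probability` is hypothetical.

```python
import numpy as np

# Five sampled scores, already sorted (values taken from the talk's regression example).
scores = np.array([0.10, 0.13, 0.28, 0.30, 0.38])

def estimated_probability(threshold, scores):
    """Fraction of sampled scores strictly below `threshold`.

    Assumption: this counting rule is inferred from the slide's numbers.
    """
    return float(np.mean(scores < threshold))

p_alpha4 = estimated_probability(scores[3], scores)  # threshold alpha_4 = 0.30
p_alpha2 = estimated_probability(scores[1], scores)  # threshold alpha_2 = 0.13
print(p_alpha4, p_alpha2)
```

With only five samples the estimate is coarse; more samples give a finer grid of attainable probabilities.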

Relation to our problem
• Let $z_i = (x_i, y_i)$ with $i = 1, \dots, p$ be a sample of the probability distribution $Z = (X, Y)$, where:
  o $x_i$ are our observables and $y_i$ is the target we want to predict
• We define $f(z_i) = |y_i - h(x_i)|$, where:
  o $h$ is a regression model trained on $z_i$ with $i = 6, \dots, p$
• We apply $f(z)$ to the 5 remaining samples:
  o $f(z_i) = \alpha_i$, with $i = 1, \dots, 5$
  o We can compute the exact values: $0.10 \le 0.13 \le 0.28 \le 0.30 \le 0.38$
• We estimate the cumulative distribution function (CDF) for the scores:
[Figure: estimated CDF of the scores, with ticks 0, 0.2, 0.4, 0.6, 0.8, 1 on the probability axis and 0.10, 0.13, 0.28, 0.30, 0.38 on the score axis]
• We draw a new sample $z \in Z$. We assume exchangeability and compute $f(z) = |y - h(x)| = |y - 2|$.
• We can estimate its probability:
  o $P(|y - 2| \le 0.30) = 0.6$ and $P(|y - 2| \le 0.28) = 0.4$
  o $P(y \in 2 \pm 0.30) = 0.6$ and $P(y \in 2 \pm 0.28) = 0.4$
  o $y \in [1.7, 2.3]$ with probability 0.6

Inputs for conformal predictions
• A set of training examples $z_i = (x_i, y_i)$ with $i = 1, \dots, p$
  o They must be drawn from an exchangeable distribution (the order of observations is irrelevant).
• A non-conformity function $f: Z \to \mathbb{R}$
  o It measures the “weirdness” of an example $(x_i, y_i)$
  o It should give low scores to similar examples $(x_i, y_i)$ and high scores to different ones $(x_i, \neg y_i)$
  o A common choice is to take some function of the underlying model, but it can be anything: the probability estimate for the correct class, distance to neighbours with the same class, probability from the trees, absolute error of a regression model, etc.
• Set a significance level $\varepsilon \in (0, 1)$, which gives a confidence level of $1 - \varepsilon$
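As a rough illustration of what a non-conformity function can look like, here are two minimal Python sketches. Both are assumed choices, not the only options: one minus the model's probability for the candidate label in classification, and the absolute error in regression. The function names and the example numbers are hypothetical.

```python
import numpy as np

def classification_score(class_probabilities, label):
    """Non-conformity of `label`: one minus the probability the model gives it.

    Assumed choice: a high probability means a low "weirdness" score.
    """
    return 1.0 - class_probabilities[label]

def regression_score(y_true, y_pred):
    """Non-conformity of a regression example: absolute error."""
    return abs(y_true - y_pred)

# Hypothetical model output for one example: P(class 0), P(class 1).
probs = np.array([0.15, 0.85])
print(classification_score(probs, 1))  # low score: label 1 conforms
print(classification_score(probs, 0))  # high score: label 0 looks "weird"
print(regression_score(2.3, 2.0))
```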

How do conformal predictions work?
• Divide the training set into two disjoint sets: $Z_t$ with $|Z_t| = m$ and $Z_c$ with $|Z_c| = n$, $m + n = p$
• Build the underlying model, $h$, using $Z_t$
• Apply $f(z_i) = \alpha_i$ to the elements of the set you did not use for training $h$, and estimate its probability distribution: $\alpha_1, \dots, \alpha_n \sim Q$
• When a new example comes in, $(x, h(x) = y)$, we decide whether to reject $y$:
  o We will reject $y$ if $f(x, y) = \alpha_y$ does not belong to $Q$
• We compute the non-conformity degree, which is called the p-value, as follows:
$p_y = \frac{|\{z_j \in Z_c : \alpha_j \ge \alpha_y\}|}{n + 1}$
• Finally, the prediction region:
$\Gamma^\varepsilon = \{y \in Y : p_y > \varepsilon\}$
Is $y$ a very non-conforming example?
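The p-value and the prediction region can be written as two small Python functions. This is a sketch that follows the formula as stated on the slide. Many references also add 1 to the numerator to count the new example itself, and that variant is not used here. The calibration scores, the labels "A" and "B" and the function names are hypothetical.

```python
import numpy as np

def p_value(calibration_scores, candidate_score):
    """Number of calibration scores >= the candidate score, divided by n + 1."""
    calibration_scores = np.asarray(calibration_scores)
    n = len(calibration_scores)
    return np.sum(calibration_scores >= candidate_score) / (n + 1)

def prediction_region(calibration_scores, candidate_scores, epsilon):
    """Keep every candidate label whose p-value exceeds epsilon."""
    return {
        label
        for label, score in candidate_scores.items()
        if p_value(calibration_scores, score) > epsilon
    }

# Hypothetical calibration scores (n = 9) and non-conformity scores of two candidate labels.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.40, 0.55, 0.70, 0.90]
candidates = {"A": 0.15, "B": 0.95}
region = prediction_region(calibration, candidates, epsilon=0.05)
print(region)
```

A label is dropped from the region only when very few calibration examples look at least as strange as it does.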

Conformal prediction output
The prediction region $\Gamma^\varepsilon$ contains prediction $y$ with probability $1 - \varepsilon$
• In classification:
  o $\alpha_y$ is known, but we need to compute $p_y$
  o The result is a set of labels: $\Gamma^\varepsilon = \{Class_1, Class_3, Class_5\}$ s.t. $P(y \in \Gamma^\varepsilon) = 1 - \varepsilon$
    o If $\Gamma^\varepsilon = \emptyset$, then always erroneous
    o If $\Gamma^\varepsilon = \{C\}$ (only one class), then always true (if it is the correct class)
    o If $\Gamma^\varepsilon = \{Class_1, Class_3, \dots, Class_5\}$ (several classes), then always correct
• In regression:
  o $p_y$ is known, but we need to compute $\alpha_y$
  o The result is an interval: $\Gamma^\varepsilon = [a, b]$ where $a, b \in \mathbb{R}$ and s.t. $P(y \in \Gamma^\varepsilon) = 1 - \varepsilon$

Algorithm to compute conformal prediction regions in classification problems
Let $Z = (X, Y)$ be the historical data set for our classification problem, where:
  o $|Z| = p$, $X$ is the information about the problem and $Y = \{C_1, \dots, C_s\}$ is the set of labels.
  o $Z$ is exchangeable.
To obtain the prediction region:
1. Divide $Z$ into two disjoint sets:
  o $Z_t$, the proper training set, with $|Z_t| = m$
  o $Z_c$, the calibration set, with $|Z_c| = n$
2. Fit a classifier, $h(X) = Y$, using $Z_t$
3. Define a non-conformity function $f(z)$ to measure the weirdness of your samples
4. Apply $f(z)$ to each element in $Z_c$ to obtain the calibration scores: $\alpha_1, \dots, \alpha_n$
5. Set a significance level $\varepsilon \in (0, 1)$

Algorithm to compute conformal predictions in classification problems
6. For a new sample $(x, y)$ compute the scoring value for each label in $Y$:
$\forall C_i \in Y: \quad f(x, y = C_i) = \alpha_{C_i}$
7. For each label in $Y$ compute the p-value as follows:
$\forall C_i \in Y: \quad p_{C_i} = \frac{|\{z_j \in Z_c : \alpha_j \ge \alpha_{C_i}\}|}{n + 1}$
8. Finally build the prediction region as follows:
$\Gamma^\varepsilon = \{C_i \in Y : p_{C_i} > \varepsilon\}$, then for the new prediction $h(x) = y$, $P(y \in \Gamma^\varepsilon) = 1 - \varepsilon$
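Steps 1 to 8 can be put together in one short, self-contained Python sketch. Everything concrete here is an assumption made for illustration: the data are synthetic, the "classifier" is a toy logistic curve built from the class means (a stand-in for any real model), and the non-conformity function is one minus the probability assigned to the label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: one feature, two classes (0 and 1).
x = np.concatenate([rng.normal(-1.0, 1.0, 200), rng.normal(1.0, 1.0, 200)])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

# Step 1: shuffle and split into proper training set Z_t and calibration set Z_c.
order = rng.permutation(len(x))
train, calib = order[:300], order[300:]

# Step 2: fit a toy classifier h on Z_t (stand-in for any real model).
mean0 = x[train][y[train] == 0].mean()
mean1 = x[train][y[train] == 1].mean()

def predict_proba(values):
    """Return an (n, 2) array of class probabilities from a logistic curve."""
    midpoint = (mean0 + mean1) / 2.0
    p1 = 1.0 / (1.0 + np.exp(-(mean1 - mean0) * (values - midpoint)))
    return np.column_stack([1.0 - p1, p1])

# Step 3: non-conformity = 1 - probability assigned to the label (assumed choice).
def nonconformity(values, labels):
    proba = predict_proba(np.atleast_1d(values))
    return 1.0 - proba[np.arange(len(proba)), labels]

# Step 4: calibration scores alpha_1, ..., alpha_n.
alphas = nonconformity(x[calib], y[calib])
n = len(alphas)

# Step 5: significance level.
epsilon = 0.05

def conformal_region(x_new):
    """Steps 6-8: score each label, compute its p-value, keep labels with p > epsilon."""
    region = set()
    for label in (0, 1):
        alpha_label = nonconformity(x_new, np.array([label]))[0]
        p = np.sum(alphas >= alpha_label) / (n + 1)
        if p > epsilon:
            region.add(label)
    return region

print(conformal_region(3.0))  # a point far on the class-1 side
print(conformal_region(0.0))  # a point near the class boundary
```

Points close to the boundary tend to get larger regions, which is how the method expresses that the model is less sure there.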

Algorithm to compute conformal prediction regions in regression problems
Let $Z = (X, Y)$ be the historical data set for our regression problem, where:
  o $|Z| = p$, $X$ is the information about the problem and $Y$ is a continuous target.
  o $Z$ is exchangeable.
To obtain the prediction region:
1. Divide $Z$ into two disjoint sets:
  o $Z_t$, the proper training set, with $|Z_t| = m$
  o $Z_c$, the calibration set, with $|Z_c| = n$
2. Fit a regression model, $h(X) = Y$, using $Z_t$
3. Define a non-conformity function $f(z)$ to measure the weirdness of your samples
4. Apply $f(z)$ to each element in $Z_c$ to obtain the calibration scores: $\alpha_1, \dots, \alpha_n$
5. Set a significance level $\varepsilon \in (0, 1)$

Algorithm to compute conformal predictions in regression problems
6. Sort the calibration scores $\alpha_1, \dots, \alpha_n$ in descending order
7. Compute the index $s = \lfloor \varepsilon (n + 1) \rfloor$
  o This is the index of the $(1 - \varepsilon)$-percentile of the non-conformity scores, $\alpha_s$
8. Finally, the prediction region for a new sample:
$\Gamma^\varepsilon = h(x_i) \pm \alpha_s$, with $P(y_i \in \Gamma^\varepsilon) = 1 - \varepsilon$
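The regression steps can be sketched the same way. The data below are synthetic and a least-squares line stands in for the regression model, so both are assumptions for illustration only. The index follows the slide's 1-based counting, and the sketch assumes $\varepsilon (n + 1) \ge 1$ so that $s$ is a valid position.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, hypothetical data: y = 3x + noise.
x = rng.uniform(0.0, 10.0, 500)
y = 3.0 * x + rng.normal(0.0, 1.0, 500)

# Step 1: proper training set Z_t and calibration set Z_c.
order = rng.permutation(len(x))
train, calib = order[:400], order[400:]

# Step 2: fit h on Z_t (a least-squares line stands in for any regression model).
slope, intercept = np.polyfit(x[train], y[train], deg=1)

def h(values):
    return slope * values + intercept

# Steps 3-4: absolute-error non-conformity scores on the calibration set.
alphas = np.abs(y[calib] - h(x[calib]))
n = len(alphas)

# Step 5: significance level.
epsilon = 0.2

# Step 6: sort the scores in descending order.
sorted_alphas = np.sort(alphas)[::-1]

# Step 7: index s = floor(epsilon * (n + 1)), counted from 1 as on the slide.
s = int(np.floor(epsilon * (n + 1)))
alpha_s = sorted_alphas[s - 1]

# Step 8: fixed-width interval around the prediction for a new x.
x_new = 4.0
interval = (h(x_new) - alpha_s, h(x_new) + alpha_s)
print(s, alpha_s, interval)
```

Because $\alpha_s$ does not depend on $x$, every new sample gets an interval of the same width.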

Application
Classification with conformal predictors
• The dataset is imbalanced (Total Loss is the minority class)
• The model is XGBoost
• Model performance:
• A new accident happens and the model says it is a Total Loss, but how confident are we?
• Due to business restrictions we have to minimize the number of false positives in TL
PROBLEM: To find out whether a car is a total loss or not

Application
• We take the test set, $Z_{test} = \{(x_i, y_i)\}$ with $i = 1, \dots, M$
• We define a non-conformity function:
$f(z) = \frac{\text{probability}_{\text{class } i} + \text{calibrated probability}_{\text{class } i}}{2}$
where:
  o $\text{probability}_{\text{class } i}$ is the probability according to the model that $y = \text{class } i$
  o $\text{calibrated probability}_{\text{class } i}$ is the recalibrated probability that $y = \text{class } i$

Application
• Let us assume $M = 9$ and apply $f(z)$ to each $z_i \in Z_{test}$
• We order the scores, and use them to compute the p-value per label for the new accident:
  o TL = 0.85, p-value TL = 8/(9+1) = 0.8 > $\varepsilon$ = 0.05
  o Non-TL = 0.15, p-value non-TL = 2/(9+1) = 0.2 > $\varepsilon$ = 0.05
$\Gamma^\varepsilon = \{TL, \text{non-}TL\}$ s.t. $P(y \in \Gamma^\varepsilon) = 0.95$

Application
Classification with conformal predictions

Application
Regression with conformal predictors
• The dataset is not correctly labelled; there were some inconsistencies.
• The model is XGBoost.
• Model performance:
• The model output was the input to another model
PROBLEM: to compute/find out the price of a car

Application
Regression with conformal predictions

Application
Regression with conformal predictors
• We take the test set, $Z_{test} = \{(x_i, y_i)\}$ with $i = 1, \dots, M$
• We define a non-conformity function:
$f(z) = |y - h(x)|$
where:
  o $y$ is the true value, and $h(x)$ the model prediction
• Let us assume $M = 9$ and apply $f(z)$ to each $z_i \in Z_{test}$
• We sort the scores in descending order
• We set $\varepsilon = 0.2$, then the index of the score is $s = \lfloor 0.2 \cdot (9 + 1) \rfloor = 2$, which selects $\alpha_{s=2}$
• The fixed-width conformal interval would be: $h(x) \pm 189.52$

Take away
• Good model performance does not mean trustworthy predictions.
• Conformal prediction is a useful tool with different applications.
• It is easy to understand and to implement.
• Defining a non-conformity function is not always easy.
• Confidence around predictions brings some

References
Some interesting readings
1. V. Vovk, A. Gammerman, G. Shafer, Algorithmic Learning in a Random World, Springer, 2005.
2. H. Linusson, An introduction to conformal predictions, 2017.
3. V. Vovk, Cross-conformal predictors, Annals of Mathematics and Artificial Intelligence, 1-20, 2013.
4. U. Johansson, H. Bostrom, T. Lofstrom, H. Linusson, Regression conformal predictors with random forest, Machine Learning, 95, 155-176, 2014.
5. V. Balasubramanian, S-S. Ho, V. Vovk, Conformal predictions for reliable machine learning, Science Direct Journal and Book, 2014.

How good is your prediction? Quantifying uncertainty in Machine Learning predictions
Questions
