VSSML18. Evaluations

Valencian Summer School in Machine Learning
4th edition
September 13-14, 2018

Evaluations
Proving a Model Works
Poul Petersen
CIO, BigML, Inc

Why Evaluations
• FACT: No model is perfect - they all make mistakes
• Your data has mistakes
• Models are “approximations”
• Today you have seen models that predict:
• Churn: How many people will churn that we didn’t predict?
• Diabetes: How many patients might have diabetes that we said were fine?
• Home Prices: How accurate are the predicted prices?
• You have also seen several different kinds of models
• Decision Trees / Ensembles / Logistic Regression /
Deepnets
• Which one works the best for your data?

Easy Right?
INTL MIN   INTL CALLS   INTL CHARGE   CUST SERV CALLS   CHURN   PREDICT CHURN (Model Prediction)
8,7        4            2,35          1                 False   False
11,2       5            3,02          0                 False   True
12,7       6            3,43          4                 True    True
9,1        5            2,46          0                 False   False
11,2       2            3,02          1                 False   False
12,3       5            3,32          3                 False   False
13,1       6            3,54          4                 False   False
5,4        9            1,46          4                 True    False
13,8       4            3,73          1                 False   False
Look for Mistakes!

Evaluations Demo #1

What Just Happened?
• We started with the churn Datasource
• Created a Dataset
• Built a Model to predict churn
• We used the Model to predict churn for each customer in the
Dataset using a Batch Prediction
• Downloaded the Batch Prediction as a CSV and looked for
errors. That is, when the Prediction did not match the known
true value for churn
• The comparison was tedious!
• Examining one line at a time
• Hard to understand - need some metrics!!!
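The line-by-line comparison can be automated. The following is a minimal sketch, not part of the demo: it copies the CHURN and PREDICT CHURN values from the churn example above into two lists and reports the rows where they disagree. The name `find_mistakes` is made up for illustration.

```python
# Known and predicted churn values, copied from the nine-row churn example.
actual    = [False, False, True, False, False, False, False, True, False]
predicted = [False, True,  True, False, False, False, False, False, False]

def find_mistakes(actual, predicted):
    """Return the row indices where the prediction differs from the known value."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]

mistakes = find_mistakes(actual, predicted)
print(f"{len(mistakes)} mistakes in {len(actual)} rows, at rows {mistakes}")
```

A list of mismatched rows is still hard to interpret for a large Dataset, which is why summary metrics are needed.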

Evaluation Metrics
• Imagine we have a model that can predict a person’s dominant
hand, that is for any individual it predicts left / right
• Define the positive class
• This selection is arbitrary
• It is the class you are interested in!
• The negative class is the “other” class (or others)
• For this example, we choose: left

Evaluation Metrics
• We choose the positive class: left
• True Positive (TP)
• We predicted left and the correct answer was left
• True Negative (TN)
• We predicted right and the correct answer was right
• False Positive (FP)
• Predicted left but the correct answer was right
• False Negative (FN)
• Predicted right but the correct answer was left

Evaluation Metrics
True Positive: Correctly predicted the positive class
True Negative: Correctly predicted the negative class
False Positive: Incorrectly predicted the positive class
False Negative: Incorrectly predicted the negative class
Remember…
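The four outcomes can be counted mechanically once the positive class is chosen. This is a sketch under stated assumptions: the function name `confusion_counts` and the ten handedness labels are invented for illustration, arranged so that the counts match the TP = 2, FP = 2, TN = 5, FN = 1 example used in the Precision and Recall slides.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP and FN for one chosen positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the answer was positive too.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the answer was actually positive.
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical data: 3 left-handers and 7 right-handers.
actual    = ["left"] * 3 + ["right"] * 7
predicted = ["left", "left", "right", "left", "left"] + ["right"] * 5

counts = confusion_counts(actual, predicted, positive="left")
print(counts)
```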

Accuracy
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Ex: 90% of people are right-handed and 10% are left
• A silly model which always predicts right handed is
90% accurate
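The unbalanced-class trap is easy to reproduce. A small sketch, assuming a population of 100 people split 90 / 10 as in the bullet above; the function name `accuracy` is chosen for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Positive class = left. The silly model always predicts "right", so it
# never produces a positive prediction: TP = 0 and FP = 0.
# All 90 right-handers are true negatives, all 10 left-handers are missed.
always_right = accuracy(tp=0, tn=90, fp=0, fn=10)
print(always_right)
```

The model scores 0.9 without ever identifying a single left-hander, which is why accuracy alone is not enough.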

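Accuracy
[Figure: 10 people sorted into “Classified as Left Handed” and “Classified as Right Handed”; Positive Class = Left, Negative Class = Right]
TP = 0
FP = 0
TN = 7
FN = 3
$$\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{0 + 7}{10} = 70\%$$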

Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$
• “accuracy” or “purity” of positive class
• How well you did separating the positive class from the negative class
• If Precision = 1 then no FP.
• You may have missed some left handers, but of the ones you identified, all are left handed. No mistakes.
• If Precision = 0 then no TP
• None of the left handers you identified are actually left handed. All mistakes.

Precision
[Figure: 10 people sorted into “Classified as Left Handed” and “Classified as Right Handed”; Positive Class = Left, Negative Class = Right]
TP = 2
FP = 2
TN = 5
FN = 1
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{2}{2 + 2} = 50\%$$

Recall
$$\text{Recall} = \frac{TP}{TP + FN}$$
• percentage of positive class correctly identified
• A measure of how well you identified all of the positive
class examples
• If Recall = 1 then no FN → All left handers identified
• There may be FP, so precision could be <1
• If Recall = 0 then no TP → No left handers identified

Recall
[Figure: 10 people sorted into “Classified as Left Handed” and “Classified as Right Handed”; Positive Class = Left, Negative Class = Right]
TP = 2
FP = 2
TN = 5
FN = 1
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{2}{2 + 1} \approx 66.7\%$$
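Precision and Recall both follow directly from the counts. A minimal sketch using the TP = 2, FP = 2, FN = 1 example; returning 0.0 when the denominator is zero is an assumption made here for convenience, not something the slides define.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention for 0 / 0

def recall(tp, fn):
    """Of everything really positive, the fraction that was identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention for 0 / 0

p = precision(tp=2, fp=2)
r = recall(tp=2, fn=1)
print(f"precision={p:.3f} recall={r:.3f}")
```

Note that TN appears in neither formula: both metrics look only at the positive class.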

f-Measure
$$f = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
• harmonic mean of Recall & Precision
• If f-measure = 1 then Recall == Precision == 1
• If Precision OR Recall is small then the f-measure is small

f-Measure
[Figure: “Classified as Left Handed” and “Classified as Right Handed”; Positive Class = Left, Negative Class = Right]
R = 66%
P = 50%
f = 57%

Phi Coefficient
$$\phi = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
• Returns a value between -1 and 1
• If -1 then predictions are opposite reality
• If 0 then no correlation between predictions and reality
• If 1 then predictions are always correct
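The formula translates directly into code. This is a sketch: the name `phi_coefficient` is illustrative, and returning 0.0 when the denominator is zero (for example when the model never predicts one of the classes) is an assumption, since the formula itself is undefined there.

```python
import math

def phi_coefficient(tp, tn, fp, fn):
    """Correlation between predictions and reality, from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention; the formula is undefined here
    return (tp * tn - fp * fn) / denominator

phi = phi_coefficient(tp=2, tn=5, fp=2, fn=1)
print(round(phi, 3))
```

With no mistakes (FP = FN = 0) the result is 1, and with every prediction wrong (TP = TN = 0) it is -1.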

Phi Coefficient
[Figure: “Classified as Left Handed” and “Classified as Right Handed”; Positive Class = Left, Negative Class = Right]
TP = 2
FP = 2
TN = 5
FN = 1
Phi = 0.356

Evaluations Demo #2

What Just Happened?
• Starting with the Diabetes Source, we created a Dataset and
then a Model.
• Using both the Model and the original Dataset, we created an
Evaluation.
• We reviewed the metrics provided by the Evaluation:
• Confusion Matrix
• Accuracy, Precision, Recall, f-measure and
phi
• This Model seemed to perform really, really well…
Question: Can we trust this model?

Evaluation Danger!
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!

“Memorizing” Training Data
Training:
plasma glucose   bmi    diabetes pedigree   age   diabetes
148              33,6   0,627               50    TRUE
85               26,6   0,351               31    FALSE
183              23,3   0,672               32    TRUE
89               28,1   0,167               21    FALSE
137              43,1   2,288               33    TRUE
116              25,6   0,201               30    FALSE
78               31     0,248               26    TRUE
115              35,3   0,134               29    FALSE
197              30,5   0,158               53    TRUE
Evaluating:
plasma glucose   bmi    diabetes pedigree   age   diabetes
148              33,6   0,627               50    ?
85               26,6   0,351               31    ?
• Exactly the same values!
• Who needs a model?
• What we want to know is how the model performs with values never seen at training:
124              22     0,107               46    ?
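To make the danger concrete, here is a deliberately extreme sketch: a "model" that is nothing more than a lookup table of the training rows. It is a hypothetical illustration, not how BigML models work. The rows are the nine training instances above, with decimal commas written as decimal points.

```python
# (plasma glucose, bmi, diabetes pedigree, age) -> diabetes
training = {
    (148, 33.6, 0.627, 50): True,
    (85, 26.6, 0.351, 31): False,
    (183, 23.3, 0.672, 32): True,
    (89, 28.1, 0.167, 21): False,
    (137, 43.1, 2.288, 33): True,
    (116, 25.6, 0.201, 30): False,
    (78, 31, 0.248, 26): True,
    (115, 35.3, 0.134, 29): False,
    (197, 30.5, 0.158, 53): True,
}

def memorizing_model(row):
    """Return the stored label for a row seen at training, else None."""
    return training.get(row)

# Evaluating with the training data: every answer is simply looked up.
correct = sum(memorizing_model(row) == label for row, label in training.items())
train_accuracy = correct / len(training)

# A row never seen at training: the lookup table has nothing to say.
unseen = memorizing_model((124, 22, 0.107, 46))
print(train_accuracy, unseen)
```

A perfect score on the training data says nothing about the unseen row, which is the case that matters.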

Evaluation Danger!
• If you only have one Dataset, use a train/test split

Train / Test Split
Train:
plasma glucose   bmi    diabetes pedigree   age   diabetes
148              33,6   0,627               50    TRUE
183              23,3   0,672               32    TRUE
89               28,1   0,167               21    FALSE
78               31     0,248               26    TRUE
115              35,3   0,134               29    FALSE
197              30,5   0,158               53    TRUE
Test:
plasma glucose   bmi    diabetes pedigree   age   diabetes
85               26,6   0,351               31    FALSE
137              43,1   2,288               33    TRUE
116              25,6   0,201               30    FALSE
• These instances were never seen at training time.
• Better evaluation of how the model will perform with “new” data

Train / Test Split
[Diagram: DATASET → TRAIN SET + TEST SET → PREDICTIONS → METRICS]
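A random train/test split can be sketched in a few lines of plain Python. The function name, the 80 / 20 default, and the fixed seed are assumptions for illustration; the demo uses BigML's own split, which is not reproduced here.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(9))                  # stand-ins for the nine diabetes rows
train, test = train_test_split(rows, test_fraction=1 / 3)
print(len(train), len(test))
```

The model is built from `train` only; predictions are then made for `test` and compared with the known values to produce the metrics.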

Evaluation Danger!
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
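One common way to "repeat several times" is k-fold cross-validation: split the data into k parts and let each part serve once as the test set. The sketch below only generates the index splits; the function name `k_fold_indices`, k = 5, and the seed are illustrative assumptions.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Building and evaluating one model per split and averaging the metrics reduces the chance that a single "lucky" split decides the result.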

Evaluations Demo #3

What Just Happened?
• Starting with the Diabetes Dataset we created a train/test split
• We built a Model using the train set and evaluated it with the
test set
• The scores were much worse than before, showing the danger
of evaluating with training data.
• Then we launched several other types of models and used the
evaluation comparison tool to see which model algorithm
performed the best.
Question:
Couldn’t we search for the best Model?
STAY
TUNED

Evaluation
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
• Don’t forget that accuracy can be misleading!
• Mostly useless with unbalanced classes (left/right?)
• Use weighting, operating points, other tricks…

Weighting
Instance   Rate   Payment   Outcome   Predict   Confidence
1          23 %   134       Paid      Paid      20 %
2          23 %   134       Paid      Paid      25 %
3          23 %   134       Paid      Paid      30 %
...        ...    ...       ...       ...       ...
1000       23 %   134       Paid      Paid      99,5 %
1001       23 %   134       Default   Paid      99,4 %
Problem: Default is “more important”, but occurs less often than Paid
Solution: Weights tell the model to treat instances of a specific class (in this case Default) with more importance
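One simple way to choose such weights is to make them inversely proportional to class frequency, so that each class carries the same total weight. This sketch assumes the 1,000 Paid and 1 Default instances suggested by the table; the function name and the exact formula are a common convention used here for illustration, not necessarily what BigML computes internally.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (number_of_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

labels = ["Paid"] * 1000 + ["Default"]
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the single Default instance counts as much as all 1,000 Paid instances together.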

Operating Points
• The default probability threshold is 50%
• Changing the threshold can change the outcome for a specific class
Rate     Payment   …   Actual Outcome   Probability PAID   Threshold @ 50%   Threshold @ 60%   Threshold @ 90%
8,4 %    $456      …   PAID             95 %               PAID              PAID              PAID
9,6 %    $134      …   PAID             87 %               PAID              PAID              DEFAULT
18 %     $937      …   DEFAULT          36 %               DEFAULT           DEFAULT           DEFAULT
21 %     $35       …   PAID             88 %               PAID              PAID              DEFAULT
17,5 %   $1.044    …   DEFAULT          55 %               PAID              DEFAULT           DEFAULT
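The threshold columns can be reproduced from the PAID probabilities alone. A minimal sketch; the function name `classify` and the use of "greater than or equal to" at the boundary are assumptions.

```python
def classify(prob_paid, threshold=0.5):
    """Predict PAID only when its probability reaches the threshold."""
    return "PAID" if prob_paid >= threshold else "DEFAULT"

# Probability of PAID for the five loans in the table.
probabilities = [0.95, 0.87, 0.36, 0.88, 0.55]

outcomes = {
    threshold: [classify(p, threshold) for p in probabilities]
    for threshold in (0.5, 0.6, 0.9)
}
for threshold, predicted in outcomes.items():
    print(threshold, predicted)
```

Raising the threshold moves borderline loans from PAID to DEFAULT: more defaults are caught, at the price of flagging some loans that were actually paid.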

Lending Club Dataset
• Peer to Peer lending service
• As an investor, we want a way to
identify loans that are a lower risk
• Fortunately, the data for the outcome
(paid or default) for past loans is
available from Lending Club.
• Using this data, we can build a
model to predict which loans are
good or bad
Instance   Rate    Payment   Outcome
1          8,4 %   456       Paid
2          9,6 %   134       Paid
3          18 %    937       Default
[Diagram: NEW LOANS → MODEL → GOOD / BAD]

Evaluations Demo #4

What just happened?
• We split the Lending Club data into training and test Datasets
• We created a Model and Evaluation
• Looking at the Accuracy, the Model appeared to perform well, but only because of the unbalanced classes
• The resulting Model did well at predicting good loans
• But bad loans are "more important"
• We tried different weights to increase the Recall of bad loans:
• objective balancing: equal consideration
• class weights: bad = 1000, good = 1
• Finally, we explored the impact of changing the probability
threshold
Wait - What about regressions?

Regression - Fitting a Line
[Figure: Data Points and the Model line fitted to them]

Mean Absolute Error
[Figure: Data Points and Model line, with the errors e1 … e7 marked]
$$\text{MAE} = \frac{|e_1| + |e_2| + \dots + |e_n|}{n}$$

Mean Squared Error
[Figure: Data Points and Model line, with the errors e1 … e7 marked]
$$\text{MSE} = \frac{e_1^2 + e_2^2 + \dots + e_n^2}{n}$$

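MSE versus MAE
• For both MAE & MSE: Smaller is better, but values are unbounded
• MSE is not always larger than MAE: squaring shrinks errors below 1, so MSE < MAE is possible
• The square root of MSE (RMSE) is always larger than or equal to MAE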
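A quick check with made-up errors shows how the two metrics react to small and large errors: squaring shrinks errors below 1 and magnifies errors above 1. The error lists are invented for illustration.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

small = [0.5, -0.5, 0.5, -0.5]   # every error below 1 in magnitude
large = [2, -2, 2, -2]           # every error above 1 in magnitude

for name, errors in (("small", small), ("large", large)):
    print(name, mae(errors), mse(errors), math.sqrt(mse(errors)))
```

For the small errors MSE (0.25) is below MAE (0.5); for the large errors MSE (4) is above MAE (2). In both cases the square root of MSE is at least as large as MAE.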

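R-Squared Error
[Figure: Data Points, Model line, and Mean line]

R-Squared Error
[Figure: Mean line, with v1 … v7 marked]

R-Squared Error
[Figure: Model errors e1 … e7 and v1 … v7 marked, with the Mean line]
$$\text{RSE} = 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{mean}}}$$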

R-Squared Error
• RSE: measure of how much better the model is than always predicting the mean
• < 0 model is worse than the mean
• MSEmodel > MSEmean
• = 0 model is no better than the mean
• MSEmodel = MSEmean
• ➞ 1 model fits the data “perfectly”
• MSEmodel = 0 (or MSEmean >> MSEmodel)
$$\text{RSE} = 1 - \frac{\text{MSE}_{\text{model}}}{\text{MSE}_{\text{mean}}}$$
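The three cases can be verified with a small sketch. The function name `r_squared` and the five target values are illustrative; the formula is the RSE = 1 - MSEmodel / MSEmean defined above.

```python
def r_squared(actual, predicted):
    """1 - MSE of the model / MSE of always predicting the mean."""
    n = len(actual)
    mean = sum(actual) / n
    mse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    mse_mean = sum((a - mean) ** 2 for a in actual) / n
    # Note: undefined (division by zero) if every actual value is identical.
    return 1 - mse_model / mse_mean

actual = [1, 2, 3, 4, 5]                     # made-up target values

perfect = r_squared(actual, [1, 2, 3, 4, 5])  # model fits exactly
as_mean = r_squared(actual, [3, 3, 3, 3, 3])  # model always predicts the mean
worse   = r_squared(actual, [5, 4, 3, 2, 1])  # model predicts the reverse
print(perfect, as_mean, worse)
```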

Evaluations Demo #5

What just happened?
• We split the RedFin data into training and test Datasets
• We created a Model and Evaluation
• We examined the Evaluation metrics
Wait - What about Time Series?

Independent Data
Color    Mass   PPAP
red      11     pen
green    45     apple
red      53     apple
yellow   0      pen
blue     2      pen
green    422    pineapple
yellow   555    pineapple
blue     7      pen
Discovering patterns:
• Color = “red” ⇒ Mass < 100
• PPAP = “pineapple” ⇒ Color ≠ “blue”
• Color = “blue” ⇒ PPAP = “pen”

Independent Data
Color    Mass   PPAP
green    45     apple
blue     2      pen
green    422    pineapple
blue     7      pen
yellow   0      pen
yellow   555    pineapple
red      53     apple
red      11     pen
Patterns still hold when rows are re-arranged:
• Color = “red” ⇒ Mass < 100
• PPAP = “pineapple” ⇒ Color ≠ “blue”
• Color = “blue” ⇒ PPAP = “pen”

Dependent Data
Year   Pineapple Harvest
1986   50,74
1987   22,03
1988   50,69
1989   40,38
1990   29,80
1991   9,90
1992   73,93
1993   22,95
1994   139,09
1995   115,17
1996   193,88
1997   175,31
1998   223,41
1999   295,03
2000   450,53
[Chart: Pineapple Harvest, Tons (0 to 500) by Year (1986 to 2000), with Trend and Error marked]

Dependent Data
[Chart: Pineapple Harvest, Tons (0 to 500) by Year (1986 to 2000)]
Year   Pineapple Harvest
1986   139,09
1987   175,31
1988   9,90
1989   22,95
1990   450,53
1991   73,93
1992   40,38
1993   22,03
1994   295,03
1995   50,74
1996   29,80
1997   223,41
1998   115,17
1999   193,88
2000   50,69
Rearranging Disrupts Patterns

Random Train / Test Split
Train:
plasma glucose   bmi    diabetes pedigree   age   diabetes
148              33,6   0,627               50    TRUE
183              23,3   0,672               32    TRUE
89               28,1   0,167               21    FALSE
78               31     0,248               26    TRUE
115              35,3   0,134               29    FALSE
197              30,5   0,158               53    TRUE
Test:
plasma glucose   bmi    diabetes pedigree   age   diabetes
85               26,6   0,351               31    FALSE
137              43,1   2,288               33    TRUE
116              25,6   0,201               30    FALSE

Linear Train / Test Split
Train:
Year   Pineapple Harvest
1986   50,74
1987   22,03
1988   50,69
1989   40,38
1990   29,80
1991   9,90
1992   73,93
1993   22,95
1994   139,09
1995   115,17
1996   193,88
Test:
Year   Pineapple Harvest
1997   175,31
1998   223,41
1999   295,03
2000   450,53
[Diagram labels: Forecast, COMPARE]
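For ordered data the split must keep the order: train on the earliest rows and test on the most recent ones. A minimal sketch using the pineapple harvest values above (decimal commas written as decimal points); the function name `linear_split` and its `n_test` parameter are assumptions for illustration.

```python
harvest = [
    (1986, 50.74), (1987, 22.03), (1988, 50.69), (1989, 40.38),
    (1990, 29.80), (1991, 9.90), (1992, 73.93), (1993, 22.95),
    (1994, 139.09), (1995, 115.17), (1996, 193.88), (1997, 175.31),
    (1998, 223.41), (1999, 295.03), (2000, 450.53),
]

def linear_split(series, n_test):
    """Keep chronological order: the last n_test points become the test set."""
    ordered = sorted(series)             # sort by year, never shuffle
    return ordered[:-n_test], ordered[-n_test:]

train, test = linear_split(harvest, n_test=4)
print(train[-1][0], test[0][0])
```

A forecast built from the 1986 to 1996 rows can then be compared with the held-out 1997 to 2000 values.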

Evaluations Demo #6
