2nd edition
#MLSEV 2
Evaluations
All models are wrong, but some are useful
Charles Parker
VP Algorithms, BigML, Inc
#MLSEV 3
My Model Is Wonderful
• I trained a model on my data and it
seems really marvelous!
• How do you know for sure?
• To quantify your model’s
performance, you must evaluate it
• This is not optional. If you don’t
do this and do it right, you’ll have
problems
#MLSEV 4
Proper Evaluation
• Choosing the right metric
• Testing on the right data (which might be harder than you think)
• Replicating your tests
#MLSEV 5
Metric Choice
#MLSEV 6
Proper Evaluation
• The most basic workflow for model evaluation is:
• Split your data into two sets, training and testing
• Train a model on the training data
• Measure the “performance” of the model on the testing data
• If your training data is representative of what you will see in the future, that’s
the performance you should get out of your model
• What do we mean by “performance”? This is where you come in.
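As a concrete illustration of that split / train / measure loop, here is a minimal Python sketch. It uses scikit-learn and a synthetic dataset purely as stand-ins; the slides themselves are not tied to any particular library.

    # Minimal split / train / measure workflow (scikit-learn and synthetic data as stand-ins)
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, random_state=0)   # stand-in dataset

    # 1) Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 2) Train a model on the training data only
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # 3) Measure "performance" on the held-out testing data
    print(accuracy_score(y_test, model.predict(X_test)))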
#MLSEV 7
Medical Testing Example
• Let’s say we develop an ML model that can
diagnose a disease
• About 1 in 1000 people who are tested by
the model turn out to have the disease
• Call the people who have the disease
“sick” and people who don’t have it “well”.
• How well do we do on a test set?
#MLSEV 8
Some Terminology
We’ll define the sick people as “positive” and the well people as “negative”.
• “True Positive”: You’re sick and the model diagnosed you as sick
• “False Positive”: You’re well, but the model diagnosed you as sick
• “True Negative”: You’re well, and the model diagnosed you as well
• “False Negative”: You’re sick, but the model diagnosed you as well
The model is correct in the “true” cases, and incorrect in the “false” cases
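A tiny sketch of the four counts, with made-up labels (1 = sick / positive, 0 = well / negative) just to show the bookkeeping:

    # Counting TP / FP / TN / FN for a binary problem; labels here are illustrative only
    y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # 1 = actually sick, 0 = actually well
    y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # the model's diagnoses

    TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # sick, diagnosed sick
    FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # well, diagnosed sick
    TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # well, diagnosed well
    FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # sick, diagnosed well
    print(TP, FP, TN, FN)   # 2 2 3 1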
#MLSEV 9
Accuracy
Accuracy = (TP + TN) / Total
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Remember, only 1 in 1000 have the disease
• A silly model which always predicts “well” is 99.9% accurate (as the sketch below shows)
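A quick sketch of that failure mode, with a made-up population of 1,000 people:

    # Accuracy = (TP + TN) / Total misleads badly on unbalanced classes
    y_true = [1] + [0] * 999   # one sick person in 1,000
    y_pred = [0] * 1000        # the "silly" model: always predict well

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy)            # 0.999, yet the model never finds a sick person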
#MLSEV 10
Precision
• How well did we do when we predicted someone was sick?
• A test with high precision has few false positives
• Precision of 1.0 indicates that everyone who we predict is sick is actually sick
• What about people who we predict are well?

Precision = TP / (TP + FP) = 0.6 in the figure’s example
[Figure: sick and well people grouped into “Predicted Sick” and “Predicted Well”]
#MLSEV 11
Recall
• How well did we do when someone was actually sick?
• A test with high recall indicates few false negatives
• Recall of 1.0 indicates that everyone who was actually sick was correctly diagnosed
• But this doesn’t say anything about false positives!

Recall = TP / (TP + FN) = 0.75 in the figure’s example
[Figure: sick and well people grouped into “Predicted Sick” and “Predicted Well”]
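A short sketch of both formulas; the counts TP = 3, FP = 2, FN = 1 are chosen only so the numbers match the 0.6 and 0.75 shown in the figures.

    # Precision and recall from the raw counts (counts chosen to match the figures)
    TP, FP, FN = 3, 2, 1

    precision = TP / (TP + FP)   # of the people we called sick, how many were sick?
    recall    = TP / (TP + FN)   # of the people who were sick, how many did we catch?
    print(precision, recall)     # 0.6 0.75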
#MLSEV 12
Trade Offs
• We can “trivially maximize” both measures
• If you pick the sickest person and only label them sick and no one
else, you can probably get perfect precision
• If you label everyone sick, you are guaranteed perfect recall
• The unfortunate catch is that if you make one perfect, the
other is terrible, so you want a model that has both high
precision and recall
• This is what quantities like the F1 score and Phi
Coefficient try to do
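A sketch of both combined measures computed from the same kind of counts; the specific counts here are illustrative, not from the slides.

    # F1 and the phi (Matthews) coefficient both reward balanced precision and recall
    from math import sqrt

    TP, FP, TN, FN = 3, 2, 994, 1   # illustrative counts

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    phi = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    print(f1, phi)   # roughly 0.667 and 0.669 for these counts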
#MLSEV 13
Cost Matrix
• In many cases, the consequences of a true
positive and a false positive are very different
• You can define “costs” for each type of mistake
• Total Cost = TP * TP_Cost + FP * FP_Cost + FN * FN_Cost + TN * TN_Cost
• Here, we are willing to accept lots of false
positives in exchange for high recall
• What if a positive diagnosis resulted in
expensive or painful treatment?
Cost matrix for the medical diagnosis problem:

                     Classified Sick    Classified Well
    Actually Sick           0                 100
    Actually Well           1                   0
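A sketch of the total-cost computation under this matrix, with hypothetical confusion counts (the counts are not from the slides):

    # Total cost under the slide's cost matrix: false negatives cost 100, false positives cost 1
    costs  = {"TP": 0, "FP": 1, "FN": 100, "TN": 0}
    counts = {"TP": 3, "FP": 40, "FN": 1, "TN": 956}   # hypothetical test-set counts

    total_cost = sum(counts[k] * costs[k] for k in costs)
    print(total_cost)   # 40 * 1 + 1 * 100 = 140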
#MLSEV 14
Operating Thresholds
• Most classifiers don’t output a hard prediction directly. Instead they give a “score” for each
class
• The prediction you assign to an instance is usually a function of a threshold on
this score (e.g., if the score is over 0.5, predict true)
• You can experiment with an ROC curve to see how your metrics will change if
you change the threshold
• Lowering the threshold means you are more likely to predict the positive class, which improves
recall but introduces false positives
• Increasing the threshold means you predict the positive class less often (you are more “picky”),
which will probably increase precision but lower recall.
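A sketch of a threshold sweep, again using scikit-learn and synthetic data as stand-ins; the point is only how a score plus a threshold becomes a prediction.

    # Sweep the operating threshold and watch precision and recall trade off
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    scores = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    for threshold in (0.3, 0.5, 0.7):
        y_pred = (scores >= threshold).astype(int)          # lower threshold => more positives
        TP = int(np.sum((y_te == 1) & (y_pred == 1)))
        FP = int(np.sum((y_te == 0) & (y_pred == 1)))
        FN = int(np.sum((y_te == 1) & (y_pred == 0)))
        precision = TP / (TP + FP) if TP + FP else float("nan")
        print(threshold, precision, TP / (TP + FN))         # precision rises, recall falls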
#MLSEV 15
ROC Curve Example
#MLSEV 16
Holding Out Data
#MLSEV 17
Why Hold Out Data?
• Why do we split the dataset into training and testing sets? Why do we always
(always, always) test on data that the model training process did not see?
• Because machine learning algorithms are good at memorizing data
• We don’t care how well the model does on data it has already seen because it
probably won’t see that data again
• Holding out some of the data simulates the data the model will see in the future
#MLSEV 18
Memorization
Training:

    plasma glucose   bmi    diabetes pedigree   age   diabetes
    148              33.6   0.627               50    TRUE
    85               26.6   0.351               31    FALSE
    183              23.3   0.672               32    TRUE
    89               28.1   0.167               21    FALSE
    137              43.1   2.288               33    TRUE
    116              25.6   0.201               30    FALSE
    78               31     0.248               26    TRUE
    115              35.3   0.134               29    FALSE
    197              30.5   0.158               53    TRUE

Evaluating (the same rows the model has already seen):

    plasma glucose   bmi    diabetes pedigree   age   diabetes
    148              33.6   0.627               50    ?
    85               26.6   0.351               31    ?
• You don’t even need meaningful features;
the person’s name would be enough
• “Oh right, Bob. I know him. Yes, he
certainly has diabetes”
• As long as there are no duplicate names
in the dataset, it's a 100% accurate
model
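The same point as a sketch: a “model” that just memorizes names is perfect on the rows it has seen and has nothing to say about anyone new (the names here are made up).

    # Memorization as a lookup table keyed by the person's name
    training_data = {"Bob": True, "Alice": False, "Carol": True}   # name -> has diabetes?

    def predict(name):
        return training_data.get(name)   # None for anyone the model has never seen

    print(predict("Bob"))    # True  -- "Oh right, Bob. I know him."
    print(predict("Dave"))   # None  -- useless on new people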
#MLSEV 19
Well, That Was Easy
• Okay, so I’m not testing on the training
data, so I’m good, right? NO NO NO
• You also have to worry about information
leakage between training and test data.
• What is this? Let’s try to predict the daily
closing price of the stock market
• What happens if you hold out 10 random
days from your dataset?
• What if you hold out the last 10 days?
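A sketch of those two holdout choices for time-ordered data; `rows` is just a stand-in for one record per trading day, oldest first.

    # Random days leak the future into training; the last 10 days do not
    import random

    rows = [f"day_{i}" for i in range(250)]   # stand-in for 250 daily records, oldest first

    leaky_test_idx = random.sample(range(len(rows)), 10)   # each test day is surrounded by training days
    honest_test    = rows[-10:]                            # train on the past, test on the future
    honest_train   = rows[:-10]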
#MLSEV 20
Traps Everywhere!
• This is common when you have time-distributed
data, but can also happen in other instances:
• Let’s say we have a dataset of 10,000 pictures from 20 people, each labeled with the
year in which it was taken
• We want to predict the year from the image
• What happens if we hold out random data?
• Solution: Hold out users instead
#MLSEV 21
How Do We Avoid This?
• It’s a terrible problem, because if you make the mistake you will get results
that are too good, and be inclined to believe them
• So be careful. Do you have:
• Data where points can be grouped in time (by week or by month)?
• Data where points can be grouped by user (each point is an action a user took)?
• Data where points can be grouped by location (each point is a day of sales at a particular store)?
• Even if you’re only suspicious that points from the same group might leak information to
one another, try a test where you hold out a few whole groups (months, users,
locations) and train on the rest, as in the sketch below
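A sketch of a grouped holdout using scikit-learn's GroupKFold; the data and the 20 groups (think: 20 users) are synthetic stand-ins.

    # Hold out whole groups so no user appears in both training and testing
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GroupKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    groups = np.repeat(np.arange(20), 50)          # 20 "users", 50 rows each

    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        print(model.score(X[test_idx], y[test_idx]))   # every test row comes from unseen groups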
#MLSEV 22
Do It Again!
#MLSEV 23
One Test is Not Enough
• Even if you have a correct holdout, you still need to test more than once.
• Every result you get from any test is partly a result of randomness
• Randomness from the Data:
• The dataset you have is a finite sample drawn from an underlying distribution
• The split you make between training and test data is done at random
• Randomness of the algorithm
• The ordering of the data might give different results
• The best-performing algorithms (random forests, deepnets) have randomness built in
• With just one result, you might get lucky
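A sketch of repeating the same evaluation while reseeding both the split and the learner (scikit-learn and synthetic data as stand-ins):

    # Ten evaluations, each with a different seed for the split and the algorithm
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    scores = []
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    print(min(scores), max(scores))   # the spread matters as much as any single number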
#MLSEV 24
One Test is Not Enough
[Figure: the result of a single test on a performance axis, annotated “Really nice result!”]
#MLSEV 25
One Test is Not Enough
[Figure: the likelihood distribution over performance, showing the “really nice result” was really just a lucky one]
#MLSEV 26
Comparing Models is Even Worse
#MLSEV 27
Comparing Models is Even Worse
#MLSEV 28
Comparing Models is Even Worse
[Figure: model comparison results grouped by the first digit of the random seed]
#MLSEV 29
Please, Sir, Can I Have Some More?
• Always do more than one test!
• For each test, vary every source of randomness you can (change the seeds of all
random processes) so that you “experience” as much of the variance as possible
• Cross-validation helps (stratifying is great; Monte Carlo cross-validation can be a
useful simplification)
• Don’t just average the results! The variance is
important!
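A sketch of stratified cross-validation that reports the spread as well as the average (scikit-learn and synthetic data as stand-ins):

    # Report the mean AND the standard deviation across folds, not just the average
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # stratified folds

    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
    print(scores.mean(), scores.std())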
#MLSEV 30
Summing Up
• Choose the metric that makes sense for
your problem
• Use held out data for testing and watch out
for information leakage
• Always do more than one test, varying all
sources of randomness that you have
control over!