2nd edition
#MLSEV 2
Evaluations
All models are wrong, but some are useful
Charles Parker
VP Algorithms, BigML, Inc
#MLSEV 3
My Model Is Wonderful
• I trained a model on my data and it
seems really marvelous!
• How do you know for sure?
• To quantify your model’s
performance, you must evaluate it
• This is not optional. If you don’t
do this and do it right, you’ll have
problems
#MLSEV 4
Proper Evaluation
• Choosing the right metric
• Testing on the right data (which might be harder than you think)
• Replicating your tests
#MLSEV 5
Metric Choice
#MLSEV 6
Proper Evaluation
• The most basic workflow for model evaluation is:
• Split your data into two sets, training and testing
• Train a model on the training data
• Measure the “performance” of the model on the testing data
• If your training data is representative of what you will see in the future, that’s
the performance you should get out of your model
• What do we mean by “performance”? This is where you come in.
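As a concrete illustration of that split / train / measure loop, here is a minimal Python sketch. It uses scikit-learn and a synthetic dataset purely as stand-ins; the slides themselves are not tied to any particular library.

    # Minimal split / train / measure workflow (scikit-learn and synthetic data as stand-ins)
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=1000, random_state=0)   # stand-in dataset

    # 1) Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # 2) Train a model on the training data only
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

    # 3) Measure "performance" on the held-out testing data
    print(accuracy_score(y_test, model.predict(X_test)))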
#MLSEV 7
Medical Testing Example
• Let’s say we develop an ML model that can
diagnose a disease
• About 1 in 1000 people who are tested by
the model turn out to have the disease
• Call the people who have the disease
“sick” and people who don’t have it “well”.
• How well do we do on a test set?
#MLSEV 8
Some Terminology
We’ll define the sick people as “positive” and the well people as “negative”.
• “True Positive”: You’re sick and the model diagnosed you as sick
• “False Positive”: You’re well, but the model diagnosed you as sick
• “True Negative”: You’re well, and the model diagnosed you as well
• “False Negative”: You’re sick, but the model diagnosed you as well
The model is correct in the “true” cases, and incorrect in the “false” cases
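A tiny sketch of the four counts, with made-up labels (1 = sick / positive, 0 = well / negative) just to show the bookkeeping:

    # Counting TP / FP / TN / FN for a binary problem; labels here are illustrative only
    y_true = [1, 0, 0, 1, 0, 0, 1, 0]   # 1 = actually sick, 0 = actually well
    y_pred = [1, 0, 1, 0, 0, 0, 1, 1]   # the model's diagnoses

    TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # sick, diagnosed sick
    FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # well, diagnosed sick
    TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # well, diagnosed well
    FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # sick, diagnosed well
    print(TP, FP, TN, FN)   # 2 2 3 1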
#MLSEV 9
Accuracy
Accuracy = (TP + TN) / Total
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Remember, only 1 in 1000 have the disease
• A silly model which always predicts “well” is 99.9% accurate (as the sketch below shows)
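A quick sketch of that failure mode, with a made-up population of 1,000 people:

    # Accuracy = (TP + TN) / Total misleads badly on unbalanced classes
    y_true = [1] + [0] * 999   # one sick person in 1,000
    y_pred = [0] * 1000        # the "silly" model: always predict well

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy)            # 0.999, yet the model never finds a sick person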
#MLSEV 10
Precision
• How well did we do when we predicted someone was sick?
• A test with high precision has few false positives
• Precision of 1.0 indicates that everyone who we predict is sick is actually sick
• What about people who we predict are well?

Precision = TP / (TP + FP) = 0.6 in the figure’s example
[Figure: sick and well people grouped into “Predicted Sick” and “Predicted Well”]
#MLSEV 11
Recall
• How well did we do when someone was actually sick?
• A test with high recall indicates few false negatives
• Recall of 1.0 indicates that everyone who was actually sick was correctly diagnosed
• But this doesn’t say anything about false positives!

Recall = TP / (TP + FN) = 0.75 in the figure’s example
[Figure: sick and well people grouped into “Predicted Sick” and “Predicted Well”]
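A short sketch of both formulas; the counts TP = 3, FP = 2, FN = 1 are chosen only so the numbers match the 0.6 and 0.75 shown in the figures.

    # Precision and recall from the raw counts (counts chosen to match the figures)
    TP, FP, FN = 3, 2, 1

    precision = TP / (TP + FP)   # of the people we called sick, how many were sick?
    recall    = TP / (TP + FN)   # of the people who were sick, how many did we catch?
    print(precision, recall)     # 0.6 0.75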
#MLSEV 12
Trade Offs
• We can “trivially maximize” both measures
• If you pick the sickest person and only label them sick and no one
else, you can probably get perfect precision
• If you label everyone sick, you are guaranteed perfect recall
• The unfortunate catch is that if you make one perfect, the
other is terrible, so you want a model that has both high
precision and recall
• This is what quantities like the F1 score and Phi
Coefficient try to do
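A sketch of both combined measures computed from the same kind of counts; the specific counts here are illustrative, not from the slides.

    # F1 and the phi (Matthews) coefficient both reward balanced precision and recall
    from math import sqrt

    TP, FP, TN, FN = 3, 2, 994, 1   # illustrative counts

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    phi = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    print(f1, phi)   # roughly 0.667 and 0.669 for these counts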
#MLSEV 13
Cost Matrix
• In many cases, the consequences of a true
positive and a false positive are very different
• You can define “costs” for each type of mistake
• Total Cost = TP * TP_Cost + FP * FP_Cost + FN * FN_Cost + TN * TN_Cost
• Here, we are willing to accept lots of false
positives in exchange for high recall
• What if a positive diagnosis resulted in
expensive or painful treatment?
Cost matrix for the medical diagnosis problem:

                     Classified Sick    Classified Well
    Actually Sick           0                 100
    Actually Well           1                   0
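A sketch of the total-cost computation under this matrix, with hypothetical confusion counts (the counts are not from the slides):

    # Total cost under the slide's cost matrix: false negatives cost 100, false positives cost 1
    costs  = {"TP": 0, "FP": 1, "FN": 100, "TN": 0}
    counts = {"TP": 3, "FP": 40, "FN": 1, "TN": 956}   # hypothetical test-set counts

    total_cost = sum(counts[k] * costs[k] for k in costs)
    print(total_cost)   # 40 * 1 + 1 * 100 = 140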
#MLSEV 14
Operating Thresholds
• Most classifiers don’t output a hard prediction directly. Instead they give a “score” for each
class
• The prediction you assign to an instance is usually a function of a threshold on
this score (e.g., if the score is over 0.5, predict true)
• You can experiment with an ROC curve to see how your metrics will change if
you change the threshold
• Lowering the threshold means you are more likely to predict the positive class, which improves
recall but introduces false positives
• Increasing the threshold means you predict the positive class less often (you are more “picky”),
which will probably increase precision but lower recall.
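A sketch of a threshold sweep, again using scikit-learn and synthetic data as stand-ins; the point is only how a score plus a threshold becomes a prediction.

    # Sweep the operating threshold and watch precision and recall trade off
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    scores = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    for threshold in (0.3, 0.5, 0.7):
        y_pred = (scores >= threshold).astype(int)          # lower threshold => more positives
        TP = int(np.sum((y_te == 1) & (y_pred == 1)))
        FP = int(np.sum((y_te == 0) & (y_pred == 1)))
        FN = int(np.sum((y_te == 1) & (y_pred == 0)))
        precision = TP / (TP + FP) if TP + FP else float("nan")
        print(threshold, precision, TP / (TP + FN))         # precision rises, recall falls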
#MLSEV 15
ROC Curve Example
#MLSEV 16
Holding Out Data
#MLSEV 17
Why Hold Out Data?
• Why do we split the dataset into training and testing sets? Why do we always
(always, always) test on data that the model training process did not see?
• Because machine learning algorithms are good at memorizing data
• We don’t care how well the model does on data it has already seen because it
probably won’t see that data again
• Holding out some of the data simulates the data the model will see in the future
#MLSEV 18
Memorization
Training:

    plasma glucose   bmi    diabetes pedigree   age   diabetes
    148              33.6   0.627               50    TRUE
    85               26.6   0.351               31    FALSE
    183              23.3   0.672               32    TRUE
    89               28.1   0.167               21    FALSE
    137              43.1   2.288               33    TRUE
    116              25.6   0.201               30    FALSE
    78               31     0.248               26    TRUE
    115              35.3   0.134               29    FALSE
    197              30.5   0.158               53    TRUE

Evaluating (the same rows the model has already seen):

    plasma glucose   bmi    diabetes pedigree   age   diabetes
    148              33.6   0.627               50    ?
    85               26.6   0.351               31    ?
• You don’t even need meaningful features;
the person’s name would be enough
• “Oh right, Bob. I know him. Yes, he
certainly has diabetes”
• As long as there are no duplicate names
in the dataset, it's a 100% accurate
model
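The same point as a sketch: a “model” that just memorizes names is perfect on the rows it has seen and has nothing to say about anyone new (the names here are made up).

    # Memorization as a lookup table keyed by the person's name
    training_data = {"Bob": True, "Alice": False, "Carol": True}   # name -> has diabetes?

    def predict(name):
        return training_data.get(name)   # None for anyone the model has never seen

    print(predict("Bob"))    # True  -- "Oh right, Bob. I know him."
    print(predict("Dave"))   # None  -- useless on new people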
#MLSEV 19
Well, That Was Easy
• Okay, so I’m not testing on the training
data, so I’m good, right? NO NO NO
• You also have to worry about information
leakage between training and test data.
• What is this? Let’s try to predict the daily
closing price of the stock market
• What happens if you hold out 10 random
days from your dataset?
• What if you hold out the last 10 days?
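A sketch of those two holdout choices for time-ordered data; `rows` is just a stand-in for one record per trading day, oldest first.

    # Random days leak the future into training; the last 10 days do not
    import random

    rows = [f"day_{i}" for i in range(250)]   # stand-in for 250 daily records, oldest first

    leaky_test_idx = random.sample(range(len(rows)), 10)   # each test day is surrounded by training days
    honest_test    = rows[-10:]                            # train on the past, test on the future
    honest_train   = rows[:-10]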
#MLSEV 20
Traps Everywhere!
• This is common when you have time-distributed
data, but can also happen in other instances:
• Let’s say we have a dataset of 10,000 pictures from 20 people, each labeled with the
year in which it was taken
• We want to predict the year from the image
• What happens if we hold out random data?
• Solution: Hold out users instead
#MLSEV 21
How Do We Avoid This?
• It’s a terrible problem, because if you make the mistake you will get results
that are too good, and be inclined to believe them
• So be careful. Do you have:
• Data where points can be grouped in time (by week or by month)?
• Data where points can be grouped by user (each point is an action a user took)?
• Data where points can be grouped by location (each point is a day of sales at a particular store)?
• Even if you’re only suspicious that points from the same group might leak information to
one another, try a test where you hold out a few whole groups (months, users,
locations) and train on the rest, as in the sketch below
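A sketch of a grouped holdout using scikit-learn's GroupKFold; the data and the 20 groups (think: 20 users) are synthetic stand-ins.

    # Hold out whole groups so no user appears in both training and testing
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GroupKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    groups = np.repeat(np.arange(20), 50)          # 20 "users", 50 rows each

    for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        print(model.score(X[test_idx], y[test_idx]))   # every test row comes from unseen groups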
#MLSEV 22
Do It Again!
#MLSEV 23
One Test is Not Enough
• Even if you have a correct holdout, you still need to test more than once.
• Every result you get from any test is partly a result of randomness
• Randomness from the Data:
• The dataset you have is a finite sample drawn from an underlying distribution
• The split you make between training and test data is done at random
• Randomness of the algorithm
• The ordering of the data might give different results
• The best-performing algorithms (random forests, deepnets) have randomness built in
• With just one result, you might get lucky
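A sketch of repeating the same evaluation while reseeding both the split and the learner (scikit-learn and synthetic data as stand-ins):

    # Ten evaluations, each with a different seed for the split and the algorithm
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    scores = []
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
        model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    print(min(scores), max(scores))   # the spread matters as much as any single number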
#MLSEV 24
One Test is Not Enough
[Figure: the result of a single test on a performance axis, annotated “Really nice result!”]
#MLSEV 25
One Test is Not Enough
[Figure: the likelihood distribution over performance, showing the “really nice result” was really just a lucky one]
#MLSEV 26
Comparing Models is Even Worse
#MLSEV 27
Comparing Models is Even Worse
#MLSEV 28
Comparing Models is Even Worse
[Figure: model comparison results grouped by the first digit of the random seed]
#MLSEV 29
Please, Sir, Can I Have Some More?
• Always do more than one test!
• For each test, vary every source of randomness you can (change the seeds of all
random processes) so that you “experience” as much of the variance as possible
• Cross-validation helps (stratifying is great; Monte Carlo cross-validation can be a
useful simplification)
• Don’t just average the results! The variance is
important!
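A sketch of stratified cross-validation that reports the spread as well as the average (scikit-learn and synthetic data as stand-ins):

    # Report the mean AND the standard deviation across folds, not just the average
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # stratified folds

    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
    print(scores.mean(), scores.std())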
#MLSEV 30
Summing Up
• Choose the metric that makes sense for
your problem
• Use held out data for testing and watch out
for information leakage
• Always do more than one test, varying all
sources of randomness that you have
control over!