Presentation from PyData Warsaw 2019 by Jakub Czakon.
Choosing a proper metric is a crucial yet difficult part of the machine learning project. In this talk, you will learn about a number of common and lesser-known metrics and performance charts as well as typical decisions when it comes to choosing one for your project. Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects!
4. Intro
Evaluation Metric:
● is a model performance indicator/proxy
● (strongly) depends on the problem
● rarely maps 1-1 to your business problem
● is not a guarantee of performance on other metrics
10. False Positive Rate | Type I error
Class-based metrics
What is it?
● When we predict the positive class for something that isn’t
● Fraction of actual negatives that become false alerts
[Confusion matrix diagram: Predicted (0/1) vs Actual (0/1) with True/False Positives and Negatives]
11. False Positive Rate | Type I error
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix

# y_pred_pos: predicted probability of the positive class
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
12. False Positive Rate | Type I error
Class-based metrics
How to choose a threshold?
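Not on the original slide: a minimal sketch (assuming y_true and y_pred_pos as in the other examples) that plots FPR against the classification threshold, one way to pick a threshold that matches your tolerance for false alerts.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_pred_pos)
fig, ax = plt.subplots()
ax.plot(thresholds[1:], fpr[1:])  # drop the first, artificial threshold
ax.set_xlabel('threshold')
ax.set_ylabel('false positive rate')
plt.show()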
13. False Positive Rate | Type I error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of dealing with an alert is high
14. False Negative Rate | Type II error
Class-based metrics
What is it?
● When we don’t predict the positive class when it actually is positive
● Fraction of missed fraudulent transactions
15. False Negative Rate | Type II error
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
16. False Negative Rate | Type II error
Class-based metrics
How to choose a threshold?
17. False Negative Rate | Type II error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of missing a positive (e.g., letting a fraudulent transaction through) is high
18. True Negative Rate | Specificity
Class-based metrics
What is it?
● how good we are at predicting the negative class
● same axis of the confusion matrix as False Positive Rate (TNR = 1 − FPR)
● fraction of non-fraudulent transactions marked as clean
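How to calculate it? (not in this transcript; a minimal sketch following the pattern of the other metrics, assuming y_true, y_pred_pos and threshold as before)
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)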
20. True Negative Rate | Specificity
Class-based metrics
How to choose a threshold?
21. True Negative Rate | Specificity
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● When you want to feel good when you say
“you are healthy” or “this transaction is clean”
22. True Positive Rate | Recall | Sensitivity
Class-based metrics
What is it?
● how good we are at finding positive class members
● put all the guilty in prison
● same axis of the confusion matrix as False Negative Rate (TPR = 1 − FNR)
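How to calculate it? (not in this transcript; a minimal sketch following the pattern of the other metrics)
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)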
24. True Positive Rate | Recall | Sensitivity
Class-based metrics
How to choose a threshold?
25. True Positive Rate | Recall | Sensitivity
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● You want to catch all fraudulent transactions
● False alerts are cheap to process
26. Positive Predictive Value | Precision
Class-based metrics
What is it?
● how accurate are we when we say
positive class
● Only guilty people should be in prison
27. Positive Predictive Value | Precision
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp/ (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
29. Positive Predictive Value | Precision
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● False alerts are expensive to process
● You want to catch only fraudulent transactions
30. Accuracy
Class-based metrics
What is it?
● Fraction of correctly classified
observations (positive and negative)
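How to calculate it? (not in this transcript; a minimal sketch following the pattern of the other metrics)
from sklearn.metrics import accuracy_score
y_pred_class = y_pred_pos > threshold
accuracy_score(y_true, y_pred_class)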
33. Accuracy
Class-based metrics
When to use it?
● When your problem is balanced
● When every class is equally important to you
● When you need something easy-to-explain to stakeholders
34. F score
Class-based metrics
What is it?
● Combines Precision and Recall into one score
● Weighted harmonic mean of Precision and Recall
● Doesn’t care about True Negatives
● F1 score (beta=1) -> harmonic mean
● F2 score (beta=2) -> twice as much emphasis on recall
35. F score
Class-based metrics
How to calculate it?
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta)
39. F score
Class-based metrics
When to use it?
F1 score
● my go-to metric for binary classification
● easy-to-explain to stakeholders
F2 score
● When you need to shift the precision/recall tradeoff toward recall
● when catching fraudulent transactions is more important than being precise about the alerts
40. Cohen Kappa
Class-based metrics
What is it?
● how much better your model is than a random classifier
● Observed agreement p0: accuracy
● Expected agreement pe: accuracy of the random classifier
● Random classifier: samples randomly according to class frequencies
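Not on the slide: a minimal sketch of the definition for the binary case with 0/1 labels, using the observed (p0) and expected (pe) agreement described above.
import numpy as np
y_true_arr = np.asarray(y_true)                 # assumed 0/1 labels
y_pred_class = np.asarray(y_pred_pos) > threshold
p0 = np.mean(y_true_arr == y_pred_class)        # observed agreement = accuracy
pos_true, pos_pred = y_true_arr.mean(), y_pred_class.mean()
pe = pos_true * pos_pred + (1 - pos_true) * (1 - pos_pred)  # expected agreement of the random classifier
kappa = (p0 - pe) / (1 - pe)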
41. Cohen Kappa
Class-based metrics
How to calculate it?
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
43. Cohen Kappa
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need a
metric that is easy to explain
44. Matthews Correlation Coefficient
Class-based metrics
What is it?
● Correlation between predicted classes
and the ground truth labels
45. Matthews Correlation Coefficient
Class-based metrics
How to calculate it?
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
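Not on the slide: the underlying formula, written out from the confusion matrix entries.
from sklearn.metrics import confusion_matrix
import math
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))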
47. Matthews Correlation Coefficient
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need
a metric that is easy to explain
48. Dollar-focused metrics
Class-based metrics
● Get the cost of a False Negative (letting a fraudulent transaction through)
● Get the cost of a False Positive (blocking a clean transaction)
● Find the optimal threshold in dollars $ -> optimize the business problem directly (see the sketch below)
● Blog post from Airbnb (link)
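Not from the talk: a minimal sketch of a dollar-optimal threshold search, assuming y_true and y_pred_pos as in the earlier examples and made-up example costs.
import numpy as np
from sklearn.metrics import confusion_matrix

cost_fn = 100.0  # assumed cost of letting a fraudulent transaction through
cost_fp = 5.0    # assumed cost of blocking a clean transaction

thresholds = np.linspace(0.01, 0.99, 99)
total_costs = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_pos > t).ravel()
    total_costs.append(fn * cost_fn + fp * cost_fp)
best_threshold = thresholds[int(np.argmin(total_costs))]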
49. Fairness metrics
Class-based metrics
● Divide your dataset into groups based on a protected feature (race, sex, etc.) -> get privileged and unprivileged groups
● Calculate the True Positive Rate for both groups
● Calculate the difference in TPR -> get the Equality of Opportunity metric (see the sketch below)
● Blog post on fairness (link)
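Not from the talk: a minimal sketch, assuming a boolean array `group` (True for the privileged group) aligned with y_true and y_pred_class from the earlier examples.
import numpy as np

y_true_arr = np.asarray(y_true).astype(bool)
y_pred_arr = np.asarray(y_pred_class).astype(bool)
group = np.asarray(group, dtype=bool)

def tpr(actual, predicted):
    return (actual & predicted).sum() / actual.sum()  # TP / (TP + FN)

equality_of_opportunity = tpr(y_true_arr[group], y_pred_arr[group]) - tpr(y_true_arr[~group], y_pred_arr[~group])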
51. ROC curve
Score-based metrics
What is it?
● visualizes the tradeoff between true
positive rate (TPR) and false positive
rate (FPR).
● for every threshold, we calculate TPR
and FPR and plot it on one chart.
52. ROC curve
Score-based metrics
How to plot it?
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc

# y_pred: predicted probabilities for both classes, shape (n_samples, 2)
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
53. ROC curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Care equally about both positive and negative class
54. ROC AUC score
Score-based metrics
What is it?
● One number that summarizes ROC curve
● Area under the ROC curve (integral)
● Alternatively, rank correlation between
predictions and targets (link)
● Looks at the entire confusion matrix
55. ROC AUC score
Score-based metrics
How to calculate it?
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_pred_pos)
56. ROC AUC score
Score-based metrics
When to use it?
● You care about ranking predictions (not about getting calibrated probabilities)
● Do not use when data heavily imbalanced and you care only about the positive class (link)
● When you care equally about the positive and negative classes
57. Precision Recall curve
Score-based metrics
What is it?
● visualizes the tradeoff between
precision and recall
● for every threshold, we calculate
precision and recall and plot it on one
chart
58. Precision Recall curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
59. Precision Recall curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Want to find a good threshold for class assignment
● Care more about the positive class
60. PR AUC score | Average precision
Score-based metrics
What is it?
● One number that summarizes Precision
Recall curve
● Area under the Precision Recall curve
(integral)
● Doesn’t look at True Negatives!
61. PR AUC score | Average precision
Score-based metrics
How to calculate it?
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
62. PR AUC score | Average precision
Score-based metrics
When to use it?
● Data is heavily imbalanced and you care only about the positive class (link)
● You care more about the positive than negative class
● you want to choose the threshold that fits the business problem
● you want to communicate precision/recall decision
65. Brier score
Score-based metrics
When to use it?
● When you care about calibrated probabilities
● Why care about calibration?
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = [0, 1, 1, 0, 1, 1, 1, 0]
y_pred_v1 = [0.28, 0.35, 0.32, 0.29, 0.34, 0.38, 0.37, 0.31]
y_pred_v2 = [0.18, 0.95, 0.92, 0.19, 0.94, 0.98, 0.97, 0.21]

roc_auc_score(y_true, y_pred_v1), roc_auc_score(y_true, y_pred_v2)          # 1.000, 1.000
brier_score_loss(y_true, y_pred_v1), brier_score_loss(y_true, y_pred_v2)    # 0.295, 0.0158

Both prediction sets rank the examples perfectly (identical ROC AUC), but only v2 gives calibrated probabilities, which the Brier score rewards.
66. Cumulative gain chart
Score-based metrics
What is it?
● Shows how much your model gains
over the random model
● Calculate it by:
○ Order predictions
○ Calculate fraction of True
Positives for your model and
random model
○ Plot them on one chart
67. Cumulative gain chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
68. Cumulative gain chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
69. Lift chart
Score-based metrics
What is it?
● Shows how much your model gains over
the random model
● Calculate it by:
○ Order predictions
○ Calculate fraction of True Positives
for your model and random model
○ Calculate the ratio of your model
gains over random model
○ Plot them on one chart
70. Lift chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
71. Lift chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
73. Precision vs Recall
Common This vs That
Precision
● Looks at True Positives and False Positives
● Only the guilty should be in jail
● Higher threshold
○ fewer predicted positives
○ higher precision
○ lower recall
Recall
● Looks at True Positives and False Negatives
● Put all the guilty in jail
● Lower threshold
○ more predicted positives
○ higher recall
○ lower precision
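Not from the talk: a quick way to see this tradeoff on your own predictions, assuming y_true and y_pred_pos as in the earlier examples.
from sklearn.metrics import precision_score, recall_score

for t in [0.3, 0.5, 0.7]:  # arbitrary example thresholds
    y_pred_class = y_pred_pos > t
    print(t, precision_score(y_true, y_pred_class), recall_score(y_true, y_pred_class))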
74. F1 score vs Accuracy
Common This vs That
F1 score
● Doesn’t look at True Negatives
● You care more about the positive class
● Balances Precision and Recall
Accuracy
● Looks at all elements of the confusion matrix
● You care equally about the positive and negative class
● Very bad for imbalanced problems
75. ROC AUC vs PR AUC
Common This vs That
ROC AUC
● Looks at all elements of the confusion matrix (over all thresholds)
● You care equally about the positive and negative class
● Data is not heavily imbalanced
PR AUC
● Doesn’t look at True Negatives (over all thresholds)
● You care more about the positive class
● Data is heavily imbalanced
76. F1 score vs ROC AUC
Common This vs That
F1 score
● Class-based metric -> you need to choose a threshold
● You care more about the positive than the negative class
● Doesn’t look at True Negatives
● Easier to communicate
ROC AUC
● Score-based metric
● Good for ranking predictions
● You care about both the positive and negative class
● Your data is not heavily imbalanced
78. Materials
● Slides available on twitter (@NeptuneML) linkedin/slideshare (link)
● Github repository github.com/neptune-ml (link)
● Blog post “24 Evaluation Metrics for Binary Classification (And When to Use Them)”
● Blog post “F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?”
Extras