Evaluation Metrics for Binary
Classification: The Ultimate Guide
jakub@neptune.ml
@NeptuneML
https://neptune.ml/blog
Jakub Czakon
● Intro
● Class-based metrics
● Score-based metrics
● Common “This vs That”
● Extras
Agenda
Intro
Intro
Evaluation Metric:
● is a model performance indicator/proxy
● (strongly) depends on the problem
● rarely maps 1-1 to your business problem
● is not a guarantee of performance on other metrics
Intro
Class-based metrics Score-based metrics
import lightgbm

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
y_test_pred = model.predict_proba(X_test)[:, 1]  # scores for the positive class
y_test_class = y_test_pred > threshold           # hard class assignment

metric(y_test_true, y_test_class)  # class-based metrics take class assignments
metric(y_test_true, y_test_pred)   # score-based metrics take raw scores
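For example, a minimal sketch with recall_score as the class-based metric and roc_auc_score as the score-based one:

from sklearn.metrics import recall_score, roc_auc_score

recall_score(y_test_true, y_test_class)  # class-based: depends on the chosen threshold
roc_auc_score(y_test_true, y_test_pred)  # score-based: works on the raw scores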
Class-based metrics
Class-based metrics
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
What is it?
Confusion Matrix
Confusion Matrix
Class-based metrics
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots()
cm = confusion_matrix(y_true, y_pred_class)
sns.heatmap(cm, cmap=plt.get_cmap('Blues'), annot=True, fmt='g', ax=ax)
ax.set_xlabel('predicted values')
ax.set_ylabel('actual values')
How to plot it?
Confusion Matrix
Class-based metrics
● Pretty much always
● I prefer raw counts to normalized values (easier to spot problems)
When to use it?
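If you also want the normalized view, a minimal sketch reusing the cm array from the snippet above (row-normalized, so each actual class sums to 1):

fig, ax = plt.subplots()
cm_normalized = cm / cm.sum(axis=1, keepdims=True)  # fractions per actual class
sns.heatmap(cm_normalized, cmap=plt.get_cmap('Blues'), annot=True, fmt='.2f', ax=ax)
ax.set_xlabel('predicted values')
ax.set_ylabel('actual values')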
False Positive Rate | Type I error
Class-based metrics
What is it?
● When we predict something that isn’t there
● Fraction of false alerts
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
False Positive Rate | Type I error
Class-based metrics
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
How to calculate it?
False Positive Rate | Type I error
Class-based metrics
How to choose a threshold?
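One simple recipe is to sweep the threshold, plot the metric against it, and pick the point where the false alert rate is acceptable. A minimal sketch, assuming y_true and positive-class scores y_pred_pos are numpy arrays:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

thresholds = np.linspace(0.01, 0.99, 99)
fprs = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_true, (y_pred_pos > t).astype(int)).ravel()
    fprs.append(fp / (fp + tn))
plt.plot(thresholds, fprs)
plt.xlabel('threshold')
plt.ylabel('false positive rate')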
False Positive Rate | Type I error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of dealing with an alert is high
False Negative Rate | Type II error
Class-based metrics
What is it?
● When we don’t predict something that is actually there
● fraction of missed fraudulent transactions
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
False Negative Rate | Type II error
Class-based metrics
How to calculate it?
False Negative Rate | Type II error
Class-based metrics
How to choose a threshold?
False Negative Rate | Type II error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of missing a positive case (e.g. a fraudulent transaction) is high
True Negative Rate | Specificity
Class-based metrics
What is it?
● how good we are at predicting the negative class
● same axis as False Positive Rate
● How many non-fraudulent transactions are marked as clean
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
True Negative Rate | Specificity
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)
True Negative Rate | Specificity
Class-based metrics
How to choose a threshold?
True Negative Rate | Specificity
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● When you want to feel good when you say
“you are healthy” or “this transaction is clean”
True Positive Rate | Recall | Sensitivity
Class-based metrics
What is it?
● how good we are at finding positive class members
● put all guilty in prison
● same axis as False Negative Rate
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
True Positive Rate | Recall | Sensitivity
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)
True Positive Rate | Recall | Sensitivity
Class-based metrics
How to choose a threshold?
True Positive Rate | Recall | Sensitivity
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● You want to catch all fraudulent transactions
● False alerts are cheap to process
Positive Predictive Value | Precision
Class-based metrics
What is it?
● how accurate we are when we predict the positive class
● Only guilty people should be in prison
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
Positive Predictive Value | Precision
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp / (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
Positive Predictive Value | Precision
Class-based metrics
How to choose a threshold?
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● False alerts are expensive to process
● You want to catch only fraudulent transactions
Positive Predictive Value | Precision
Accuracy
Class-based metrics
What is it?
● Fraction of correctly classified observations (positive and negative)
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
Accuracy
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
# or simply
accuracy_score(y_true, y_pred_class)
Accuracy
Class-based metrics
How to choose a threshold?
Accuracy
Class-based metrics
When to use it?
● When your problem is balanced
● When every class is equally important to you
● When you need something easy-to-explain to stakeholders
F score
Class-based metrics
What is it?
● Combines Precision and Recall into one score
● Weighted harmonic mean
● Doesn’t care about True Negatives
● F1 score (beta=1) -> harmonic mean of precision and recall
● F2 score (beta=2) -> twice as much emphasis on recall
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
F score
Class-based metrics
How to calculate it?
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta)
F score
Class-based metrics
How to choose a threshold?
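One common recipe: compute precision and recall for every threshold and keep the threshold that maximizes the F score. A minimal sketch, assuming positive-class scores y_pred_pos and a chosen beta:

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_true, y_pred_pos)
# the last precision/recall point has no corresponding threshold, so drop it
f_scores = ((1 + beta**2) * precision[:-1] * recall[:-1]
            / (beta**2 * precision[:-1] + recall[:-1] + 1e-12))
best_threshold = thresholds[np.argmax(f_scores)]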
F score
Class-based metrics
When to use it?
F1 score
● my go-to metric for binary classification
● easy-to-explain to stakeholders
F2 score
● When you need to adjust the precision-recall tradeoff
● when finding fraudulent transactions is more important than being precise about flagging them
Cohen Kappa
Class-based metrics
What is it?
● how much better your model is than a random classifier
● Observed agreement p0: accuracy
● Expected agreement pe: accuracy of the random classifier
● Random classifier: samples randomly according to class frequencies
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
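Spelled out from the confusion matrix, a minimal sketch of the definition kappa = (p0 - pe) / (1 - pe):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
n = tn + fp + fn + tp
p0 = (tp + tn) / n  # observed agreement = accuracy
pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # expected agreement of the random classifier
kappa = (p0 - pe) / (1 - pe)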
Cohen Kappa
Class-based metrics
How to calculate it?
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
Cohen Kappa
Class-based metrics
How to choose a threshold?
Cohen Kappa
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need a
metric that is easy to explain
Matthews Correlation Coefficient
Class-based metrics
What is it?
● Correlation between predicted classes and the ground truth labels
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
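Equivalently, a minimal sketch computing it straight from the confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))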
Matthews Correlation Coefficient
Class-based metrics
How to calculate it?
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
Matthews Correlation Coefficient
Class-based metrics
How to choose a threshold?
Matthews Correlation Coefficient
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need
a metric that is easy to explain
Dollar-focused metrics
Class-based metrics
● Get the cost of a False Negative (letting a fraudulent transaction through)
● Get the cost of a False Positive (blocking a clean transaction)
● Find the optimal threshold in dollars $ -> optimize the business problem directly (see the sketch below)
● Blog post from Airbnb (link)
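A minimal sketch of that threshold search, with hypothetical COST_FP and COST_FN values and positive-class scores y_pred_pos:

import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP, COST_FN = 10.0, 200.0  # hypothetical per-error costs in dollars
thresholds = np.linspace(0.01, 0.99, 99)
total_costs = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_true, (y_pred_pos > t).astype(int)).ravel()
    total_costs.append(fp * COST_FP + fn * COST_FN)
best_threshold = thresholds[np.argmin(total_costs)]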
Fairness metrics
Class-based metrics
● Divide your dataset into groups based on protected feature (race, sex, etc) ->
get privileged and unprivileged groups
● Calculate True Positive Rate for both groups
● Calculate the difference in TPR -> the Equality of Opportunity metric (see the sketch below)
● Blog post on fairness (link)
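A minimal sketch of the TPR difference, assuming numpy arrays and a hypothetical boolean mask `privileged` marking the privileged group:

import numpy as np

def tpr(y_true, y_pred_class):
    positives = y_true == 1
    return np.mean(y_pred_class[positives] == 1)

equality_of_opportunity = (tpr(y_true[privileged], y_pred_class[privileged])
                           - tpr(y_true[~privileged], y_pred_class[~privileged]))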
Score-based metrics
ROC curve
Score-based metrics
What is it?
● visualizes the tradeoff between true
positive rate (TPR) and false positive
rate (FPR).
● for every threshold, we calculate TPR
and FPR and plot it on one chart.
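The per-threshold values come straight out of scikit-learn. A minimal sketch, assuming positive-class scores y_pred_pos:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_pred_pos)
plt.plot(fpr, tpr)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')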
ROC curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
ROC curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Care equally about both positive and negative class
ROC AUC score
Score-based metrics
What is it?
● One number that summarizes ROC curve
● Area under the ROC curve (integral)
● Alternatively, rank correlation between
predictions and targets (link)
● Looks at the entire confusion matrix
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
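The rank interpretation is easy to check: ROC AUC is (up to ties) the probability that a randomly drawn positive gets a higher score than a randomly drawn negative. A minimal sketch, assuming numpy arrays:

import numpy as np

pos_scores = y_pred_pos[y_true == 1]
neg_scores = y_pred_pos[y_true == 0]
fraction_correctly_ranked = np.mean(pos_scores[:, None] > neg_scores[None, :])
# up to ties, this equals roc_auc_score(y_true, y_pred_pos)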
ROC AUC score
Score-based metrics
How to calculate it?
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_pred_pos)
ROC AUC score
Score-based metrics
When to use it?
● You care about ranking predictions (not about getting calibrated probabilities)
● Do not use when data heavily imbalanced and you care only about the positive class (link)
● When you care equally about the positive and negative classes
Precision Recall curve
Score-based metrics
What is it?
● visualizes the tradeoff between
precision and recall
● for every threshold, we calculate
precision and recall and plot it on one
chart
Precision Recall curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
Precision Recall curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Want to find a good threshold for class assignment
● Care more about the positive class
PR AUC score | Average precision
Score-based metrics
What is it?
● One number that summarizes Precision
Recall curve
● Area under the Precision Recall curve
(integral)
● Doesn’t look at True Negatives!
[Confusion matrix diagram: Actual vs Predicted with True Negative, False Positive, False Negative, True Positive cells]
PR AUC score | Average precision
Score-based metrics
How to calculate it?
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
PR AUC score | Average precision
Score-based metrics
When to use it?
● Data is heavily imbalanced and you care only about the positive class (link)
● You care more about the positive than negative class
● you want to choose the threshold that fits the business problem
● you want to communicate precision/recall decision
Brier score
Score-based metrics
What is it?
● Mean squared distance from your predicted probabilities to the true values (labels)
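In other words, for binary labels it is the mean squared error between the predicted probability and the 0/1 label; a minimal sketch:

import numpy as np

brier = np.mean((np.asarray(y_pred_pos) - np.asarray(y_true)) ** 2)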
Brier score
Score-based metrics
How to calculate it?
from sklearn.metrics import brier_score_loss
brier_score_loss(y_true, y_pred_pos)
Brier score
Score-based metrics
When to use it?
● When you care about calibrated probabilities
● Why care about calibration?
y_true = [0, 1, 1, 0, 1, 1, 1, 0]
y_pred_v1 = [0.28, 0.35, 0.32, 0.29, 0.34, 0.38, 0.37, 0.31]
y_pred_v2 = [0.18, 0.95, 0.92, 0.19, 0.94, 0.98, 0.97, 0.21]
roc_auc_score(y_true, y_pred_v1), roc_auc_score(y_true, y_pred_v2)        # 1.000, 1.000
brier_score_loss(y_true, y_pred_v1), brier_score_loss(y_true, y_pred_v2)  # 0.295, 0.0158
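If your model's scores are not calibrated (like y_pred_v1 above), one common fix is to wrap it in a calibrator; a minimal sketch, assuming scikit-learn's CalibratedClassifierCV:

from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(model, method='isotonic', cv=3)
calibrated_model.fit(X_train, y_train)
y_pred_calibrated = calibrated_model.predict_proba(X_test)[:, 1]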
Cumulative gain chart
Score-based metrics
What is it?
● Shows how much your model gains
over the random model
● Calculate it by:
○ Order predictions
○ Calculate fraction of True
Positives for your model and
random model
○ Plot them on one chart
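A minimal sketch of those steps, assuming numpy arrays y_true and positive-class scores y_pred_pos:

import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(y_pred_pos)[::-1]  # sort observations by score, highest first
gains = np.cumsum(y_true[order]) / np.sum(y_true)  # fraction of all positives captured so far
fraction_examined = np.arange(1, len(y_true) + 1) / len(y_true)
plt.plot(fraction_examined, gains)               # your model
plt.plot(fraction_examined, fraction_examined)   # random model baseline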
Cumulative gain chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
Cumulative gain chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
Lift chart
Score-based metrics
What is it?
● Shows how much your model gains over
the random model
● Calculate it by:
○ Order predictions
○ Calculate fraction of True Positives
for your model and random model
○ Calculate the ratio of your model
gains over random model
○ Plot them on one chart
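A minimal sketch of the ratio step, reusing gains and fraction_examined from the cumulative gain sketch above:

import matplotlib.pyplot as plt

lift = gains / fraction_examined  # how many times better than the random model at each cutoff
plt.plot(fraction_examined, lift)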
Lift chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
Lift chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
Common This vs That
Precision vs Recall
Common This vs That
● Looks at True Positives and False
Positives
● Only guilty should be in jail
● Higher threshold
○ less predicted positives
○ higher precision
○ lower recall
Precision
● Looks at True Positives and False
Negatives
● Put all guilty in jail
● Lower threshold
○ more predicted positives
○ higher recall
○ lower precision
Recall
F1 score vs Accuracy
Common This vs That
● Doesn’t look at True Negatives
● You care more about the positive
class
● Balances Precision and Recall
F1 score
● Looks at all elements from the
confusion matrix
● You care equally about positive
and negative class
● Very bad for imbalanced problems
Accuracy
ROC AUC vs PR AUC
Common This vs That
● Looks at all elements from the
confusion matrix (for all
thresholds)
● You care equally about positive
and negative class
● Data is not heavily imbalanced
ROC AUC
● Doesn’t look at True Negatives (for
all thresholds)
● You care more about positive
class
● Data is heavily imbalanced
PR AUC
F1 score vs ROC AUC
Common This vs That
● Class-based metric -> you need to
choose a threshold
● You care more about positive than
negative class
● Doesn’t look at True Negatives
● Easier to communicate
F1 score
● Score based metric
● Good for ranking predictions
● You care both about positive and
negative class
● Your data is not heavily imbalanced
ROC AUC
Extras
Materials
● Slides available on twitter (@NeptuneML) linkedin/slideshare (link)
● Github repository github.com/neptune-ml (link)
● Blog post “24 Evaluation Metrics for Binary Classification (And
When to Use Them)”
● Blog post “F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which
Evaluation Metric Should You Choose?”
Extras
Metrics cheatsheet
● Download the cheatsheet (link)
Extras
Binary metrics logger
● Log binary metrics with one
function call
● pip install neptune-contrib (link)
● example (link)
Extras
import neptune
import neptunecontrib.monitoring.metrics as npt_metrics
neptune.create_experiment()
npt_metrics.log_binary_classification_metrics(y_true, y_pred)
Fairness metrics logger
● Log fairness metrics with one function call
● pip install neptune-contrib (link)
● example (link)
Extras
import neptune
import neptunecontrib.monitoring.fairness as npt_fair
neptune.create_experiment()
npt_fair.log_fairness_classification_metrics(
y_true, y_pred, y_class, x_protected,
favorable_label=0, unfavorable_label=1,
privileged_groups={'race':[3]},
unprivileged_groups={'race':[1,2,4]})
The most lightweight experiment
management tool that fits any workflow.
jakub@neptune.ml
@NeptuneML
https://neptune.ml/blog
Jakub Czakon
