Presentation from PyData Warsaw 2019 by Jakub Czakon.
Choosing a proper metric is a crucial yet difficult part of the machine learning project. In this talk, you will learn about a number of common and lesser-known metrics and performance charts as well as typical decisions when it comes to choosing one for your project. Hopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects!
4. Intro
Evaluation Metric:
● is a model performance indicator/proxy
● (strongly) depends on the problem
● rarely maps 1-1 to your business problem
● is not a guarantee of performance on other metrics
10. False Positive Rate | Type I error
Class-based metrics
What is it?
● When we predict the positive class for something that isn’t
● Fraction of actual negatives that become false alerts
[Confusion matrix diagram: Predicted (0/1) vs Actual (0/1) with True/False Positives and Negatives]
11. False Positive Rate | Type I error
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix

# y_pred_pos: predicted probability of the positive class
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
12. False Positive Rate | Type I error
Class-based metrics
How to choose a threshold?
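Not on the original slide: a minimal sketch (assuming y_true and y_pred_pos as in the other examples) that plots FPR against the classification threshold, one way to pick a threshold that matches your tolerance for false alerts.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_pred_pos)
fig, ax = plt.subplots()
ax.plot(thresholds[1:], fpr[1:])  # drop the first, artificial threshold
ax.set_xlabel('threshold')
ax.set_ylabel('false positive rate')
plt.show()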
13. False Positive Rate | Type I error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of dealing with an alert is high
14. False Negative Rate | Type II error
Class-based metrics
What is it?
● When we don’t predict the positive class when it actually is positive
● Fraction of missed fraudulent transactions
15. False Negative Rate | Type II error
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
16. False Negative Rate | Type II error
Class-based metrics
How to choose a threshold?
17. False Negative Rate | Type II error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of missing a positive (e.g., letting a fraudulent transaction through) is high
18. True Negative Rate | Specificity
Class-based metrics
What is it?
● how good we are at predicting the negative class
● same axis of the confusion matrix as False Positive Rate (TNR = 1 − FPR)
● fraction of non-fraudulent transactions marked as clean
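How to calculate it? (not in this transcript; a minimal sketch following the pattern of the other metrics, assuming y_true, y_pred_pos and threshold as before)
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)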
20. True Negative Rate | Specificity
Class-based metrics
How to choose a threshold?
21. True Negative Rate | Specificity
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● When you want to feel good when you say
“you are healthy” or “this transaction is clean”
22. True Positive Rate | Recall | Sensitivity
Class-based metrics
What is it?
● how good we are at finding positive class members
● put all the guilty in prison
● same axis of the confusion matrix as False Negative Rate (TPR = 1 − FNR)
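How to calculate it? (not in this transcript; a minimal sketch following the pattern of the other metrics)
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)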
24. True Positive Rate | Recall | Sensitivity
Class-based metrics
How to choose a threshold?
25. True Positive Rate | Recall | Sensitivity
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● You want to catch all fraudulent transactions
● False alerts are cheap to process
26. Positive Predictive Value | Precision
Class-based metrics
What is it?
● how accurate are we when we say
positive class
● Only guilty people should be in prison
27. Positive Predictive Value | Precision
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp/ (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
29. Positive Predictive Value | Precision
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● False alerts are expensive to process
● You want to catch only fraudulent transactions
30. Accuracy
Class-based metrics
What is it?
● Fraction of correctly classified
observations (positive and negative)
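How to calculate it? (not in this transcript; a minimal sketch following the pattern of the other metrics)
from sklearn.metrics import accuracy_score
y_pred_class = y_pred_pos > threshold
accuracy_score(y_true, y_pred_class)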
33. Accuracy
Class-based metrics
When to use it?
● When your problem is balanced
● When every class is equally important to you
● When you need something easy-to-explain to stakeholders
34. F score
Class-based metrics
What is it?
● Combines Precision and Recall into one score
● Weighted harmonic mean of Precision and Recall
● Doesn’t care about True Negatives
● F1 score (beta=1) -> harmonic mean
● F2 score (beta=2) -> twice as much emphasis on recall
35. F score
Class-based metrics
How to calculate it?
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta)
39. F score
Class-based metrics
When to use it?
F1 score
● my go-to metric for binary classification
● easy-to-explain to stakeholders
F2 score
● When you need to shift the precision/recall tradeoff toward recall
● when catching fraudulent transactions is more important than being precise about the alerts
40. Cohen Kappa
Class-based metrics
What is it?
● how much better your model is than a random classifier
● Observed agreement p0: accuracy
● Expected agreement pe: accuracy of the random classifier
● Random classifier: samples randomly according to class frequencies
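Not on the slide: a minimal sketch of the definition for the binary case with 0/1 labels, using the observed (p0) and expected (pe) agreement described above.
import numpy as np
y_true_arr = np.asarray(y_true)                 # assumed 0/1 labels
y_pred_class = np.asarray(y_pred_pos) > threshold
p0 = np.mean(y_true_arr == y_pred_class)        # observed agreement = accuracy
pos_true, pos_pred = y_true_arr.mean(), y_pred_class.mean()
pe = pos_true * pos_pred + (1 - pos_true) * (1 - pos_pred)  # expected agreement of the random classifier
kappa = (p0 - pe) / (1 - pe)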
41. Cohen Kappa
Class-based metrics
How to calculate it?
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
43. Cohen Kappa
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need a
metric that is easy to explain
44. Matthews Correlation Coefficient
Class-based metrics
What is it?
● Correlation between predicted classes
and the ground truth labels
45. Matthews Correlation Coefficient
Class-based metrics
How to calculate it?
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
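Not on the slide: the underlying formula, written out from the confusion matrix entries.
from sklearn.metrics import confusion_matrix
import math
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))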
47. Matthews Correlation Coefficient
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need
a metric that is easy to explain
48. Dollar-focused metrics
Class-based metrics
● Get the cost of a False Negative (letting a fraudulent transaction through)
● Get the cost of a False Positive (blocking a clean transaction)
● Find the optimal threshold in dollars $ -> optimize the business problem directly (see the sketch below)
● Blog post from Airbnb (link)
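Not from the talk: a minimal sketch of a dollar-optimal threshold search, assuming y_true and y_pred_pos as in the earlier examples and made-up example costs.
import numpy as np
from sklearn.metrics import confusion_matrix

cost_fn = 100.0  # assumed cost of letting a fraudulent transaction through
cost_fp = 5.0    # assumed cost of blocking a clean transaction

thresholds = np.linspace(0.01, 0.99, 99)
total_costs = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_pos > t).ravel()
    total_costs.append(fn * cost_fn + fp * cost_fp)
best_threshold = thresholds[int(np.argmin(total_costs))]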
49. Fairness metrics
Class-based metrics
● Divide your dataset into groups based on a protected feature (race, sex, etc.) -> get privileged and unprivileged groups
● Calculate the True Positive Rate for both groups
● Calculate the difference in TPR -> get the Equality of Opportunity metric (see the sketch below)
● Blog post on fairness (link)
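Not from the talk: a minimal sketch, assuming a boolean array `group` (True for the privileged group) aligned with y_true and y_pred_class from the earlier examples.
import numpy as np

y_true_arr = np.asarray(y_true).astype(bool)
y_pred_arr = np.asarray(y_pred_class).astype(bool)
group = np.asarray(group, dtype=bool)

def tpr(actual, predicted):
    return (actual & predicted).sum() / actual.sum()  # TP / (TP + FN)

equality_of_opportunity = tpr(y_true_arr[group], y_pred_arr[group]) - tpr(y_true_arr[~group], y_pred_arr[~group])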
51. ROC curve
Score-based metrics
What is it?
● visualizes the tradeoff between true
positive rate (TPR) and false positive
rate (FPR).
● for every threshold, we calculate TPR
and FPR and plot it on one chart.
52. ROC curve
Score-based metrics
How to plot it?
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc

# y_pred: predicted probabilities for both classes, shape (n_samples, 2)
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
53. ROC curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Care equally about both positive and negative class
54. ROC AUC score
Score-based metrics
What is it?
● One number that summarizes ROC curve
● Area under the ROC curve (integral)
● Alternatively, rank correlation between
predictions and targets (link)
● Looks at the entire confusion matrix
55. ROC AUC score
Score-based metrics
How to calculate it?
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_pred_pos)
56. ROC AUC score
Score-based metrics
When to use it?
● You care about ranking predictions (not about getting calibrated probabilities)
● Do not use when data heavily imbalanced and you care only about the positive class (link)
● When you care equally about the positive and negative classes
57. Precision Recall curve
Score-based metrics
What is it?
● visualizes the tradeoff between
precision and recall
● for every threshold, we calculate
precision and recall and plot it on one
chart
58. Precision Recall curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
59. Precision Recall curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Want to find a good threshold for class assignment
● Care more about the positive class
60. PR AUC score | Average precision
Score-based metrics
What is it?
● One number that summarizes Precision
Recall curve
● Area under the Precision Recall curve
(integral)
● Doesn’t look at True Negatives!
61. PR AUC score | Average precision
Score-based metrics
How to calculate it?
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
62. PR AUC score | Average precision
Score-based metrics
When to use it?
● Data is heavily imbalanced and you care only about the positive class (link)
● You care more about the positive than negative class
● you want to choose the threshold that fits the business problem
● you want to communicate precision/recall decision
65. Brier score
Score-based metrics
When to use it?
● When you care about calibrated probabilities
● Why care about calibration?
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = [0, 1, 1, 0, 1, 1, 1, 0]
y_pred_v1 = [0.28, 0.35, 0.32, 0.29, 0.34, 0.38, 0.37, 0.31]
y_pred_v2 = [0.18, 0.95, 0.92, 0.19, 0.94, 0.98, 0.97, 0.21]

roc_auc_score(y_true, y_pred_v1), roc_auc_score(y_true, y_pred_v2)          # 1.000, 1.000
brier_score_loss(y_true, y_pred_v1), brier_score_loss(y_true, y_pred_v2)    # 0.295, 0.0158

Both prediction sets rank the examples perfectly (identical ROC AUC), but only v2 gives calibrated probabilities, which the Brier score rewards.
66. Cumulative gain chart
Score-based metrics
What is it?
● Shows how much your model gains
over the random model
● Calculate it by:
○ Order predictions
○ Calculate fraction of True
Positives for your model and
random model
○ Plot them on one chart
67. Cumulative gain chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
68. Cumulative gain chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
69. Lift chart
Score-based metrics
What is it?
● Shows how much your model gains over
the random model
● Calculate it by:
○ Order predictions
○ Calculate fraction of True Positives
for your model and random model
○ Calculate the ratio of your model
gains over random model
○ Plot them on one chart
70. Lift chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
71. Lift chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
73. Precision vs Recall
Common This vs That
Precision
● Looks at True Positives and False Positives
● Only the guilty should be in jail
● Higher threshold
○ fewer predicted positives
○ higher precision
○ lower recall
Recall
● Looks at True Positives and False Negatives
● Put all the guilty in jail
● Lower threshold
○ more predicted positives
○ higher recall
○ lower precision
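Not from the talk: a quick way to see this tradeoff on your own predictions, assuming y_true and y_pred_pos as in the earlier examples.
from sklearn.metrics import precision_score, recall_score

for t in [0.3, 0.5, 0.7]:  # arbitrary example thresholds
    y_pred_class = y_pred_pos > t
    print(t, precision_score(y_true, y_pred_class), recall_score(y_true, y_pred_class))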
74. F1 score vs Accuracy
Common This vs That
F1 score
● Doesn’t look at True Negatives
● You care more about the positive class
● Balances Precision and Recall
Accuracy
● Looks at all elements of the confusion matrix
● You care equally about the positive and negative class
● Very bad for imbalanced problems
75. ROC AUC vs PR AUC
Common This vs That
ROC AUC
● Looks at all elements of the confusion matrix (over all thresholds)
● You care equally about the positive and negative class
● Data is not heavily imbalanced
PR AUC
● Doesn’t look at True Negatives (over all thresholds)
● You care more about the positive class
● Data is heavily imbalanced
76. F1 score vs ROC AUC
Common This vs That
F1 score
● Class-based metric -> you need to choose a threshold
● You care more about the positive than the negative class
● Doesn’t look at True Negatives
● Easier to communicate
ROC AUC
● Score-based metric
● Good for ranking predictions
● You care about both the positive and negative class
● Your data is not heavily imbalanced
78. Materials
● Slides available on twitter (@NeptuneML) linkedin/slideshare (link)
● Github repository github.com/neptune-ml (link)
● Blog post “24 Evaluation Metrics for Binary Classification (And When to Use Them)”
● Blog post “F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?”
Extras