Evaluation Metrics for Binary
Classification: The Ultimate Guide
jakub@neptune.ml
@NeptuneML
https://neptune.ml/blog
Jakub Czakon
● Intro
● Class-based metrics
● Score-based metrics
● Common “This vs That”
● Extras
Agenda
Intro
Intro
Evaluation Metric:
● is a model performance indicator/proxy
● (strongly) depends on the problem
● rarely maps 1-1 to your business problem
● is not a guarantee of performance on other metrics
Intro
Class-based metrics Score-based metrics
import lightgbm
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
y_test_pred = model.predict_proba(X_test)[:, 1]  # scores for the positive class
y_test_class = y_test_pred > threshold           # hard class assignment
metric(y_test_true, y_test_class)  # class-based metrics
metric(y_test_true, y_test_pred)   # score-based metrics
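For example, a minimal sketch of the two families side by side (assuming the y_test_true, y_test_pred and threshold variables from the snippet above):
from sklearn.metrics import accuracy_score, roc_auc_score
# class-based metric: needs hard class assignments
accuracy = accuracy_score(y_test_true, y_test_pred > threshold)
# score-based metric: works directly on the predicted scores
roc_auc = roc_auc_score(y_test_true, y_test_pred)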
Class-based metrics
Class-based metrics
              Predicted 0       Predicted 1
Actual 0      True Negative     False Positive
Actual 1      False Negative    True Positive
What is it?
Confusion Matrix
Confusion Matrix
Class-based metrics
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots()
cm = confusion_matrix(y_true, y_pred_class)
sns.heatmap(cm, cmap=plt.get_cmap('Blues'), annot=True, fmt='g', ax=ax)
ax.set_xlabel('predicted values')
ax.set_ylabel('actual values')
How to plot it?
Confusion Matrix
Class-based metrics
● Pretty much always
● I prefer raw counts to normalized values (easier to spot problems)
When to use it?
False Positive Rate | Type I error
Class-based metrics
What is it?
● When we predict the positive class for something that isn’t
● Fraction of actual negatives flagged as positive (false alerts)
False Positive Rate | Type I error
Class-based metrics
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
How to calculate it?
False Positive Rate | Type I error
Class-based metrics
How to choose a threshold?
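The chart from this slide is not reproduced here; a minimal sketch of how such a threshold sweep could be computed (assuming y_true and y_pred_pos from the snippets above):
import numpy as np
from sklearn.metrics import confusion_matrix
thresholds = np.linspace(0.0, 1.0, 101)
fpr_per_threshold = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_pos > t).ravel()
    fpr_per_threshold.append(fp / (fp + tn))
# plot thresholds against fpr_per_threshold to pick a cutoff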
False Positive Rate | Type I error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of dealing with an alert is high
False Negative Rate | Type II error
Class-based metrics
What is it?
● When we don’t predict something when it is there
● Fraction of missed fraudulent transactions
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
False Negative Rate | Type II error
Class-based metrics
How to calculate it?
False Negative Rate | Type II error
Class-based metrics
How to choose a threshold?
False Negative Rate | Type II error
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● if the cost of missing a positive (e.g. letting a fraudulent transaction through) is high
True Negative Rate | Specificity
Class-based metrics
What is it?
● how good we are at predicting
negative class
● same axis as False Positive Rate
● How many non-fraudulent transactions are marked as clean
True Negative Rate | Specificity
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)
True Negative Rate | Specificity
Class-based metrics
How to choose a threshold?
True Negative Rate | Specificity
Class-based metrics
When to use it?
● rarely used alone but can be auxiliary metric
● When you want to feel good when you say
“you are healthy” or “this transaction is clean”
True Positive Rate | Recall | Sensitivity
Class-based metrics
What is it?
● how good we are at finding positive
class members
● put all guilty in prison
● same axis as False Negative Rate
True Positive Rate | Recall | Sensitivity
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)
True Positive Rate | Recall | Sensitivity
Class-based metrics
How to choose a threshold?
True Positive Rate | Recall | Sensitivity
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● You want to catch all fraudulent transactions
● False alerts are cheap to process
Positive Predictive Value | Precision
Class-based metrics
What is it?
● how accurate we are when we predict the positive class
● Only guilty people should be in prison
Positive Predictive Value | Precision
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp / (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
Positive Predictive Value | Precision
Class-based metrics
How to choose a threshold?
Class-based metrics
When to use it?
● Rarely used alone but can be auxiliary metric
● False alerts are expensive to process
● You want to catch only fraudulent transactions
Positive Predictive Value | Precision
Accuracy
Class-based metrics
What is it?
● Fraction of correctly classified observations (positive and negative)
Accuracy
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
# or simply
accuracy_score(y_true, y_pred_class)
Accuracy
Class-based metrics
How to choose a threshold?
Accuracy
Class-based metrics
When to use it?
● When your problem is balanced
● When every class is equally important to you
● When you need something easy-to-explain to stakeholders
F score
Class-based metrics
What is it?
● Combines Precision and Recall into one score
● Weighted harmonic mean: Fbeta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
● Doesn’t care about True Negatives
● F1 score (beta=1) -> harmonic mean of precision and recall
● F2 score (beta=2) -> 2x emphasis on recall
F score
Class-based metrics
How to calculate it?
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)
F score
Class-based metrics
How to choose a threshold?
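The chart slides here plot the F score against the threshold; a minimal sketch of such a sweep (assuming y_true and y_pred_pos as before):
import numpy as np
from sklearn.metrics import fbeta_score
thresholds = np.linspace(0.01, 0.99, 99)
f1_per_threshold = [
    fbeta_score(y_true, (y_pred_pos > t).astype(int), beta=1)
    for t in thresholds
]
best_threshold = thresholds[int(np.argmax(f1_per_threshold))]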
F score
Class-based metrics
When to use it?
F1 score
● my go-to metric for binary classification
● easy-to-explain to stakeholders
F2 score
● When you need to adjust the precision/recall tradeoff
● when finding all fraudulent transactions is more important than the precision of the alerts
Cohen Kappa
Class-based metrics
What is it?
● how much better your model is than a random classifier: kappa = (p0 - pe) / (1 - pe)
● Observed agreement p0 : accuracy
● Expected agreement pe : accuracy of the random classifier
● Random classifier: samples randomly according to class frequencies
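As a sanity check, a small sketch of the same quantity computed directly from the confusion matrix (illustration only, assuming y_true and y_pred_class as above):
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
n = tn + fp + fn + tp
p0 = (tp + tn) / n                                             # observed agreement (accuracy)
pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2  # agreement expected by chance
kappa = (p0 - pe) / (1 - pe)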
Cohen Kappa
Class-based metrics
How to calculate it?
from sklearn.metrics import
cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
Cohen Kappa
Class-based metrics
How to choose a threshold?
Cohen Kappa
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need a
metric that is easy to explain
Matthews Correlation Coefficient
Class-based metrics
What is it?
● Correlation between predicted classes
and the ground truth labels
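For reference, a sketch of the underlying formula computed from the confusion matrix entries (illustration only; in practice just call matthews_corrcoef):
import numpy as np
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
numerator = tp * tn - fp * fn
denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator if denominator > 0 else 0.0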
Matthews Correlation Coefficient
Class-based metrics
How to calculate it?
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
Matthews Correlation Coefficient
Class-based metrics
How to choose a threshold?
Matthews Correlation Coefficient
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need
a metric that is easy to explain
Dollar-focused metrics
Class-based metrics
● Get the cost of a False Negative (letting a fraudulent transaction through)
● Get the cost of a False Positive (blocking a clean transaction)
● Find the optimal threshold in dollars $ -> optimize the business problem directly (see the sketch below)
● Blog post from Airbnb (link)
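A minimal sketch of the idea with made-up cost values (the cost numbers below are placeholders, not from the original slides):
import numpy as np
from sklearn.metrics import confusion_matrix
cost_fn = 100.0  # hypothetical cost of letting a fraudulent transaction through
cost_fp = 5.0    # hypothetical cost of blocking a clean transaction
thresholds = np.linspace(0.01, 0.99, 99)
total_costs = []
for t in thresholds:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_pos > t).ravel()
    total_costs.append(fn * cost_fn + fp * cost_fp)
best_threshold = thresholds[int(np.argmin(total_costs))]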
Fairness metrics
Class-based metrics
● Divide your dataset into groups based on a protected feature (race, sex, etc.) -> get privileged and unprivileged groups
● Calculate the True Positive Rate for both groups
● Calculate the difference in TPR -> get the Equality of Opportunity metric (see the sketch below)
● Blog post on fairness (link)
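A rough sketch of the Equality of Opportunity calculation described above (illustration only; the names y_true, y_pred_class and the boolean mask is_privileged are assumptions):
import numpy as np
def true_positive_rate(labels, predictions):
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    positives = labels == 1
    return (predictions[positives] == 1).mean()
y_true = np.asarray(y_true)
y_pred_class = np.asarray(y_pred_class)
tpr_privileged = true_positive_rate(y_true[is_privileged], y_pred_class[is_privileged])
tpr_unprivileged = true_positive_rate(y_true[~is_privileged], y_pred_class[~is_privileged])
equality_of_opportunity_difference = tpr_unprivileged - tpr_privileged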
Score-based metrics
ROC curve
Score-based metrics
What is it?
● visualizes the tradeoff between true
positive rate (TPR) and false positive
rate (FPR).
● for every threshold, we calculate TPR
and FPR and plot it on one chart.
ROC curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
ROC curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Care equally about both positive and negative class
ROC AUC score
Score-based metrics
What is it?
● One number that summarizes ROC curve
● Area under the ROC curve (integral)
● Alternatively, rank correlation between
predictions and targets (link)
● Looks at the entire confusion matrix
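One way to see the ranking interpretation mentioned above: ROC AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score (ties counted as half). A small sketch (illustration only, assuming y_true and y_pred_pos):
import numpy as np
scores = np.asarray(y_pred_pos)
labels = np.asarray(y_true)
pos_scores = scores[labels == 1]
neg_scores = scores[labels == 0]
pair_diffs = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (pair_diffs > 0).mean() + 0.5 * (pair_diffs == 0).mean()
# pairwise_auc matches roc_auc_score(y_true, y_pred_pos)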
ROC AUC score
Score-based metrics
How to calculate it?
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_pred_pos)
ROC AUC score
Score-based metrics
When to use it?
● You care about ranking predictions (not about getting calibrated probabilities)
● Do not use when data heavily imbalanced and you care only about the positive class (link)
● When you care equally about the positive and negative classes
Precision Recall curve
Score-based metrics
What is it?
● visualizes the tradeoff between
precision and recall
● for every threshold, we calculate
precision and recall and plot it on one
chart
Precision Recall curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
Precision Recall curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Want to find a good threshold for class assignment
● Care more about the positive class
PR AUC score | Average precision
Score-based metrics
What is it?
● One number that summarizes Precision
Recall curve
● Area under the Precision Recall curve
(integral)
● Doesn’t look at True Negatives!
PR AUC score | Average precision
Score-based metrics
How to calculate it?
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
PR AUC score | Average precision
Score-based metrics
When to use it?
● Data is heavily imbalanced and you care only about the positive class (link)
● You care more about the positive than negative class
● you want to choose the threshold that fits the business problem
● you want to communicate the precision/recall decision
Brier score
Score-based metrics
What is it?
● Mean squared distance between your predicted probabilities and the true labels
Brier score
Score-based metrics
How to calculate it?
from sklearn.metrics import brier_score_loss
brier_score_loss(y_true, y_pred_pos)
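The underlying formula is just the mean squared error between predicted probabilities and labels; a one-line sketch (assuming y_true and y_pred_pos are array-like):
import numpy as np
brier = np.mean((np.asarray(y_pred_pos) - np.asarray(y_true)) ** 2)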
Brier score
Score-based metrics
When to use it?
● When you care about calibrated probabilities
● Why care about calibration?
y_true = [0, 1, 1, 0, 1, 1, 1, 0]
y_pred_v1 = [0.28, 0.35, 0.32, 0.29, 0.34, 0.38, 0.37, 0.31]
y_pred_v2 = [0.18, 0.95, 0.92, 0.19, 0.94, 0.98, 0.97, 0.21]
roc_auc_score(y_true, y_pred_v1), roc_auc_score(y_true, y_pred_v2)       # 1.000, 1.000
brier_score_loss(y_true, y_pred_v1), brier_score_loss(y_true, y_pred_v2) # 0.295, 0.0158
Cumulative gain chart
Score-based metrics
What is it?
● Shows how much your model gains over the random model
● Calculate it by (see the sketch after this list):
○ Order predictions by score
○ Calculate the fraction of True Positives captured by your model and by the random model
○ Plot them on one chart
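A minimal sketch of the calculation described above (illustration only, assuming y_true and y_pred_pos):
import numpy as np
labels = np.asarray(y_true)
order = np.argsort(-np.asarray(y_pred_pos))      # highest scores first
sorted_labels = labels[order]
fraction_of_data = np.arange(1, len(labels) + 1) / len(labels)
model_gain = np.cumsum(sorted_labels) / sorted_labels.sum()  # fraction of positives captured so far
random_gain = fraction_of_data                               # the random model captures positives proportionally
# plot fraction_of_data against model_gain and random_gain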
Cumulative gain chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
Cumulative gain chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
Lift chart
Score-based metrics
What is it?
● Shows how much your model gains over the random model
● Calculate it by (see the sketch after this list):
○ Order predictions by score
○ Calculate the fraction of True Positives captured by your model and by the random model
○ Calculate the ratio of your model’s gain to the random model’s gain
○ Plot them on one chart
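Continuing the cumulative-gain sketch above, the lift is simply the ratio of the model's gain to the random baseline (illustration only):
lift = model_gain / random_gain
# plot fraction_of_data against lift; a lift of 3 at 10% means the top 10% of
# predictions contains 3x more positives than a random 10% sample would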
Lift chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
Lift chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
Common This vs That
Precision vs Recall
Common This vs That
● Looks at True Positives and False
Positives
● Only guilty should be in jail
● Higher threshold
○ less predicted positives
○ higher precision
○ lower recall
Precision
● Looks at True Positives and False
Negatives
● Put all guilty in jail
● Lower threshold
○ more predicted positives
○ higher recall
○ lower precision
Recall
F1 score vs Accuracy
Common This vs That
● Doesn’t look at True Negatives
● You care more about the positive
class
● Balances Precision and Recall
F1 score
● Looks at all elements from the
confusion matrix
● You care equally about positive
and negative class
● Very bad for imbalanced problems
Accuracy
ROC AUC vs PR AUC
Common This vs That
● Looks at all elements from the
confusion matrix (for all
thresholds)
● You care equally about positive
and negative class
● Data is not heavily imbalanced
ROC AUC
● Doesn’t look at True Negatives (for
all thresholds)
● You care more about positive
class
● Data is heavily imbalanced
PR AUC
F1 score vs ROC AUC
Common This vs That
● Class based metric -> you need to
choose a threshold
● You care more about positive than
negative class
● Doesn’t look at True Negatives
● Easier to communicate
F1 score
● Score based metric
● Good for ranking predictions
● You care both about positive and
negative class
● Your data is not heavily imbalanced
ROC AUC
Extras
Materials
● Slides available on twitter (@NeptuneML) linkedin/slideshare (link)
● Github repository github.com/neptune-ml (link)
● Blog post “24 Evaluation Metrics for Binary Classification (And
When to Use Them)”
● Blog post “F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which
Evaluation Metric Should You Choose?”
Extras
Metrics cheatsheet
● Download the cheatsheet (link)
Extras
Binary metrics logger
● Log binary metrics with one
function call
● pip install neptune-contrib (link)
● example (link)
Extras
import neptune
import neptunecontrib.monitoring.metrics as npt_metrics
neptune.create_experiment()
npt_metrics.log_binary_classification_metrics(y_true, y_pred)
Fairness metrics logger
● Log fairness metrics with one function call
● pip install neptune-contrib (link)
● example (link)
Extras
import neptune
import neptunecontrib.monitoring.fairness as npt_fair
neptune.create_experiment()
npt_fair.log_fairness_classification_metrics(
y_true, y_pred, y_class, x_protected,
favorable_label=0, unfavorable_label=1,
privileged_groups={'race':[3]},
unprivileged_groups={'race':[1,2,4]})
The most lightweight experiment
management tool that fits any workflow.
jakub@neptune.ml
@NeptuneML
https://neptune.ml/blog
Jakub Czakon
