Evaluation Metrics for Binary Classification: The Ultimate Guide
jakub@neptune.ml
@NeptuneML
https://neptune.ml/blog
Jakub Czakon
● Intro
● Class-based metrics
● Score-based metrics
● Common “This vs That”
● Extras
Agenda
Intro
Intro
Evaluation Metric:
● is a model performance indicator/proxy
● (strongly) depends on the problem
● rarely maps 1-1 to your business problem
● is not a guarantee of performance on other metrics
Intro
Class-based metrics Score-based metrics
import lightgbm

model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
y_test_pred = model.predict_proba(X_test)[:, 1]  # probability of the positive class
y_test_class = y_test_pred > threshold           # hard class assignment at a chosen threshold
metric(y_test_true, y_test_class)  # class-based metrics take hard class predictions
metric(y_test_true, y_test_pred)   # score-based metrics take raw scores/probabilities
Class-based metrics
Confusion Matrix
Class-based metrics
What is it?
(confusion matrix diagram: Actual vs Predicted, with True Negative, False Negative, False Positive, and True Positive quadrants)
Confusion Matrix
Class-based metrics
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
fig, ax = plt.subplots()
cm = confusion_matrix(y_true, y_pred_class)
sns.heatmap(cm, cmap=plt.get_cmap('Blues'), annot=True, fmt='g', ax=ax)
ax.set_xlabel('predicted values')
ax.set_ylabel('actual values')
How to plot it?
Confusion Matrix
Class-based metrics
● Pretty much always
● I prefer raw counts to normalized values (easier to spot problems)
When to use it?
False Positive Rate | Type I error
Class-based metrics
What is it?
● When we predict something that isn’t
● Fraction of false alerts
False Positive Rate | Type I error
Class-based metrics
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true,
y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
How to calculate it?
False Positive Rate | Type I error
Class-based metrics
How to choose a threshold?
False Positive Rate | Type I error
Class-based metrics
When to use it?
● rarely used alone but can be an auxiliary metric
● if the cost of dealing with an alert is high
False Negative Rate | Type II error
Class-based metrics
What is it?
● When we don’t predict something when it is
● fraction of missed fraudulent transactions
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
False Negative Rate | Type II error
Class-based metrics
How to calculate it?
False Negative Rate | Type II error
Class-based metrics
How to choose a threshold?
False Negative Rate | Type II error
Class-based metrics
When to use it?
● rarely used alone but can be an auxiliary metric
● if the cost of missing a positive case (e.g., letting a fraudulent transaction through) is high
True Negative Rate | Specificity
Class-based metrics
What is it?
● how good we are at predicting the negative class
● same axis as False Positive Rate
● How many non-fraudulent transactions are marked as clean
True Negative Rate | Specificity
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)
True Negative Rate | Specificity
Class-based metrics
How to choose a threshold?
True Negative Rate | Specificity
Class-based metrics
When to use it?
● rarely used alone but can be an auxiliary metric
● When you want to feel good when you say “you are healthy” or “this transaction is clean”
True Positive Rate | Recall | Sensitivity
Class-based metrics
What is it?
● how good we are at finding positive class members
● put all the guilty in prison
● same axis as False Negative Rate
True Positive Rate | Recall | Sensitivity
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)
True Positive Rate | Recall | Sensitivity
Class-based metrics
How to choose a threshold?
True Positive Rate | Recall | Sensitivity
Class-based metrics
When to use it?
● Rarely used alone but can be an auxiliary metric
● You want to catch all fraudulent transactions
● False alerts are cheap to process
Positive Predictive Value | Precision
Class-based metrics
What is it?
● how accurate we are when we predict the positive class
● Only guilty people should be in prison
Positive Predictive Value | Precision
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp/ (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
Positive Predictive Value | Precision
Class-based metrics
How to choose a threshold?
Class-based metrics
When to use it?
● Rarely used alone but can be an auxiliary metric
● False alerts are expensive to process
● You want to catch only fraudulent transactions
Positive Predictive Value | Precision
Accuracy
Class-based metrics
What is it?
● Fraction of correctly classified observations (positive and negative)
Accuracy
Class-based metrics
How to calculate it?
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
# or simply
accuracy_score(y_true, y_pred_class)
Accuracy
Class-based metrics
How to choose a threshold?
Accuracy
Class-based metrics
When to use it?
● When your problem is balanced
● When every class is equally important to you
● When you need something easy-to-explain to stakeholders
F score
Class-based metrics
What is it?
● Combines Precision and Recall into one score
● Weighted harmonic mean of Precision and Recall
● Doesn’t care about True Negatives
● F1 score (beta=1) -> harmonic mean
● F2 score (beta=2) -> 2x emphasis on recall
F score
Class-based metrics
How to calculate it?
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)
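Under the hood, the F-beta score is just a weighted harmonic mean of precision and recall. A minimal sketch, assuming precision and recall have already been computed at the chosen threshold:

# F-beta as a weighted harmonic mean of precision and recall (sketch)
def f_beta(precision, recall, beta=1.0):
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)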
F score
Class-based metrics
How to choose a threshold?
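A common approach is to sweep candidate thresholds and keep the one with the best F-beta score. A minimal sketch, assuming y_true and y_pred_pos (positive-class scores) are numpy arrays:

import numpy as np
from sklearn.metrics import fbeta_score

# Sketch: pick the threshold that maximizes the F-beta score
thresholds = np.linspace(0.01, 0.99, 99)
scores = [fbeta_score(y_true, y_pred_pos > t, beta=1) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]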
F score
Class-based metrics
When to use it?
F1 score
● my go-to metric for binary classification
● easy-to-explain to stakeholders
F2 score
● When you need to adjust the precision-recall tradeoff
● when catching fraudulent transactions is more important than being correct about every alert
Cohen Kappa
Class-based metrics
What is it?
● how much better your model is than a random classifier
● Observed agreement p0: accuracy
● Expected agreement pe: accuracy of the random classifier
● Random classifier: samples randomly according to class frequencies
Cohen Kappa
Class-based metrics
How to calculate it?
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
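To see where the number comes from, kappa can also be computed by hand. A minimal sketch, assuming tn, fp, fn, tp come from confusion_matrix(...).ravel():

# Cohen's kappa from observed (p0) and expected (pe) agreement (sketch)
total = tn + fp + fn + tp
p0 = (tp + tn) / total  # observed agreement = accuracy
pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2  # expected agreement of a random classifier
kappa = (p0 - pe) / (1 - pe)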
Cohen Kappa
Class-based metrics
How to choose a threshold?
Cohen Kappa
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need a
metric that is easy to explain
Matthews Correlation Coefficient
Class-based metrics
What is it?
● Correlation between predicted classes and the ground truth labels
Matthews Correlation Coefficient
Class-based metrics
How to calculate it?
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
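Equivalently, MCC can be computed directly from the confusion-matrix counts. A minimal sketch, assuming tn, fp, fn, tp come from confusion_matrix(...).ravel():

# Matthews correlation coefficient from confusion-matrix counts (sketch)
mcc = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5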
Matthews Correlation Coefficient
Class-based metrics
How to choose a threshold?
Matthews Correlation Coefficient
Class-based metrics
When to use it?
● Unpopular metric for classification
● Works well for unbalanced problems
● Good substitute for accuracy when you need
a metric that is easy to explain
Dollar-focused metrics
Class-based metrics
● Get the cost of a False Negative (letting a fraudulent transaction through)
● Get the cost of a False Positive (blocking a clean transaction)
● Find the optimal threshold in dollars $ -> optimize the business problem directly (see the sketch below)
● Blog post from Airbnb (link)
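A minimal sketch of the dollar-based threshold sweep, assuming y_true and y_pred_pos are available; the cost values are hypothetical and should come from your business:

import numpy as np
from sklearn.metrics import confusion_matrix

COST_FN = 100.0  # hypothetical cost of letting a fraudulent transaction through
COST_FP = 5.0    # hypothetical cost of blocking a clean transaction

def total_cost(threshold):
    y_pred_class = y_pred_pos > threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
    return fp * COST_FP + fn * COST_FN

thresholds = np.linspace(0.01, 0.99, 99)
best_threshold = min(thresholds, key=total_cost)  # threshold with the lowest total cost in dollars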
Fairness metrics
Class-based metrics
● Divide your dataset into groups based on a protected feature (race, sex, etc.) -> get privileged and unprivileged groups
● Calculate the True Positive Rate for both groups
● Calculate the difference in TPR -> get the Equality of Opportunity metric (see the sketch below)
● Blog post on fairness (link)
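A minimal sketch of the Equality of Opportunity calculation, assuming y_true and y_pred_class are numpy arrays and privileged is a hypothetical boolean mask derived from the protected feature:

import numpy as np

def true_positive_rate(y_true, y_pred_class):
    tp = np.sum((y_true == 1) & (y_pred_class == 1))
    fn = np.sum((y_true == 1) & (y_pred_class == 0))
    return tp / (tp + fn)

# Equality of Opportunity: TPR difference between the privileged and unprivileged group
eo_gap = (true_positive_rate(y_true[privileged], y_pred_class[privileged])
          - true_positive_rate(y_true[~privileged], y_pred_class[~privileged]))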
Score-based metrics
ROC curve
Score-based metrics
What is it?
● visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR)
● for every threshold, we calculate TPR and FPR and plot it on one chart
ROC curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
ROC curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Care equally about both positive and negative class
ROC AUC score
Score-based metrics
What is it?
● One number that summarizes the ROC curve
● Area under the ROC curve (integral)
● Alternatively: rank correlation between predictions and targets (link)
● Looks at the entire confusion matrix
ROC AUC score
Score-based metrics
How to calculate it?
from sklearn.metrics import roc_auc_score
roc_auc_score(y_true, y_pred_pos)
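The rank interpretation can be checked by hand: ROC AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counted as 0.5). A minimal sketch, assuming y_true and y_pred_pos are 1-D arrays:

import numpy as np

pos = np.asarray(y_pred_pos)[np.asarray(y_true) == 1]  # scores of positive examples
neg = np.asarray(y_pred_pos)[np.asarray(y_true) == 0]  # scores of negative examples
auc = np.mean(pos[:, None] > neg[None, :]) + 0.5 * np.mean(pos[:, None] == neg[None, :])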
ROC AUC score
Score-based metrics
When to use it?
● You care about ranking predictions (not about getting calibrated probabilities)
● Do not use when the data is heavily imbalanced and you care only about the positive class (link)
● When you care equally about the positive and negative classes
Precision Recall curve
Score-based metrics
What is it?
● visualizes the tradeoff between precision and recall
● for every threshold, we calculate precision and recall and plot it on one chart
Precision Recall curve
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
Precision Recall curve
Score-based metrics
When to use it?
● You want to see model performance over all thresholds
● Want to visually compare multiple models
● Want to find a good threshold for class assignment
● Care more about the positive class
PR AUC score | Average precision
Score-based metrics
What is it?
● One number that summarizes the Precision Recall curve
● Area under the Precision Recall curve (integral)
● Doesn’t look at True Negatives!
PR AUC score | Average precision
Score-based metrics
How to calculate it?
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
PR AUC score | Average precision
Score-based metrics
When to use it?
● Data is heavily imbalanced and you care only about the positive class (link)
● You care more about the positive than the negative class
● you want to choose the threshold that fits the business problem
● you want to communicate the precision/recall decision
Brier score
Score-based metrics
What is it?
● Distance from your predictions to the true value (label)
Brier score
Score-based metrics
How to calculate it?
from sklearn.metrics import brier_score_loss
brier_score_loss(y_true, y_pred_pos)
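Equivalently, by hand: the Brier score is the mean squared distance between the predicted probability and the true label. A minimal sketch, assuming y_true and y_pred_pos are array-like:

import numpy as np

# Brier score = mean squared error between predicted probability and true label (sketch)
brier = np.mean((np.asarray(y_pred_pos) - np.asarray(y_true)) ** 2)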
Brier score
Score-based metrics
When to use it?
● When you care about calibrated probabilities
● Why care about calibration?
y_true = [0, 1, 1, 0, 1, 1, 1, 0]
y_pred_v1 = [0.28, 0.35, 0.32, 0.29, 0.34, 0.38, 0.37, 0.31]
y_pred_v2 = [0.18, 0.95, 0.92, 0.19, 0.94, 0.98, 0.97, 0.21]
roc_auc_score(y_true, y_pred_v1), roc_auc_score(y_true, y_pred_v2)
# -> 1.000, 1.000
brier_score_loss(y_true, y_pred_v1), brier_score_loss(y_true, y_pred_v2)
# -> 0.295, 0.0158
Cumulative gain chart
Score-based metrics
What is it?
● Shows how much your model gains over the random model
● Calculate it by:
○ Order predictions
○ Calculate the fraction of True Positives for your model and the random model
○ Plot them on one chart
Cumulative gain chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
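The steps above can also be computed by hand. A rough sketch, assuming y_true and y_pred_pos are 1-D arrays:

import numpy as np

order = np.argsort(y_pred_pos)[::-1]  # sort examples by predicted score, descending
gains = np.cumsum(np.asarray(y_true)[order]) / np.sum(y_true)  # fraction of positives captured so far
fraction_targeted = np.arange(1, len(y_true) + 1) / len(y_true)  # fraction of the dataset targeted
# the random-model baseline is simply gains == fraction_targeted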
Cumulative gain chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
Lift chart
Score-based metrics
What is it?
● Shows how much your model gains over the random model
● Calculate it by:
○ Order predictions
○ Calculate the fraction of True Positives for your model and the random model
○ Calculate the ratio of your model’s gains over the random model
○ Plot them on one chart
Lift chart
Score-based metrics
How to plot it?
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
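By hand, the lift at each cutoff is the ratio of the model’s cumulative gain to the random-model gain. A rough sketch, assuming y_true and y_pred_pos are 1-D arrays:

import numpy as np

order = np.argsort(y_pred_pos)[::-1]
gains = np.cumsum(np.asarray(y_true)[order]) / np.sum(y_true)
fraction_targeted = np.arange(1, len(y_true) + 1) / len(y_true)
lift = gains / fraction_targeted  # ratio of model gain to random-model gain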
Lift chart
Score-based metrics
When to use it?
● You want to select the most promising customers/transactions from a dataset
● You are looking for a good cutoff point
● Can be a good addition to ROC AUC score
Common This vs That
Precision vs Recall
Common This vs That
Precision
● Looks at True Positives and False Positives
● Only the guilty should be in jail
● Higher threshold:
○ fewer predicted positives
○ higher precision
○ lower recall
Recall
● Looks at True Positives and False Negatives
● Put all the guilty in jail
● Lower threshold:
○ more predicted positives
○ higher recall
○ lower precision
F1 score vs Accuracy
Common This vs That
F1 score
● Doesn’t look at True Negatives
● You care more about the positive class
● Balances Precision and Recall
Accuracy
● Looks at all elements from the confusion matrix
● You care equally about the positive and negative class
● Very bad for imbalanced problems
ROC AUC vs PR AUC
Common This vs That
ROC AUC
● Looks at all elements from the confusion matrix (for all thresholds)
● You care equally about the positive and negative class
● Data is not heavily imbalanced
PR AUC
● Doesn’t look at True Negatives (for all thresholds)
● You care more about the positive class
● Data is heavily imbalanced
F1 score vs ROC AUC
Common This vs That
F1 score
● Class-based metric -> you need to choose a threshold
● You care more about the positive than the negative class
● Doesn’t look at True Negatives
● Easier to communicate
ROC AUC
● Score-based metric
● Good for ranking predictions
● You care about both the positive and negative class
● Your data is not heavily imbalanced
Extras
Materials
● Slides available on twitter (@NeptuneML) and linkedin/slideshare (link)
● Github repository github.com/neptune-ml (link)
● Blog post “24 Evaluation Metrics for Binary Classification (And When to Use Them)”
● Blog post “F1 Score vs ROC AUC vs Accuracy vs PR AUC: Which Evaluation Metric Should You Choose?”
Extras
Metrics cheatsheet
● Download the cheatsheet (link)
Extras
Binary metrics logger
● Log binary metrics with one function call
● pip install neptune-contrib (link)
● example (link)
Extras
import neptune
import neptunecontrib.monitoring.metrics as npt_metrics
neptune.create_experiment()
npt_metrics.log_binary_classification_metrics(y_true, y_pred)
Fairness metrics logger
● Log fairness metrics with one function call
● pip install neptune-contrib (link)
● example (link)
Extras
import neptune
import neptunecontrib.monitoring.fairness as npt_fair
neptune.create_experiment()
npt_fair.log_fairness_classification_metrics(
y_true, y_pred, y_class, x_protected,
favorable_label=0, unfavorable_label=1,
privileged_groups={'race':[3]},
unprivileged_groups={'race':[1,2,4]})
The most lightweight experiment management tool that fits any workflow.
jakub@neptune.ml
@NeptuneML
https://neptune.ml/blog
Jakub Czakon
