Confusion Matrix and Classification Evaluation Metrics
Author: Yousef Alghofaili

Trust is a must when a decision-maker's judgment is critical. To decide how much
trust to give, we summarize all possible decision outcomes into four categories:
True Positives (TP), False Positives (FP), True Negatives (TN), and False
Negatives (FN). Together they form the confusion matrix, an overview of how
confused the decision-maker's judgments are. From the confusion matrix we
calculate different metrics that measure the quality of the outcomes, and these
measures tell us how much trust to place in the decision-maker (the classifier)
for a particular use case. This document discusses the most common
classification evaluation metrics, their focuses, and their limitations in a
straightforward and informative manner.
Metric Summary

Precision (Positive Predictive Value, PPV) = TP / (TP + FP)
Recall (Sensitivity, True Positive Rate)   = TP / (TP + FN)
Specificity (True Negative Rate)           = TN / (TN + FP)
NPV (Negative Predictive Value)            = TN / (TN + FN)
Accuracy          = (TP + TN) / (TP + TN + FP + FN)
F1-Score          = 2 × (Precision × Recall) / (Precision + Recall)
Balanced Accuracy = (Sensitivity + Specificity) / 2
Matthews Correlation Coefficient (MCC)
                  = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
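For readers who prefer code, here is a minimal Python sketch of the formulas
above, computed directly from raw confusion-matrix counts. The function name
and example counts are hypothetical, and division-by-zero guards are omitted
for brevity.

```python
from math import sqrt

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision   = tp / (tp + fp)                  # PPV
    recall      = tp / (tp + fn)                  # Sensitivity / TPR
    specificity = tn / (tn + fp)                  # TNR
    npv         = tn / (tn + fn)                  # Negative Predictive Value
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1          = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "npv": npv,
            "accuracy": accuracy, "f1": f1,
            "balanced_accuracy": balanced_accuracy, "mcc": mcc}

print(metrics(tp=40, fp=10, tn=35, fn=15))  # hypothetical counts
```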
Confusion Matrix

                    Predicted Negative   Predicted Positive
Actual Negative     TN                   FP
Actual Positive     FN                   TP

Outcome   Guess                  Fact
TN        This car is NOT red    This car is NOT red
TP        This car is red        This car is red
FN        This car is NOT red    This car is red
FP        This car is red        This car is NOT red
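In practice these four counts are usually extracted from predictions rather
than tallied by hand. A small sketch, assuming scikit-learn is available (the
labels below are made up): confusion_matrix returns rows as actual values and
columns as predicted values, with the negative class first, matching the
layout above.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical facts: 1 = "car is red"
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical guesses

# For binary labels, .ravel() unpacks the 2x2 matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=3
```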
Precision & Recall

Precision goal: avoid false positives. Example (FP): "This customer loves
steak!" No, this customer is vegan. Bad product recommendation → less
conversion → decrease in sales.

Recall goal: avoid false negatives. Example (FN): "The product has no defects."
A customer called... He's angry. Bad defect detector → bad quality → customer
dissatisfaction.

Common goal: correctly capture true positives.

We use both metrics when actual negatives are less relevant. For example,
googling "Confusion Matrix" will surface trillions of unrelated (negative) web
pages, such as a "Best Pizza Recipe!" page. Accounting for whether we have
correctly predicted that page and the like as negative is impractical.
Specificity & NPV

NPV goal: avoid false negatives. Example (FN): "They don't have cancer." No,
they should be treated! Bad diagnosis → no treatment → consequences.

Specificity goal: avoid false positives. Example (FP): "This person is a
criminal." They were detained for no reason. Bad predictive policing →
injustice.

Common goal: correctly capture true negatives.

We use both metrics when actual positives are less relevant; in essence, we aim
to rule out a phenomenon. For example, we want to know how many healthy people
(no disease detected) there are in a population, or how many trustworthy
(non-fraudulent) websites someone is visiting.
Hacks

The evaluation metrics explained so far are granular: each focuses on a single
angle of prediction quality, which can mislead us into thinking that a
predictive model is highly accurate. In practice, these metrics are not used
alone. Let us see how easy it is to manipulate each of them.
Precision Hacking

Precision is the ratio of correctly classified positive samples to the total
number of positive predictions. Hence the name, Positive Predictive Value.

Dataset: 50 positive samples, 50 negative samples.

The hack: predict almost all samples as negative, and get at least one positive
sample right. The confusion matrix becomes TP≈1, FN≈49, TN≈50, FP≈0, so

    Precision = TP / (TP + FP) ≈ 1 / (1 + 0) = 100%

Predicting positive samples only at a high confidence threshold can produce
this case. In addition, when positive samples disproportionately outnumber
negatives, false positives are probabilistically rarer, so precision tends to
be high.
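A quick numeric check of this scenario, using the counts from the illustration
above:

```python
# Precision hack: predict almost everything negative, get one positive right.
tp, fp, tn, fn = 1, 0, 50, 49

precision = tp / (tp + fp)   # 1.0  -> reported as "100% precision"
recall    = tp / (tp + fn)   # 0.02 -> the classifier is actually useless
print(precision, recall)
```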
Recall Hacking

Recall is the ratio of correctly classified positive samples to the total
number of actual positive samples. Hence the name, True Positive Rate.

Dataset: 50 positive samples, 50 negative samples.

The hack: predict all samples as positive. The confusion matrix becomes TP=50,
FP=50, TN=0, FN=0, so

    Recall = TP / (TP + FN) = 50 / (50 + 0) = 100%

Similar to precision, when positive samples disproportionately outnumber
negatives, the classifier will generally be biased towards positive-class
predictions to reduce the number of mistakes.
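The same check for the recall hack:

```python
# Recall hack: predict every sample as positive.
tp, fp, tn, fn = 50, 50, 0, 0

recall    = tp / (tp + fn)   # 1.0 -> "100% recall"
precision = tp / (tp + fp)   # 0.5 -> half of all positive calls are wrong
print(recall, precision)
```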
Specificity Hacking

Specificity is the ratio of correctly classified negative samples to the total
number of actual negative samples. Hence the name, True Negative Rate.

Dataset: 50 positive samples, 50 negative samples.

The hack: predict all samples as negative. The confusion matrix becomes TN=50,
FN=50, TP=0, FP=0, so

    Specificity = TN / (TN + FP) = 50 / (50 + 0) = 100%

Contrary to Recall (Sensitivity), Specificity focuses on the negative class.
Hence, we face this problem when negative samples are disproportionately more
numerous. Notice how the Balanced Accuracy metric intuitively addresses this
issue in subsequent pages.
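And for the specificity hack:

```python
# Specificity hack: predict every sample as negative.
tp, fp, tn, fn = 0, 0, 50, 50

specificity = tn / (tn + fp)   # 1.0 -> "100% specificity"
recall      = tp / (tp + fn)   # 0.0 -> no positive is ever found
print(specificity, recall)
```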
NPV Hacking

Negative Predictive Value is the ratio of correctly classified negative samples
to the total number of negative predictions. Hence the name.

Dataset: 50 positive samples, 50 negative samples.

The hack: predict almost all samples as positive, and get at least one negative
sample right. The confusion matrix becomes TP≈50, FP≈49, TN≈1, FN≈0, so

    NPV = TN / (TN + FN) ≈ 1 / (1 + 0) = 100%

Predicting negative samples only at a high confidence threshold can produce
this case. Also, when negative samples disproportionately outnumber positives,
false negatives are probabilistically rarer, so NPV tends to be high.
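And finally the NPV hack in numbers:

```python
# NPV hack: predict almost everything positive, get one negative right.
tp, fp, tn, fn = 50, 49, 1, 0

npv         = tn / (tn + fn)   # 1.0  -> "100% NPV"
specificity = tn / (tn + fp)   # 0.02 -> negatives are almost never caught
print(npv, specificity)
```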
As we have seen above, some metrics can misinform us about the actual
performance of a classifier. However, other metrics incorporate more
information about that performance. Nevertheless, all metrics can be "hacked"
in one way or another; hence, we commonly report multiple metrics to observe
the model's performance from multiple viewpoints.

Comprehensive Metrics
Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy treats all error types (false positives and false negatives) as equal.
However, equal is not always preferred.

The Accuracy Paradox: since accuracy assigns equal cost to all error types,
having significantly more positive samples than negatives makes accuracy biased
towards the larger class. In fact, the Accuracy Paradox is a direct "hack"
against the metric: assume you have 99 samples of class 1 and 1 sample of class
0. If your classifier predicts everything as class 1, it achieves an accuracy
of 99%.
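The paradox in numbers:

```python
# Accuracy Paradox: 99 samples of class 1, 1 sample of class 0, and a
# classifier that predicts class 1 for everything.
tp, fn = 99, 0   # every class-1 sample predicted correctly
tn, fp = 0, 1    # the single class-0 sample is misclassified

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99 -> 99% accuracy from a model that never predicts class 0
```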
F1-Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1-Score combines precision and recall in a way that is sensitive to a decrease
in either of the two (it is their harmonic mean). Note that the issues below
also apply to the Fβ score in general (see the Fβ Score section at the end of
this document).

Asymmetric measure: F1-Score is asymmetric to the choice of which class is
negative or positive. Swapping the positive and negative classes will not
produce a similar score in most cases, as the sketch below shows.

True negatives absent: F1-Score does not account for true negatives. For
example, correctly diagnosing a patient with no disease (a true negative) has
no impact on the F1-Score.
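A small sketch of the asymmetry, assuming scikit-learn is available (the labels
are made up); pos_label selects which class counts as positive:

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

print(f1_score(y_true, y_pred, pos_label=1))  # ~0.571 with 1 as positive
print(f1_score(y_true, y_pred, pos_label=0))  # ~0.769 with classes swapped
```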
Balanced Accuracy

Balanced Accuracy = (Sensitivity + Specificity) / 2

Balanced Accuracy accounts for the positive and negative classes independently
using Sensitivity and Specificity, respectively. The metric partially solves
the Accuracy Paradox by computing the error types independently, and solves the
true-negative absence problem of the Fβ-Score by including Specificity.

Relative differences in error types: consider two models on a heavily
imbalanced dataset:

Model A: TP = 9,    FP = 1000, FN = 1,    TN = 9000
Model B: TP = 9000, FP = 1,    FN = 1000, TN = 9

Balanced Accuracy is commonly robust against imbalanced datasets, but that does
not apply to the cases illustrated above. Each model performs poorly at
predicting one of the two classes (Model A's positive predictions and Model B's
negative predictions are almost always wrong), and is therefore unreliable for
that class. Yet Balanced Accuracy is 90% for both, which is misleading, as the
check below confirms.
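A numeric check of the two models above (a minimal sketch; the helper function
is my own):

```python
def balanced_accuracy(tp, fp, tn, fn):
    return (tp / (tp + fn) + tn / (tn + fp)) / 2

# Model A: 0.9 Balanced Accuracy, yet precision is only 9/1009 (~0.9%).
print(balanced_accuracy(tp=9, fp=1000, tn=9000, fn=1))
# Model B: 0.9 Balanced Accuracy, yet NPV is only 9/1009 (~0.9%).
print(balanced_accuracy(tp=9000, fp=1, tn=9, fn=1000))
```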
Matthews Correlation Coefficient (MCC)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

MCC calculates the correlation between the actual and predicted labels,
producing a number between -1 and 1. Hence, it only produces a good score if
the model is accurate in all four confusion-matrix components. MCC is the most
robust of these metrics against imbalanced-dataset issues and random
classifications.

MCC is undefined whenever a full row or column of the confusion matrix is zero.
The details are outside the scope of this document; note that the issue is
commonly handled by substituting an arbitrarily small value for the zeros.
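Evaluating MCC on the same two models from the Balanced Accuracy page exposes
the weakness that Balanced Accuracy hid (a minimal sketch; for simplicity the
guard below returns 0 for the undefined case rather than substituting a small
value as described above):

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # undefined-case guard

print(mcc(tp=9, fp=1000, tn=9000, fn=1))    # ~0.08 for Model A
print(mcc(tp=9000, fp=1, tn=9, fn=1000))    # ~0.08 for Model B
```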
Conclusion

We have gone through all confusion-matrix components, discussed some of the
most popular metrics and how easily they can be "hacked", presented
alternatives that overcome these problems through more comprehensive metrics,
and noted each one's limitations. The key takeaways are:

Recognize the hacks against granular metrics, as you might fall into one
unintentionally. Although these metrics are not used alone in reporting, they
are heavily used in development settings to debug a classifier's behavior.

Know the limitations of the popular classification evaluation metrics used in
reporting, so that you are equipped with enough acumen to decide whether you
have obtained the optimal classifier.

Never be persuaded by the phrase "THE BEST" in the context of machine learning,
especially for evaluation metrics. Every metric covered in this document
(including MCC) is the best metric only when it best fits the project's
objective.
References

Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation
coefficient (MCC) over F1 score and accuracy in binary classification
evaluation. BMC Genomics, 21(1), 1-13.

Chicco, D., Tötsch, N., & Jurman, G. (2021). The Matthews correlation
coefficient (MCC) is more reliable than balanced accuracy, bookmaker
informedness, and markedness in two-class confusion matrix evaluation. BioData
Mining, 14(1), 1-22.

Lalkhen, A. G., & McCluskey, A. (2008). Clinical tests: sensitivity and
specificity. Continuing Education in Anaesthesia, Critical Care & Pain, 8(6),
221-223.

Hull, D. (1993, July). Using statistical testing in the evaluation of retrieval
experiments. In Proceedings of the 16th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 329-338).

Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data
classification evaluations. International Journal of Data Mining & Knowledge
Management Process, 5(2), 1.

Jurman, G., Riccadonna, S., & Furlanello, C. (2012). A comparison of MCC and
CEN error measures in multi-class prediction.

Chicco, D. (2017). Ten quick tips for machine learning in computational
biology. BioData Mining, 10(1), 1-17.
Author

Yousef Alghofaili
AI solutions architect and researcher who studied at KFUPM and the Georgia
Institute of Technology. He has worked with multiple research groups from
KAUST, KSU, and KFUPM, and built and managed the noura.ai data science R&D team
as its AI Director. He is an official author at the Towards Data Science
publication and the developer of the KMeansInterp algorithm.
https://www.linkedin.com/in/yousefgh/

Reviewer

Dr. Motaz Alfarraj
Assistant Professor at KFUPM and Acting Director of the SDAIA-KFUPM Joint
Research Center for Artificial Intelligence (JRC-AI). He received his
Bachelor's degree from KFUPM and earned his Master's and Ph.D. degrees in
Electrical Engineering (Digital Image Processing and Computer Vision) from the
Georgia Institute of Technology. He has contributed to ML research as an author
of many research papers and has won many awards in his field.
https://www.linkedin.com/in/motazalfarraj/

THANK YOU!
For any feedback, issues, or inquiries, contact yousefalghofaili@gmail.com
Extra: Fβ Score

Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

The Fβ score is the generalized form of the F1 score (F1 is Fβ with β = 1); the
difference lies in the β factor. The β factor skews the final score towards
favoring recall β times as much as precision, enabling us to weigh the risk of
false negatives (Type II errors) and false positives (Type I errors)
differently:

β > 1: precision is β times less important than recall.
β < 1: precision is β times more important than recall.
β = 1: balanced (the F1-Score).

The Fβ score was originally developed to evaluate Information Retrieval (IR)
systems such as the Google search engine. When you search for a webpage but it
does not appear, you are experiencing the engine's low recall; when the results
you see are completely irrelevant, you are experiencing its low precision.
Hence, search engines tune the β factor to optimize the user experience by
favoring one of those two experiences over the other.
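A minimal sketch of how β shifts the score, with made-up precision and recall
values:

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.4           # hypothetical precision and recall
print(fbeta(p, r, 1.0))   # F1   ~0.533 (balanced)
print(fbeta(p, r, 2.0))   # F2   ~0.444 (recall-weighted, punished by low recall)
print(fbeta(p, r, 0.5))   # F0.5 ~0.667 (precision-weighted)
```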