Evaluation Metrics
CS229
Anand Avati
Topics
● Why?
● Binary classifiers
○ Rank view, Thresholding
● Metrics
○ Confusion Matrix
○ Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity, F-score
○ Summary metrics: AU-ROC, AU-PRC, Log-loss.
● Choosing Metrics
● Class Imbalance
○ Failure scenarios for each metric
● Multi-class
Why are metrics important?
- Training objective (cost function) is only a proxy for real world objective.
- Metrics help capture a business goal into a quantitative target (not all errors
are equal).
- Helps organize ML team effort towards that target.
- Generally in the form of improving that metric on the dev set.
- Useful to quantify the “gap” between:
- Desired performance and baseline (estimate effort initially).
- Desired performance and current performance.
- Measure progress over time (No Free Lunch Theorem).
- Useful for lower level tasks and debugging (like diagnosing bias vs variance).
- Ideally the training objective should be the metric itself, but that is not always
possible. Still, metrics are useful and important for evaluation.
Binary Classification
● X is Input
● Y is binary Output (0/1)
● Model is ŷ = h(X)
● Two types of models
○ Models that output a categorical class directly (K Nearest neighbor, Decision tree)
○ Models that output a real valued score (SVM, Logistic Regression)
■ Score could be margin (SVM), probability (LR, NN)
■ Need to pick a threshold
■ We focus on this type (the other type can be interpreted as an instance)
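As a minimal sketch (the scores and thresholds below are illustrative, not from the lecture), thresholding a score-based model to obtain a classifier looks like:

```python
def predict(scores, threshold=0.5):
    """Turn real-valued scores into 0/1 class predictions at a threshold."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.9, 0.4, 0.65, 0.1]
print(predict(scores))       # [1, 0, 1, 0]
print(predict(scores, 0.7))  # [1, 0, 0, 0] -- a stricter threshold predicts fewer positives
```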
Score based models
(Figure: positive and negative labelled examples ranked on a score line from Score = 0 to Score = 1.)
Prevalence = #positives / (#positives + #negatives)
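The prevalence ratio can be computed directly from the label vector; a tiny sketch with an illustrative label list:

```python
def prevalence(labels):
    """Prevalence = #positives / (#positives + #negatives) for 0/1 labels."""
    return sum(labels) / len(labels)

print(prevalence([1, 0, 1, 1, 0]))  # 0.6
```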
Score based models : Classifier
(Figure: at threshold Th = 0.5, examples scoring above Th fall in Predict Positive, those below in Predict Negative; columns are Label positive and Label negative.)
Point metrics: Confusion Matrix
Confusion matrix at Th = 0.5:

                     Label Positive   Label Negative
  Predict Positive         9                 2
  Predict Negative         1                 8
Point metrics: True Positives
At Th = 0.5: TP = 9 (label positive, predicted positive).
Point metrics: True Negatives
At Th = 0.5: TN = 8 (label negative, predicted negative).
Point metrics: False Positives
At Th = 0.5: FP = 2 (label negative, predicted positive).
Point metrics: False Negatives
At Th = 0.5: FN = 1 (label positive, predicted negative).
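The four confusion-matrix counts can be tallied from labels and predictions; a sketch (the toy vectors are illustrative, not the slide's 20-example data set):

```python
def confusion_counts(labels, preds):
    """Return (TP, TN, FP, FN) for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0], [1, 0, 1, 0]))  # (1, 1, 1, 1)
```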
FP and FN are also called Type I and Type II errors, respectively.
Point metrics: Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (9 + 8) / 20 = 0.85
Point metrics: Precision
Precision = TP / (TP + FP) = 9 / 11 ≈ 0.818
Point metrics: Positive Recall (Sensitivity)
Recall = TP / (TP + FN) = 9 / 10 = 0.9
Point metrics: Negative Recall (Specificity)
Specificity = TN / (TN + FP) = 8 / 10 = 0.8
Point metrics: F score
F1 = 2 · Precision · Recall / (Precision + Recall) ≈ 0.857
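All of the point metrics above follow from the four counts; a sketch that reproduces the slide's running example (TP = 9, TN = 8, FP = 2, FN = 1):

```python
def point_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity, F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

acc, pr, rec, spec, f1 = point_metrics(tp=9, tn=8, fp=2, fn=1)
print(f"{acc:.2f} {pr:.3f} {rec:.1f} {spec:.1f} {f1:.3f}")  # 0.85 0.818 0.9 0.8 0.857
```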
Point metrics: Changing threshold
Raising the threshold to Th = 0.6 gives TP = 7, TN = 8, FP = 2, FN = 3:
Accuracy = 0.75, Precision ≈ 0.778, Recall = 0.7, Specificity = 0.8, F1 ≈ 0.737
Threshold TP TN FP FN Accuracy Precision Recall Specificity F1
1.00 0 10 0 10 0.50 1 0 1 0
0.95 1 10 0 9 0.55 1 0.1 1 0.182
0.90 2 10 0 8 0.60 1 0.2 1 0.333
0.85 2 9 1 8 0.55 0.667 0.2 0.9 0.308
0.80 3 9 1 7 0.60 0.750 0.3 0.9 0.429
0.75 4 9 1 6 0.65 0.800 0.4 0.9 0.533
0.70 5 9 1 5 0.70 0.833 0.5 0.9 0.625
0.65 5 8 2 5 0.65 0.714 0.5 0.8 0.588
0.60 6 8 2 4 0.70 0.750 0.6 0.8 0.667
0.55 7 8 2 3 0.75 0.778 0.7 0.8 0.737
0.50 8 8 2 2 0.80 0.800 0.8 0.8 0.800
0.45 9 8 2 1 0.85 0.818 0.9 0.8 0.857
0.40 9 7 3 1 0.80 0.750 0.9 0.7 0.818
0.35 9 6 4 1 0.75 0.692 0.9 0.6 0.783
0.30 9 5 5 1 0.70 0.643 0.9 0.5 0.750
0.25 9 4 6 1 0.65 0.600 0.9 0.4 0.720
0.20 9 3 7 1 0.60 0.562 0.9 0.3 0.692
0.15 9 2 8 1 0.55 0.529 0.9 0.2 0.667
0.10 9 1 9 1 0.50 0.500 0.9 0.1 0.643
0.05 10 1 9 0 0.55 0.526 1 0.1 0.690
0.00 10 0 10 0 0.50 0.500 1 0 0.667
Threshold Scanning
(Figure: scanning the threshold along the score line from Threshold = 1.00 down to Threshold = 0.00; each threshold is one classifier, one row of the table above.)
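Rows like those in the threshold table can be generated by a simple scan; a sketch on a tiny illustrative data set (not the lecture's 20 examples):

```python
def threshold_scan(labels, scores, thresholds):
    """For each threshold, count (TP, TN, FP, FN) of the resulting classifier."""
    rows = []
    for th in thresholds:
        preds = [1 if s >= th else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
        rows.append((th, tp, tn, fp, fn))
    return rows

for row in threshold_scan([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2], [0.8, 0.5]):
    print(row)  # (0.8, 1, 2, 0, 1) then (0.5, 2, 1, 1, 0)
```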
How to summarize the trade-off?
{Precision, Specificity} vs Recall/Sensitivity
Summary metrics: ROC (rotated version)
(Figure: ROC curve traced by scanning the threshold along the score line; it plots Sensitivity against 1 − Specificity.)
Summary metrics: PRC
(Figure: Precision-Recall curve traced by the same threshold scan; it plots Precision against Recall.)
Summary metrics: Log-Loss motivation
(Figure: Model A and Model B scoring the same data set, each shown on a score line from Score = 0 to Score = 1.)
Two models scoring the same data set. Is one of them better than the other?
Summary metrics: Log-Loss
● These two model outputs have the same ranking, and
therefore the same AU-ROC, AU-PRC, and accuracy!
● Gain = the probability the model assigns to the true label: p if y = 1, else 1 − p.
Log-loss = −E[log(gain)].
● Log loss rewards confident correct answers and
heavily penalizes confident wrong answers.
● exp(−E[log-loss]) is the geometric mean of the gains, in [0, 1].
● One perfectly confident wrong prediction is fatal (gain = 0, log-loss = ∞).
● Gaining popularity as an evaluation metric (e.g., in Kaggle competitions).
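A sketch of log-loss on two hypothetical models with identical ranking (hence identical AU-ROC and AU-PRC) but different confidence; the data is illustrative:

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean of -log(gain), where gain is the probability given to the true label."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so a hard 0/1 mistake is not infinite
        gain = p if y == 1 else 1 - p
        total += -math.log(gain)
    return total / len(labels)

labels  = [1, 1, 0, 0]
model_a = [0.9, 0.8, 0.2, 0.1]    # confident and correct
model_b = [0.6, 0.55, 0.45, 0.4]  # same ordering, hedged
print(log_loss(labels, model_a) < log_loss(labels, model_b))  # True
```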
Calibration
Logistic (th=0.5): Precision 0.872, Recall 0.851, F1 0.862, Brier 0.099
SVC (th=0.5):      Precision 0.872, Recall 0.852, F1 0.862, Brier 0.163
Brier = MSE(p, y). The point metrics are nearly identical, but the logistic model's
probabilities are better calibrated (lower Brier score).
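Brier score is just mean squared error between predicted probabilities and the 0/1 labels; a small sketch:

```python
def brier_score(labels, probs):
    """Brier = MSE between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

print(brier_score([1, 0], [1.0, 0.0]))  # 0.0  (confident and correct)
print(brier_score([1, 0], [0.5, 0.5]))  # 0.25 (maximally hedged)
```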
Unsupervised Learning
● logP(x) is a measure of fit in Probabilistic models (GMM, Factor Analysis)
○ High logP(x) on training set, but low logP(x) on test set is a measure of overfitting
○ Raw value of logP(x) hard to interpret in isolation
● K-means is trickier (because of fixed covariance assumption)
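As a 1-D stand-in for logP(x) (a hypothetical single-Gaussian model, far simpler than the GMM or Factor Analysis mentioned above), average log-density separates data the model fits from data it does not:

```python
import math

def avg_logp(xs, mu, sigma):
    """Average log-density of xs under a Normal(mu, sigma^2) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs) / len(xs)

train = [0.1, -0.2, 0.05]   # near the model's mean: high logP(x)
test  = [3.0, 3.5]          # far from it: much lower logP(x)
print(avg_logp(train, 0.0, 0.2) > avg_logp(test, 0.0, 0.2))  # True
```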
Class Imbalance: Problems
Symptom: Prevalence < 5% (no strict definition)
Metrics: may not be meaningful.
Learning: may not focus on minority class examples at all (majority class can overwhelm
logistic regression, to a lesser extent SVM)
Class Imbalance: Metrics (pathological cases)
Accuracy: Blindly predict majority class.
Log-Loss: Majority class can dominate the loss.
AUROC: Easy to keep AUC high by scoring most negatives very low.
AUPRC: Somewhat more robust than AUROC. But other challenges.
- What kind of interpolation? AUCNPR?
In general: Accuracy << AUROC << AUPRC
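The accuracy pathology is easy to demonstrate: at 2% prevalence (illustrative numbers), blindly predicting the majority class already looks excellent:

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)

labels   = [1] * 2 + [0] * 98   # 2% prevalence
majority = [0] * 100            # blindly predict the majority class
print(accuracy(labels, majority))  # 0.98, while recall on the minority class is 0
```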
Multi-class (few remarks)
● Confusion matrix will be NxN (still want heavy diagonals, light off-diagonals)
● Most metrics (except accuracy) are generally analysed as multiple one-vs-rest problems.
● Multiclass variants of AUROC and AUPRC (micro vs macro averaging)
● Class imbalance is common (both in absolute, and relative sense)
● Cost sensitive learning techniques (also helps in Binary Imbalance)
○ Assign $$ value for each block in the confusion matrix, and incorporate those into the loss
function.
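Cost-sensitive evaluation can be sketched as an elementwise product of confusion-matrix counts and per-cell dollar costs; the cost values below are hypothetical:

```python
def total_cost(confusion, cost):
    """Sum of count * cost over every cell of an NxN confusion matrix."""
    n = len(confusion)
    return sum(confusion[i][j] * cost[i][j] for i in range(n) for j in range(n))

# Rows: predict positive/negative; columns: label positive/negative.
confusion = [[9, 2], [1, 8]]
cost      = [[0, 1], [10, 0]]   # an FP costs $1, an FN costs $10 (assumed values)
print(total_cost(confusion, cost))  # 2*1 + 1*10 = 12
```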
Choosing Metrics
Some common patterns:
- High precision is a hard constraint; optimize recall (e.g. search engine results,
grammar correction) -- intolerant to FP. Metric: Recall at Precision = XX%.
- High recall is a hard constraint; optimize precision (e.g. medical diagnosis).
Intolerant to FN. Metric: Precision at Recall = 100%.
- Capacity constrained (by K). Metric: Precision in top-K.
- Etc.
- Choose the operating threshold based on the above criteria.
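The "Recall at Precision = XX%" pattern can be sketched as a scan over candidate thresholds; the data here is a toy illustration:

```python
def recall_at_precision(labels, scores, min_precision):
    """Best recall over all thresholds whose precision meets min_precision."""
    best = 0.0
    for th in sorted(set(scores)):
        preds = [1 if s >= th else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
        if tp > 0 and tp / (tp + fp) >= min_precision:
            best = max(best, tp / (tp + fn))
    return best

# At precision 100%, only the top-scoring positive survives the threshold:
print(recall_at_precision([1, 1, 0], [0.9, 0.6, 0.7], 1.0))  # 0.5
```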
Thank You!


Editor's Notes

  1. Only math necessary: Counting and ratios!
  2. Dots are examples: Green = positive label, gray = negative label. Consider score to be probability output of Logistic Regression. Position of example on scoring line based on model output. For most of the metrics, only the relative rank / ordering of positive/negative matter. Prevalence decides class imbalance. If too many examples, plot class-wise histogram instead. Rank view useful to start error analysis (start with the topmost negative example, or bottom-most positive example).
  3. Only after picking a threshold do we have a classifier. Once we have a classifier, we can obtain all the point metrics of that classifier.
  4. All point metrics can be derived from the confusion matrix. The confusion matrix captures all the information about a classifier's performance, but is not a scalar! Properties: the total sum is fixed (population). Column sums are fixed (class-wise population). The quality of the model & the selected threshold decide how the columns are split into rows. We want the diagonals to be “heavy”, the off-diagonals to be “light”.
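A minimal sketch of deriving the four confusion-matrix cells from scores and a threshold (toy data; the convention here is that score >= threshold predicts positive):

```python
def confusion_counts(scores, labels, threshold):
    # Score >= threshold -> predict positive; tally the four cells.
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

scores = [0.9, 0.8, 0.4, 0.6, 0.3]
labels = [1, 1, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(scores, labels, 0.5)
# Column sums are fixed: tp + fn = #positives, fp + tn = #negatives.
```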
  5. TP = Green examples correctly identified as green.
  6. TN = Gray examples correctly identified as gray. “Trueness” is along the diagonal.
  7. FP = Gray examples falsely identified as green.
  8. FN = Green examples falsely identified as gray.
  9. Accuracy = what fraction of all predictions did we get right? Improving accuracy is equivalent to having a 0-1 Loss function.
  10. Precision = quality of positive predictions. Also called PPV (Positive Predictive Value).
  11. Recall (sensitivity) - What fraction of all greens did we pick out? Terminology from lab tests: how sensitive is the test in detecting disease? Somewhat related to precision (both recall and precision involve TP). Trivial 100% recall = pull everybody above the threshold. Trivial 100% precision = push everybody below the threshold except 1 green on top (hopefully no gray above it!). Striving for good precision with 100% recall = pulling up the lowest green as high as possible in the ranking. Striving for good recall with 100% precision = pushing down the topmost gray as low as possible in the ranking.
  12. Negative Recall (Specificity) - What fraction of all negatives in the population were we able to keep away? Note: Sensitivity and specificity operate in different sub-universes.
  13. Summarize precision and recall into a single score. F1-score: harmonic mean of precision and recall. G-score: geometric mean of precision and recall.
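Both summaries are one-liners; a sketch:

```python
import math

def f1_score(precision, recall):
    # Harmonic mean: strongly penalizes imbalance between the two.
    return 2 * precision * recall / (precision + recall)

def g_score(precision, recall):
    # Geometric mean: a milder summary of the same two numbers.
    return math.sqrt(precision * recall)
```

For example, at precision 0.5 and recall 1.0, F1 is about 0.667 while the G-score is about 0.707 — the harmonic mean is pulled harder toward the weaker of the two.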
  14. Changing the threshold can result in a new confusion matrix, and new values for some of the metrics. Many threshold values are redundant (between two consecutively ranked examples). Number of effective thresholds = M + 1 (# of examples + 1)
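One way to enumerate the effective thresholds — midpoints between consecutive distinct scores, plus one below the lowest and one above the highest (a sketch; ties between examples collapse into a single threshold):

```python
def effective_thresholds(scores):
    # One representative threshold per "gap" in the ranking: below the lowest
    # score, between each pair of consecutive distinct scores, above the highest.
    s = sorted(set(scores))
    mids = [(a + b) / 2 for a, b in zip(s, s[1:])]
    return [s[0] - 1.0] + mids + [s[-1] + 1.0]
```

With M distinct scores this yields M + 1 thresholds; every threshold in between produces an identical confusion matrix.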
  15. As we scan through all possible effective thresholds, we explore all the possible values the metrics can take on for the given model. Each row is specific to the threshold. Table is specific to the model (different model = different effective thresholds). Observations: Sensitivity monotonic, Specificity monotonic in opposite direction. Precision somewhat correlated with specificity, and not monotonic. Tradeoff between Sensitivity and Specificity, Precision and Recall.
  16. We may not want to pick a threshold up front, but still want to summarize the overall model performance.
  17. AUC = Area Under Curve. Also called C-Statistic (concordance score). Represents how well the results are ranked. If you pick a positive example at random, and a negative example at random, AUC = probability that the positive example is ranked above the negative example. A measure of quality of discrimination. Some people plot TPR (= Sensitivity) vs FPR (= 1 - Specificity). Thresholds are points on this curve. Each point represents one set of point metrics. Diagonal line = random guessing. Grid view (top left to bottom right, positive = move right, negative = move down).
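The concordance interpretation gives a direct, if O(P·N), way to compute AUC (a sketch; ties are counted as half):

```python
def auc_concordance(scores, labels):
    # AUC as P(random positive is scored above random negative); ties count 0.5.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly ranked toy set gives 1.0; one swapped pair out of four costs exactly 0.25.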
  18. Precision Recall curve represents a different tradeoff. Think search engine results, ranked. The right end of the curve cannot be lower than the prevalence. Area under PRC = Average Precision. Intuition: by randomly picking the threshold, what’s the expected precision? While specificity monotonically decreases with increasing recall (the number of correctly classified negative examples can only go down), precision can be “jagged”. Getting a run of positives in the sequence increases both precision and recall (hence the curve climbs slowly), but getting a run of negatives drops the precision without increasing recall. Hence, by definition, the curve has only slow climbs and vertical drops.
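One standard computation of Average Precision — the mean of the precision measured at each positive while scanning the ranking from the top — can be sketched as:

```python
def average_precision(scores, labels):
    # Walk the ranking from highest score down; record precision at each positive.
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, precisions = 0, []
    for i, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / i)
    return sum(precisions) / len(precisions)
```

A perfect ranking scores 1.0; interleaving a negative between the positives drops it, since that positive's precision is measured after a "vertical drop".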
  19. Somehow the second model feels “better”.
  20. The metrics so far fail to capture this. Accuracy optimizes for: model gets $1 for a correct answer, $0 for a wrong one. Log-loss: the model makes a bet of $0.xy that its prediction is correct. If correct, it earns $0.xy; else it earns $(1 - 0.xy). This incentive structure is well suited to produce calibrated models. The take-home is the geometric mean of the earned amounts, i.e. exp(-log-loss). The low bar for exp(-log-loss) is 0.5 (always betting 50/50). Log-loss on the validation set measures how much this “betting engine” earned.
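The betting intuition in code (toy probabilities; the per-example "earning" is the probability the model assigned to the true outcome, and the geometric mean of the earnings equals exp(-log-loss)):

```python
import math

def log_loss(probs, labels):
    # probs[i] = predicted P(y = 1); average negative log-likelihood.
    return -sum(math.log(p if y == 1 else 1 - p)
                for p, y in zip(probs, labels)) / len(labels)

probs = [0.9, 0.8, 0.3, 0.2]   # made-up predictions
labels = [1, 1, 0, 0]
ll = log_loss(probs, labels)
# Geometric mean of the earned amounts [0.9, 0.8, 0.7, 0.8]:
gm = math.exp(-ll)
# gm > 0.5 here, so this "betting engine" beats blind 50/50 betting.
```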
  21. View the calibration plot and histogram together. When the model is well calibrated, the histogram indicates the level of discrimination. If the model is not calibrated, the histogram may not mean much at all (see the SVC plot). Brier score is another measure of calibration. Both Brier score and log-loss are Proper Scoring Rules.
  22. For accuracy, the prevalence of the majority class should be the low bar! AUROC pathology at 10% prevalence: the top 10% of scores are all negatives, the next K are all positives, followed by the rest of the negatives. AUROC ≈ 0.9 even though the ranking is practically useless.
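That AUROC pathology, reconstructed on made-up data (100 examples, 10% prevalence, top 10 scores all negatives):

```python
# Hypothetical ranking: the 10 highest scores are all negatives, then the
# 10 positives, then the remaining 80 negatives.
labels = [0] * 10 + [1] * 10 + [0] * 80
scores = [1 - i / 100 for i in range(100)]   # strictly decreasing

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))
top10_precision = sum(labels[:10]) / 10
# Each positive outranks 80 of the 90 negatives, so auc = 8/9 (about 0.89),
# while precision among the top 10 scores is exactly 0.
```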