SlideShare a Scribd company logo
Imbalanced Class Problem
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
Data Mining
Classification: Alternative Techniques
2/15/2021 Introduction to Data Mining, 2nd Edition 2
Class Imbalance Problem
 Lots of classification problems where the classes
are skewed (more records from one class than
another)
– Credit card fraud
– Intrusion detection
– Defective products in manufacturing assembly line
– COVID-19 test results on a random sample
 Key Challenge:
– Evaluation measures such as accuracy are not well-
suited for imbalanced class
2/15/2021 Introduction to Data Mining, 2nd Edition 3
Confusion Matrix
 Confusion Matrix:
PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes a b
Class=No c d
a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
2/15/2021 Introduction to Data Mining, 2nd Edition 4
Accuracy
 Most widely-used metric:
PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes a
(TP)
b
(FN)
Class=No c
(FP)
d
(TN)
FN
FP
TN
TP
TN
TP
d
c
b
a
d
a










Accuracy
2/15/2021 Introduction to Data Mining, 2nd Edition 5
Problem with Accuracy
 Consider a 2-class problem
– Number of Class NO examples = 990
– Number of Class YES examples = 10
 If a model predicts everything to be class NO, accuracy is
990/1000 = 99 %
– This is misleading because this trivial model does not detect any class
YES example
– Detecting the rare class is usually more interesting (e.g., frauds,
intrusions, defects, etc)
PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 0 10
Class=No 0 990
2/15/2021 Introduction to Data Mining, 2nd Edition 6
Which model is better?
PREDICTED
ACTUAL
Class=Yes Class=No
Class=Yes 0 10
Class=No 0 990
PREDICTED
ACTUAL
Class=Yes Class=No
Class=Yes 10 0
Class=No 500 490
A
B
Accuracy: 99%
Accuracy: 50%
2/15/2021 Introduction to Data Mining, 2nd Edition 7
Which model is better?
PREDICTED
ACTUAL
Class=Yes Class=No
Class=Yes 5 5
Class=No 0 990
PREDICTED
ACTUAL
Class=Yes Class=No
Class=Yes 10 0
Class=No 500 490
A
B
2/15/2021 Introduction to Data Mining, 2nd Edition 8
Alternative Measures
c
b
a
a
p
r
rp
b
a
a
c
a
a









2
2
2
(F)
measure
-
F
(r)
Recall
(p)
Precision
PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes a b
Class=No c d
2/15/2021 Introduction to Data Mining, 2nd Edition 9
Alternative Measures
99
.
0
1000
990
Accuracy
62
.
0
5
.
0
1
5
.
0
*
1
*
2
(F)
measure
-
F
1
0
10
10
(r)
Recall
5
.
0
10
10
10
(p)
Precision











PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 10 0
Class=No 10 980
2/15/2021 Introduction to Data Mining, 2nd Edition 10
Alternative Measures
99
.
0
1000
990
Accuracy
62
.
0
5
.
0
1
5
.
0
*
1
*
2
(F)
measure
-
F
1
0
10
10
(r)
Recall
5
.
0
10
10
10
(p)
Precision











PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 10 0
Class=No 10 980
PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 1 9
Class=No 0 990
991
.
0
1000
991
Accuracy
18
.
0
1
.
0
1
1
*
1
.
0
*
2
(F)
measure
-
F
1
.
0
9
1
1
(r)
Recall
1
0
1
1
(p)
Precision











2/15/2021 Introduction to Data Mining, 2nd Edition 11
Which of these classifiers is better?
8
.
0
Accuracy
8
.
0
(F)
measure
-
F
8
.
0
(r)
Recall
8
.
0
(p)
Precision




PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 40 10
Class=No 10 40
PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 40 10
Class=No 1000 4000
8
.
0
~
Accuracy
08
.
0
~
(F)
measure
-
F
8
.
0
(r)
Recall
04
.
0
~
(p)
Precision




A
B
2/15/2021 Introduction to Data Mining, 2nd Edition 12
Measures of Classification Performance
PREDICTED CLASS
ACTUAL
CLASS
Yes No
Yes TP FN
No FP TN
 is the probability that we reject
the null hypothesis when it is
true. This is a Type I error or a
false positive (FP).
 is the probability that we
accept the null hypothesis when
it is false. This is a Type II error
or a false negative (FN).
2/15/2021 Introduction to Data Mining, 2nd Edition 13
Alternative Measures
Precision (p) = 0.8
TPR = Recall (r) = 0.8
FPR = 0.2
F−measure (F) = 0.8
Accuracy = 0.8
A PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 40 10
Class=No 10 40
B PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 40 10
Class=No 1000 4000
Precision (p) = 0.038
TPR = Recall (r) = 0.8
FPR = 0.2
F−measure (F) = 0.07
Accuracy = 0.8
TPR
FPR
= 4
TPR
FPR
= 4
2/15/2021 Introduction to Data Mining, 2nd Edition 14
Which of these classifiers is better?
2
.
0
FPR
2
.
0
(r)
Recall
TPR
5
.
0
(p)
Precision




A PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 10 40
Class=No 10 40
B PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 25 25
Class=No 25 25
C PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 40 10
Class=No 40 10
5
.
0
FPR
5
.
0
(r)
Recall
TPR
5
.
0
(p)
Precision




8
.
0
FPR
8
.
0
(r)
Recall
TPR
5
.
0
(p)
Precision




0.28
measure
F 

0.5
measure
F 

0.61
measure
F 

2/15/2021 Introduction to Data Mining, 2nd Edition 15
ROC (Receiver Operating Characteristic)
 A graphical approach for displaying trade-off
between detection rate and false alarm rate
 Developed in 1950s for signal detection theory to
analyze noisy signals
 ROC curve plots TPR against FPR
– Performance of a model represented as a point in an
ROC curve
2/15/2021 Introduction to Data Mining, 2nd Edition 16
ROC Curve
(TPR,FPR):
 (0,0): declare everything
to be negative class
 (1,1): declare everything
to be positive class
 (1,0): ideal
 Diagonal line:
– Random guessing
– Below diagonal line:
 prediction is opposite
of the true class
2/15/2021 Introduction to Data Mining, 2nd Edition 17
ROC (Receiver Operating Characteristic)
 To draw ROC curve, classifier must produce
continuous-valued output
– Outputs are used to rank test records, from the most likely
positive class record to the least likely positive class record
– By using different thresholds on this value, we can create
different variations of the classifier with TPR/FPR tradeoffs
 Many classifiers produce only discrete outputs (i.e.,
predicted class)
– How to get continuous-valued outputs?
 Decision trees, rule-based classifiers, neural networks,
Bayesian classifiers, k-nearest neighbors, SVM
2/15/2021 Introduction to Data Mining, 2nd Edition 18
Example: Decision Trees
x1 < 13.29 x2 < 17.35
x2 < 12.63
x1 < 6.56
x2 < 8.64
x2 < 1.38
x1 < 2.15
x1 < 7.24
x1 < 12.11
x1 < 18.88
0.107
0.143 0.669
0.164
0.059
0.071
0.727
0.271
0.654 0
0.220
x1 < 13.29 x2 < 17.35
x2 < 12.63
x1 < 6.56
x2 < 8.64
x2 < 1.38
x1 < 2.15
x1 < 7.24
x1 < 12.11
x1 < 18.88
Decision Tree
Continuous-valued outputs
2/15/2021 Introduction to Data Mining, 2nd Edition 19
ROC Curve Example
x1 < 13.29 x2 < 17.35
x2 < 12.63
x1 < 6.56
x2 < 8.64
x2 < 1.38
x1 < 2.15
x1 < 7.24
x1 < 12.11
x1 < 18.88
0.107
0.143 0.669
0.164
0.059
0.071
0.727
0.271
0.654 0
0.220
2/15/2021 Introduction to Data Mining, 2nd Edition 20
ROC Curve Example
At threshold t:
TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88
- 1-dimensional data set containing 2 classes (positive and negative)
- Any points located at x > t is classified as positive
2/15/2021 Introduction to Data Mining, 2nd Edition 21
How to Construct an ROC curve
Instance Score True Class
1 0.95 +
2 0.93 +
3 0.87 -
4 0.85 -
5 0.85 -
6 0.85 +
7 0.76 -
8 0.53 +
9 0.43 -
10 0.25 +
• Use a classifier that produces a
continuous-valued score for
each instance
• The more likely it is for the
instance to be in the + class, the
higher the score
• Sort the instances in decreasing
order according to the score
• Apply a threshold at each unique
value of the score
• Count the number of TP, FP,
TN, FN at each threshold
• TPR = TP/(TP+FN)
• FPR = FP/(FP + TN)
2/15/2021 Introduction to Data Mining, 2nd Edition 22
How to construct an ROC curve
Class + - + - - - + - + +
P
0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00
TP 5 4 4 3 3 3 3 2 2 1 0
FP 5 5 4 4 3 2 1 1 0 0 0
TN 0 0 1 1 2 3 4 4 5 5 5
FN 0 1 1 2 2 2 2 3 3 4 5
TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0
FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0
Threshold >=
ROC Curve:
2/15/2021 Introduction to Data Mining, 2nd Edition 23
Using ROC for Model Comparison
 No model consistently
outperforms the other
 M1 is better for
small FPR
 M2 is better for
large FPR
 Area Under the ROC
curve (AUC)
 Ideal:
 Area = 1
 Random guess:
 Area = 0.5
2/15/2021 Introduction to Data Mining, 2nd Edition 24
Dealing with Imbalanced Classes - Summary
 Many measures exists, but none of them may be ideal in
all situations
– Random classifiers can have high value for many of these measures
– TPR/FPR provides important information but may not be sufficient by
itself in many practical scenarios
– Given two classifiers, sometimes you can tell that one of them is
strictly better than the other
C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or same
TPR and better FPR, and vice versa)
– Even if C1 is strictly better than C2, C1’s F-value can be worse than
C2’s if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the scenario
at hand (class imbalance, importance of TP vs FP, cost/time tradeoffs)
2/15/2021 Introduction to Data Mining, 2nd Edition 25
Which Classifer is better?
01
.
0
FPR
5
.
0
(r)
Recall
TPR
98
.
0
(p)
Precision




T1 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 50 50
Class=No 1 99
T2 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 99 1
Class=No 10 90
T3 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 99 1
Class=No 1 99
1
.
0
FPR
99
.
0
(r)
Recall
TPR
9
.
0
(p)
Precision




01
.
0
FPR
99
.
0
(r)
Recall
TPR
99
.
0
(p)
Precision




0.66
measure
F
50
TPR/FPR



0.94
measure
F
9.9
TPR/FPR



0.99
measure
F
99
TPR/FPR



2/15/2021 Introduction to Data Mining, 2nd Edition 26
Which Classifer is better? Medium Skew case
01
.
0
FPR
5
.
0
(r)
Recall
TPR
83
.
0
(p)
Precision




T1 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 50 50
Class=No 10 990
T2 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 99 1
Class=No 100 900
T3 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 99 1
Class=No 10 990
1
.
0
FPR
99
.
0
(r)
Recall
TPR
5
.
0
(p)
Precision




01
.
0
FPR
99
.
0
(r)
Recall
TPR
9
.
0
(p)
Precision




0.62
measure
F
50
TPR/FPR



0.66
measure
F
9.9
TPR/FPR



0.94
measure
F
99
TPR/FPR



2/15/2021 Introduction to Data Mining, 2nd Edition 27
Which Classifer is better? High Skew case
01
.
0
FPR
5
.
0
(r)
Recall
TPR
3
.
0
(p)
Precision




T1 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 50 50
Class=No 100 9900
T2 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 99 1
Class=No 1000 9000
T3 PREDICTED CLASS
ACTUAL
CLASS
Class=Yes Class=No
Class=Yes 99 1
Class=No 100 9900
1
.
0
FPR
99
.
0
(r)
Recall
TPR
09
.
0
(p)
Precision




01
.
0
FPR
99
.
0
(r)
Recall
TPR
5
.
0
(p)
Precision




0.375
measure
F
50
TPR/FPR



0.165
measure
F
9.9
TPR/FPR



0.66
measure
F
99
TPR/FPR



2/15/2021 Introduction to Data Mining, 2nd Edition 28
Building Classifiers with Imbalanced Training Set
 Modify the distribution of training data so that rare
class is well-represented in training set
– Undersample the majority class
– Oversample the rare class

More Related Content

Similar to chap4_imbalanced_classes.pptx

chap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptxchap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptx
GautamDematti1
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptx
shuchismitjha2
 
Guide
GuideGuide
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Thomas Ploetz
 
Detection focal loss 딥러닝 논문읽기 모임 발표자료
Detection focal loss 딥러닝 논문읽기 모임 발표자료Detection focal loss 딥러닝 논문읽기 모임 발표자료
Detection focal loss 딥러닝 논문읽기 모임 발표자료
taeseon ryu
 
Assessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's GuideAssessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's Guide
Megan Verbakel
 
Data Science Cheatsheet.pdf
Data Science Cheatsheet.pdfData Science Cheatsheet.pdf
Data Science Cheatsheet.pdf
qawali1
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
Collin Bennett
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
datamining-lect11.pptx
datamining-lect11.pptxdatamining-lect11.pptx
datamining-lect11.pptx
RithikRaj25
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
Dr.Pooja Jain
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLT
IRJET Journal
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
arogozhnikov
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
butest
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf
AlexanderLerch4
 
Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]
Dongmin Choi
 
A hybrid sine cosine optimization algorithm for solving global optimization p...
A hybrid sine cosine optimization algorithm for solving global optimization p...A hybrid sine cosine optimization algorithm for solving global optimization p...
A hybrid sine cosine optimization algorithm for solving global optimization p...
Aboul Ella Hassanien
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
arogozhnikov
 
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

Maxim Kazantsev
 
Comparison GUM versus GUM+1
Comparison GUM  versus GUM+1Comparison GUM  versus GUM+1
Comparison GUM versus GUM+1
Maurice Maeck
 

Similar to chap4_imbalanced_classes.pptx (20)

chap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptxchap6_advanced_association_analysis.pptx
chap6_advanced_association_analysis.pptx
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptx
 
Guide
GuideGuide
Guide
 
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Detection focal loss 딥러닝 논문읽기 모임 발표자료
Detection focal loss 딥러닝 논문읽기 모임 발표자료Detection focal loss 딥러닝 논문읽기 모임 발표자료
Detection focal loss 딥러닝 논문읽기 모임 발표자료
 
Assessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's GuideAssessing Model Performance - Beginner's Guide
Assessing Model Performance - Beginner's Guide
 
Data Science Cheatsheet.pdf
Data Science Cheatsheet.pdfData Science Cheatsheet.pdf
Data Science Cheatsheet.pdf
 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
datamining-lect11.pptx
datamining-lect11.pptxdatamining-lect11.pptx
datamining-lect11.pptx
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLT
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf06-00-ACA-Evaluation.pdf
06-00-ACA-Evaluation.pdf
 
Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]Review: Incremental Few-shot Instance Segmentation [CDM]
Review: Incremental Few-shot Instance Segmentation [CDM]
 
A hybrid sine cosine optimization algorithm for solving global optimization p...
A hybrid sine cosine optimization algorithm for solving global optimization p...A hybrid sine cosine optimization algorithm for solving global optimization p...
A hybrid sine cosine optimization algorithm for solving global optimization p...
 
Reweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEPReweighting and Boosting to uniforimty in HEP
Reweighting and Boosting to uniforimty in HEP
 
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS
FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

FUNCTION OF RIVAL SIMILARITY IN A COGNITIVE DATA ANALYSIS

 
Comparison GUM versus GUM+1
Comparison GUM  versus GUM+1Comparison GUM  versus GUM+1
Comparison GUM versus GUM+1
 

Recently uploaded

国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
zoowe
 
Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!
Toptal Tech
 
Discover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to IndiaDiscover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to India
davidjhones387
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
Danica Gill
 
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
xjq03c34
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
cuobya
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
zyfovom
 
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
ukwwuq
 
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
bseovas
 
办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理
办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理
办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理
uehowe
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
Gen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needsGen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needs
Laura Szabó
 
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
cuobya
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
Trish Parr
 
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
fovkoyb
 
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
ysasp1
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
vmemo1
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
Explore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories SecretlyExplore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories Secretly
Trending Blogers
 

Recently uploaded (20)

国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
 
Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!
 
Discover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to IndiaDiscover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to India
 
7 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 20247 Best Cloud Hosting Services to Try Out in 2024
7 Best Cloud Hosting Services to Try Out in 2024
 
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
 
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
制作原版1:1(Monash毕业证)莫纳什大学毕业证成绩单办理假
 
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
 
办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理
办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理
办理毕业证(UPenn毕业证)宾夕法尼亚大学毕业证成绩单快速办理
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
Gen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needsGen Z and the marketplaces - let's translate their needs
Gen Z and the marketplaces - let's translate their needs
 
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
 
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
 
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
成绩单ps(UST毕业证)圣托马斯大学毕业证成绩单快速办理
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 
Explore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories SecretlyExplore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories Secretly
 

chap4_imbalanced_classes.pptx

  • 1. Imbalanced Class Problem Introduction to Data Mining, 2nd Edition by Tan, Steinbach, Karpatne, Kumar Data Mining Classification: Alternative Techniques
  • 2. 2/15/2021 Introduction to Data Mining, 2nd Edition 2 Class Imbalance Problem  Lots of classification problems where the classes are skewed (more records from one class than another) – Credit card fraud – Intrusion detection – Defective products in manufacturing assembly line – COVID-19 test results on a random sample  Key Challenge: – Evaluation measures such as accuracy are not well- suited for imbalanced class
  • 3. 2/15/2021 Introduction to Data Mining, 2nd Edition 3 Confusion Matrix  Confusion Matrix: PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes a b Class=No c d a: TP (true positive) b: FN (false negative) c: FP (false positive) d: TN (true negative)
  • 4. 2/15/2021 Introduction to Data Mining, 2nd Edition 4 Accuracy  Most widely-used metric: PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes a (TP) b (FN) Class=No c (FP) d (TN) FN FP TN TP TN TP d c b a d a           Accuracy
  • 5. 2/15/2021 Introduction to Data Mining, 2nd Edition 5 Problem with Accuracy  Consider a 2-class problem – Number of Class NO examples = 990 – Number of Class YES examples = 10  If a model predicts everything to be class NO, accuracy is 990/1000 = 99 % – This is misleading because this trivial model does not detect any class YES example – Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc) PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 0 10 Class=No 0 990
  • 6. 2/15/2021 Introduction to Data Mining, 2nd Edition 6 Which model is better? PREDICTED ACTUAL Class=Yes Class=No Class=Yes 0 10 Class=No 0 990 PREDICTED ACTUAL Class=Yes Class=No Class=Yes 10 0 Class=No 500 490 A B Accuracy: 99% Accuracy: 50%
  • 7. 2/15/2021 Introduction to Data Mining, 2nd Edition 7 Which model is better? PREDICTED ACTUAL Class=Yes Class=No Class=Yes 5 5 Class=No 0 990 PREDICTED ACTUAL Class=Yes Class=No Class=Yes 10 0 Class=No 500 490 A B
  • 8. 2/15/2021 Introduction to Data Mining, 2nd Edition 8 Alternative Measures c b a a p r rp b a a c a a          2 2 2 (F) measure - F (r) Recall (p) Precision PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes a b Class=No c d
  • 9. 2/15/2021 Introduction to Data Mining, 2nd Edition 9 Alternative Measures 99 . 0 1000 990 Accuracy 62 . 0 5 . 0 1 5 . 0 * 1 * 2 (F) measure - F 1 0 10 10 (r) Recall 5 . 0 10 10 10 (p) Precision            PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 10 0 Class=No 10 980
  • 10. 2/15/2021 Introduction to Data Mining, 2nd Edition 10 Alternative Measures 99 . 0 1000 990 Accuracy 62 . 0 5 . 0 1 5 . 0 * 1 * 2 (F) measure - F 1 0 10 10 (r) Recall 5 . 0 10 10 10 (p) Precision            PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 10 0 Class=No 10 980 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 1 9 Class=No 0 990 991 . 0 1000 991 Accuracy 18 . 0 1 . 0 1 1 * 1 . 0 * 2 (F) measure - F 1 . 0 9 1 1 (r) Recall 1 0 1 1 (p) Precision           
  • 11. 2/15/2021 Introduction to Data Mining, 2nd Edition 11 Which of these classifiers is better? 8 . 0 Accuracy 8 . 0 (F) measure - F 8 . 0 (r) Recall 8 . 0 (p) Precision     PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 40 10 Class=No 10 40 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 40 10 Class=No 1000 4000 8 . 0 ~ Accuracy 08 . 0 ~ (F) measure - F 8 . 0 (r) Recall 04 . 0 ~ (p) Precision     A B
  • 12. 2/15/2021 Introduction to Data Mining, 2nd Edition 12 Measures of Classification Performance PREDICTED CLASS ACTUAL CLASS Yes No Yes TP FN No FP TN  is the probability that we reject the null hypothesis when it is true. This is a Type I error or a false positive (FP).  is the probability that we accept the null hypothesis when it is false. This is a Type II error or a false negative (FN).
  • 13. 2/15/2021 Introduction to Data Mining, 2nd Edition 13 Alternative Measures Precision (p) = 0.8 TPR = Recall (r) = 0.8 FPR = 0.2 F−measure (F) = 0.8 Accuracy = 0.8 A PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 40 10 Class=No 10 40 B PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 40 10 Class=No 1000 4000 Precision (p) = 0.038 TPR = Recall (r) = 0.8 FPR = 0.2 F−measure (F) = 0.07 Accuracy = 0.8 TPR FPR = 4 TPR FPR = 4
  • 14. 2/15/2021 Introduction to Data Mining, 2nd Edition 14 Which of these classifiers is better? 2 . 0 FPR 2 . 0 (r) Recall TPR 5 . 0 (p) Precision     A PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 10 40 Class=No 10 40 B PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 25 25 Class=No 25 25 C PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 40 10 Class=No 40 10 5 . 0 FPR 5 . 0 (r) Recall TPR 5 . 0 (p) Precision     8 . 0 FPR 8 . 0 (r) Recall TPR 5 . 0 (p) Precision     0.28 measure F   0.5 measure F   0.61 measure F  
  • 15. 2/15/2021 Introduction to Data Mining, 2nd Edition 15 ROC (Receiver Operating Characteristic)  A graphical approach for displaying trade-off between detection rate and false alarm rate  Developed in 1950s for signal detection theory to analyze noisy signals  ROC curve plots TPR against FPR – Performance of a model represented as a point in an ROC curve
  • 16. 2/15/2021 Introduction to Data Mining, 2nd Edition 16 ROC Curve (TPR,FPR):  (0,0): declare everything to be negative class  (1,1): declare everything to be positive class  (1,0): ideal  Diagonal line: – Random guessing – Below diagonal line:  prediction is opposite of the true class
  • 17. 2/15/2021 Introduction to Data Mining, 2nd Edition 17 ROC (Receiver Operating Characteristic)  To draw ROC curve, classifier must produce continuous-valued output – Outputs are used to rank test records, from the most likely positive class record to the least likely positive class record – By using different thresholds on this value, we can create different variations of the classifier with TPR/FPR tradeoffs  Many classifiers produce only discrete outputs (i.e., predicted class) – How to get continuous-valued outputs?  Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM
  • 18. 2/15/2021 Introduction to Data Mining, 2nd Edition 18 Example: Decision Trees x1 < 13.29 x2 < 17.35 x2 < 12.63 x1 < 6.56 x2 < 8.64 x2 < 1.38 x1 < 2.15 x1 < 7.24 x1 < 12.11 x1 < 18.88 0.107 0.143 0.669 0.164 0.059 0.071 0.727 0.271 0.654 0 0.220 x1 < 13.29 x2 < 17.35 x2 < 12.63 x1 < 6.56 x2 < 8.64 x2 < 1.38 x1 < 2.15 x1 < 7.24 x1 < 12.11 x1 < 18.88 Decision Tree Continuous-valued outputs
  • 19. 2/15/2021 Introduction to Data Mining, 2nd Edition 19 ROC Curve Example x1 < 13.29 x2 < 17.35 x2 < 12.63 x1 < 6.56 x2 < 8.64 x2 < 1.38 x1 < 2.15 x1 < 7.24 x1 < 12.11 x1 < 18.88 0.107 0.143 0.669 0.164 0.059 0.071 0.727 0.271 0.654 0 0.220
  • 20. 2/15/2021 Introduction to Data Mining, 2nd Edition 20 ROC Curve Example At threshold t: TPR=0.5, FNR=0.5, FPR=0.12, TNR=0.88 - 1-dimensional data set containing 2 classes (positive and negative) - Any points located at x > t is classified as positive
  • 21. 2/15/2021 Introduction to Data Mining, 2nd Edition 21 How to Construct an ROC curve Instance Score True Class 1 0.95 + 2 0.93 + 3 0.87 - 4 0.85 - 5 0.85 - 6 0.85 + 7 0.76 - 8 0.53 + 9 0.43 - 10 0.25 + • Use a classifier that produces a continuous-valued score for each instance • The more likely it is for the instance to be in the + class, the higher the score • Sort the instances in decreasing order according to the score • Apply a threshold at each unique value of the score • Count the number of TP, FP, TN, FN at each threshold • TPR = TP/(TP+FN) • FPR = FP/(FP + TN)
  • 22. 2/15/2021 Introduction to Data Mining, 2nd Edition 22 How to construct an ROC curve Class + - + - - - + - + + P 0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00 TP 5 4 4 3 3 3 3 2 2 1 0 FP 5 5 4 4 3 2 1 1 0 0 0 TN 0 0 1 1 2 3 4 4 5 5 5 FN 0 1 1 2 2 2 2 3 3 4 5 TPR 1 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.2 0 FPR 1 1 0.8 0.8 0.6 0.4 0.2 0.2 0 0 0 Threshold >= ROC Curve:
  • 23. 2/15/2021 Introduction to Data Mining, 2nd Edition 23 Using ROC for Model Comparison  No model consistently outperforms the other  M1 is better for small FPR  M2 is better for large FPR  Area Under the ROC curve (AUC)  Ideal:  Area = 1  Random guess:  Area = 0.5
  • 24. 2/15/2021 Introduction to Data Mining, 2nd Edition 24 Dealing with Imbalanced Classes - Summary  Many measures exists, but none of them may be ideal in all situations – Random classifiers can have high value for many of these measures – TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios – Given two classifiers, sometimes you can tell that one of them is strictly better than the other C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or same TPR and better FPR, and vice versa) – Even if C1 is strictly better than C2, C1’s F-value can be worse than C2’s if they are evaluated on data sets with different imbalances – Classifier C1 can be better or worse than C2 depending on the scenario at hand (class imbalance, importance of TP vs FP, cost/time tradeoffs)
  • 25. 2/15/2021 Introduction to Data Mining, 2nd Edition 25 Which Classifer is better? 01 . 0 FPR 5 . 0 (r) Recall TPR 98 . 0 (p) Precision     T1 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 50 50 Class=No 1 99 T2 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 99 1 Class=No 10 90 T3 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 99 1 Class=No 1 99 1 . 0 FPR 99 . 0 (r) Recall TPR 9 . 0 (p) Precision     01 . 0 FPR 99 . 0 (r) Recall TPR 99 . 0 (p) Precision     0.66 measure F 50 TPR/FPR    0.94 measure F 9.9 TPR/FPR    0.99 measure F 99 TPR/FPR   
  • 26. 2/15/2021 Introduction to Data Mining, 2nd Edition 26 Which Classifer is better? Medium Skew case 01 . 0 FPR 5 . 0 (r) Recall TPR 83 . 0 (p) Precision     T1 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 50 50 Class=No 10 990 T2 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 99 1 Class=No 100 900 T3 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 99 1 Class=No 10 990 1 . 0 FPR 99 . 0 (r) Recall TPR 5 . 0 (p) Precision     01 . 0 FPR 99 . 0 (r) Recall TPR 9 . 0 (p) Precision     0.62 measure F 50 TPR/FPR    0.66 measure F 9.9 TPR/FPR    0.94 measure F 99 TPR/FPR   
  • 27. 2/15/2021 Introduction to Data Mining, 2nd Edition 27 Which Classifer is better? High Skew case 01 . 0 FPR 5 . 0 (r) Recall TPR 3 . 0 (p) Precision     T1 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 50 50 Class=No 100 9900 T2 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 99 1 Class=No 1000 9000 T3 PREDICTED CLASS ACTUAL CLASS Class=Yes Class=No Class=Yes 99 1 Class=No 100 9900 1 . 0 FPR 99 . 0 (r) Recall TPR 09 . 0 (p) Precision     01 . 0 FPR 99 . 0 (r) Recall TPR 5 . 0 (p) Precision     0.375 measure F 50 TPR/FPR    0.165 measure F 9.9 TPR/FPR    0.66 measure F 99 TPR/FPR   
  • 28. 2/15/2021 Introduction to Data Mining, 2nd Edition 28 Building Classifiers with Imbalanced Training Set  Modify the distribution of training data so that rare class is well-represented in training set – Undersample the majority class – Oversample the rare class