MULTICLASS CLASSIFICATION OF IMBALANCED DATA
SAURABH WANI
27/04/2019
What are Imbalanced Datasets?
A dataset is imbalanced when one class heavily outnumbers the others, e.g. one fraudulent transaction per 10,000 legitimate ones.
Why do we need to work with Imbalanced Datasets?
Applications of Handling Imbalanced Data
• Terrorist Identification
• Credit Card Fraud Detection
• Rare Disease Identification
• Anomaly Detection
The MYTH of Accuracy
Confusion Matrix

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN}$

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$

$F_1 = 2 \cdot \dfrac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
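To see why accuracy is a myth on imbalanced data, here is a minimal sketch (using scikit-learn on a hypothetical 99:1 split) where a model that always predicts the majority class scores 99% accuracy yet has zero precision, recall, and F1:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical imbalanced labels: 990 negatives, 10 positives.
y_true = np.array([0] * 990 + [1] * 10)
# A degenerate "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))                    # 0.99 -- looks great
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0  (TP = 0)
print(recall_score(y_true, y_pred))                      # 0.0  (misses every positive)
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```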
Overfitting and Underfitting
Handling Imbalanced Data
Resampling Techniques:
1. Undersampling
2. Oversampling
3. Clustering-based resampling
4. Sample synthesis

Algorithmic Ensemble Techniques:
1. Bootstrap Aggregating (Bagging)
2. Boosting
3. AdaBoost
Resampling Techniques
Undersampling and Oversampling
Undersampling removes samples from the majority class; oversampling duplicates or re-draws samples from the minority class.
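A minimal sketch of both with the imbalanced-learn library (dataset and parameters are illustrative):

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Hypothetical two-class dataset with roughly a 9:1 imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# Undersampling: drop majority samples down to the minority count.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_u))

# Oversampling: duplicate minority samples up to the majority count.
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_o))
```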
Clustering-based Undersampling
Cluster the majority class and replace it with the cluster centroids, creating a smaller, representative set in place of the actual majority-class data.
• Can cause a loss of accuracy on the majority (negative) cases
• Centroid initialization is random, so the resampled set can vary between runs
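imbalanced-learn implements this idea as ClusterCentroids, which replaces the majority class with k-means centroids; a minimal sketch on an illustrative dataset:

```python
from collections import Counter
from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# The majority class is clustered and replaced by the cluster centroids,
# shrinking it to the size of the minority class.
X_res, y_res = ClusterCentroids(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```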
Sample Synthesis
• SMOTE (Synthetic Minority Oversampling Technique): creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours
• ADASYN (Adaptive Synthetic Sampling): like SMOTE, but generates more synthetic samples in regions where the minority class is harder to learn
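Both are available in imbalanced-learn; a minimal sketch on an illustrative dataset:

```python
from collections import Counter
from imblearn.over_sampling import ADASYN, SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# SMOTE: interpolate between minority samples and their nearest neighbours.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
# ADASYN: like SMOTE, but biased toward harder-to-learn minority regions.
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_sm), Counter(y_ad))
```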
Algorithmic Ensemble Techniques
Bagging (Bootstrap Aggregating)

Bootstrapping: randomly drawing k of the n samples into a subset, with replacement.

• Can be applied to both regression and classification
• Increases accuracy and reduces variance
• Reduces the problem of overfitting

The weak learners (models) are chosen so that each specializes in predictions on a different part of the feature space.
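A minimal bagging sketch with scikit-learn (dataset and parameters are illustrative): each tree is fit on its own bootstrap sample, and the ensemble predicts by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each of the 50 trees is trained on a bootstrap sample
# (n samples drawn with replacement); predictions are majority-voted.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X_tr, y_tr)
print(bag.score(X_te, y_te))
```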
Boosting
The term 'Boosting' refers to a family of algorithms that convert weak learners into strong learners. Boosting is an ensemble method for improving the predictions of any given learning algorithm: weak learners are trained sequentially, each trying to correct its predecessor.
AdaBoost (Adaptive Boosting)

AdaBoost focuses on classification problems and aims to combine a set of weak classifiers into a strong one (a minimal sketch follows this list):
1. Decision 'stumps' (one-level trees) are almost always used as the weak learners
2. Each sample carries a weight (equal for all samples at the start)
3. After each round, weights are reassigned based on whether each sample was classified correctly
4. The higher a sample's weight, the more the next learner focuses on it
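A minimal from-scratch sketch of that loop (assuming binary labels in {-1, +1}; the function names are illustrative, not a standard API):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Minimal AdaBoost sketch; expects labels y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # step 2: equal weights at the start
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)  # step 1: a stump
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()     # weighted error of this round
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)   # steps 3-4: upweight mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # Weighted vote of all stumps, weighted by their alpha.
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```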
Gradient Boosting
1. Instead of reweighting samples at each iteration, gradient boosting fits each new learner to reduce the remaining error of the ensemble built so far
2. Gradient descent is used for the error correction
3. Often highly accurate
4. Training is sequential, so it can be really slow
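A minimal sketch with scikit-learn's GradientBoostingClassifier (parameters are illustrative): each new shallow tree is fit against the remaining error of the ensemble built so far.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Trees are added sequentially; each one fits the gradient of the loss
# (the residual error) of the current ensemble.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gb.fit(X_tr, y_tr)
print(f1_score(y_te, gb.predict(X_te)))
```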
Thank You
