Here we discuss the issues that arise when applying Random Forests and AdaBoost data analysis methods to infrared spectroscopy data sets where the number of samples in each class varies.
Invited presentation at the 11th International Conference on Advanced Vibrational Spectroscopy (ICAVS-11), 23-26 August 2021. This was a virtual conference.
This presentation relates to our paper in Analyst "Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets" by Jiayi Tang, Alex Henderson and Peter Gardner.
Paper: https://doi.org/10.1039/D0AN02155E (available open access, CC-BY).
Raw data: https://doi.org/10.5281/zenodo.4986399 (CC-BY)
Processed data, and MATLAB source code: https://doi.org/10.5281/zenodo.4730312 (CC-BY)
Abstract
The use of infrared spectroscopy to augment decision-making in histopathology is a promising direction for the diagnosis of many disease types. Hyperspectral images of healthy and diseased tissue, generated by infrared spectroscopy, are used to build chemometric models that can provide objective metrics of disease state. It is important to build robust and stable models to provide confidence to the end user. The data used to develop such models can have a variety of characteristics which can pose problems to many model-building approaches. Here we have compared the performance of two machine learning algorithms – AdaBoost and Random Forests – on a variety of non-uniform data sets. Using samples of breast cancer tissue, we devised a range of training data capable of describing the problem space. Models were constructed from these training sets and their characteristics compared. In terms of separating infrared spectra of cancerous epithelium tissue from normal-associated tissue on the tissue microarray, both AdaBoost and Random Forests algorithms were shown to give excellent classification performance (over 95% accuracy) in this study. AdaBoost models were more robust when datasets with large imbalance were provided. The outcomes of this work are a measure of classification accuracy as a function of training data available, and a clear recommendation for choice of machine learning approach.
2. BACKGROUND AND RESOURCES
Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets
Analyst, May 2021
Open access: https://doi.org/10.1039/D0AN02155E
Data and source code
Raw: https://doi.org/10.5281/zenodo.4986399
Processed: https://doi.org/10.5281/zenodo.4730312
Media
Video and slide deck: https://alexhenderson.info
Jiayi (Jennie) Tang, Alex Henderson, Peter Gardner
https://gardner-lab.com
https://alexhenderson.info
https://twitter.com/PeterGardnerUoM
https://twitter.com/AlexHenderson00
6. ENSEMBLE METHODS IN MACHINE LEARNING
Machine learning: collection (committee) of weak learners
7. LEARNERS: THE WEAK VERSUS THE STRONG
One strong learner
Difficult to build
Need lots of information
Specialised to problem
Can overfit
Many weak learners
Easy to build
Each learner is barely better than guessing
Generality
8. LEARNERS: THE WEAK VERSUS THE STRONG
One strong learner
Difficult to build
Need lots of information
Specialised to problem
Can overfit
Many weak learners
Easy to build
Each learner is barely better than guessing
Generality
[Images: The Incredible Hulk, Avengers: Endgame; V for Vendetta]
9. DECISION TREE
Most common weak learner
Each node defines a question
Variables can be Boolean, categorical, or numeric ranges
Most critical question first, less important questions follow
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
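The node-as-question idea can be sketched in a few lines. The paper's released code is MATLAB; this is an illustrative Python sketch, and the feature names (intensity, tissue) are invented for the example, not taken from the study:

```python
# A decision tree is a cascade of questions, most informative first.
# Feature names here are hypothetical, purely for illustration.
def classify(sample):
    if sample["intensity"] > 0.6:             # numeric-range question first
        if sample["tissue"] == "epithelium":  # categorical question next
            return "cancer"
        return "stroma"
    return "normal"                           # fall-through for low intensity

print(classify({"intensity": 0.8, "tissue": "epithelium"}))  # prints: cancer
```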
10. RANDOM FORESTS™
Ensemble (collection) of decision trees
Each tree gets different variables
Many branches
Many leaves
Trees built in parallel
Example of ‘bagging’ (bootstrap aggregation)
Trademark of Leo Breiman & Adele Cutler
https://www.flickr.com/photos/125012285@N07/14478851169/in/photostream/
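The data-preparation side of the Random Forests recipe, where each tree gets a bootstrap sample of the rows and its own random subset of the variables, can be sketched as follows (a minimal stand-alone Python illustration, not the paper's MATLAB implementation):

```python
import random

def bagged_training_sets(data, n_features, n_trees, m_features, rng):
    """Bagging (bootstrap aggregation): for each tree, draw rows with
    replacement and pick a different random subset of the variables."""
    sets = []
    for _ in range(n_trees):
        rows = [rng.choice(data) for _ in data]            # with replacement
        feats = rng.sample(range(n_features), m_features)  # each tree's variables
        sets.append((rows, feats))
    return sets

rng = random.Random(0)
data = [([0.1 * i] * 5, i % 2) for i in range(10)]  # toy (features, label) pairs
sets = bagged_training_sets(data, n_features=5, n_trees=3, m_features=2, rng=rng)
```

Because each training set is independent, the trees can indeed be built in parallel, as the slide notes.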
11. DECISION STUMP
Very weak learner (~51%)
Only most critical question
considered
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
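A stump that asks only the single most critical question can be written as an exhaustive search over one-feature splits. This is a toy Python sketch (the paper's code is MATLAB):

```python
def fit_stump(X, y):
    # Scan every (feature, threshold, direction) and keep the single
    # split with the fewest misclassified training points.
    best = (len(y) + 1, None)
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            for sign in (1, -1):
                preds = [1 if sign * (row[f] - t) > 0 else 0 for row in X]
                err = sum(p != yi for p, yi in zip(preds, y))
                if err < best[0]:
                    best = (err, (f, t, sign))
    return best[1]

def stump_predict(stump, x):
    f, t, sign = stump
    return 1 if sign * (x[f] - t) > 0 else 0

# One numeric feature, two classes; the stump finds the separating threshold.
stump = fit_stump([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1])
```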
12. ADABOOST
Ensemble of decision tree stumps
Each tree gets different variables
One decision
Two leaves
Iterative
Example of ‘boosting’
Effectively a forest of stumps
https://www.conserve-energy-future.com/causes-effects-solutions-of-deforestation.php
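The iterative boosting loop, where each round fits a stump to re-weighted data and up-weights the points the stump got wrong, can be sketched in plain Python (a minimal illustration of the classic AdaBoost.M1 scheme, not the paper's MATLAB code):

```python
import math

def weighted_stump(X, y, w):
    # Weak learner: the one-question split with the lowest *weighted* error.
    best = (float("inf"), None)
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (1 if sign * (row[f] - t) > 0 else -1) != yi)
                if err < best[0]:
                    best = (err, (f, t, sign))
    return best

def adaboost(X, y, rounds=5):
    # y must be in {-1, +1}. Builds a weighted committee of stumps.
    n = len(y)
    w = [1.0 / n] * n
    committee = []
    for _ in range(rounds):
        err, (f, t, sign) = weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # this stump's vote weight
        preds = [1 if sign * (row[f] - t) > 0 else -1 for row in X]
        # Boosting step: up-weight the points this stump misclassified.
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
        committee.append((alpha, (f, t, sign)))
    return committee

def predict(committee, x):
    score = sum(a * (1 if s * (x[f] - t) > 0 else -1)
                for a, (f, t, s) in committee)
    return 1 if score > 0 else -1

model = adaboost([[0.0], [1.0], [2.0], [3.0]], [-1, -1, 1, 1], rounds=3)
```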
17. TISSUE DATA
Breast cancer TMA
Biomax BR20832
40 cores stage II breast cancer
10 cores normal-associated tissue
Top: H&E images
A = cancer
B = normal associated tissue
Bottom: FT-IR images
Red = cancerous epithelium
Purple = cancerous stroma
Green = NAT epithelium
Orange = NAT stroma
https://www.biomax.us/tissue-arrays/Breast/BR20832
18. UNDER-SAMPLING
Easiest method to understand
Determine class with the fewest members
Randomly delete members of other classes until all have the same number
Discards much of the data, training set reduced
Resulting model is weaker
Remains unbiased, but with higher variance
[Bar chart: under-sampling of four classes, showing data retained vs data discarded]
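The under-sampling recipe above, shrinking every class down to the size of the smallest one by random deletion, fits in a few lines of Python (an illustrative sketch, not the study's MATLAB code):

```python
import random
from collections import Counter

def undersample(X, y, rng):
    """Randomly delete members of the larger classes until every
    class has as many samples as the smallest one."""
    counts = Counter(y)
    n_min = min(counts.values())
    kept_X, kept_y = [], []
    for label in counts:
        idx = [i for i, yi in enumerate(y) if yi == label]
        for i in rng.sample(idx, n_min):   # sample without replacement
            kept_X.append(X[i])
            kept_y.append(label)
    return kept_X, kept_y

rng = random.Random(0)
X = list(range(8))
y = ["cancer"] * 6 + ["NAT"] * 2       # unbalanced: 6 vs 2
Xb, yb = undersample(X, y, rng)        # balanced: 2 vs 2
```

Note how much data is simply thrown away, which is why the resulting model is weaker, unbiased but with higher variance.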
19. OVER-SAMPLING
Determine class with the most members
Duplicate members of other classes to reach this number
Increases training data size
Many approaches
[Bar chart: over-sampling of four classes, showing original data vs duplicates]
20. OVER-SAMPLING APPROACHES
Class 1 – majority – N samples
Class 2 – minority – P samples
N >> P
• Duplicate all samples in class 2, N-P times
• Randomly select N samples from class 2 (sampling with replacement)
• Randomly select N-P samples from class 2 and append to original class 2
• Interpolate some class 2 members and append (example is SMOTE†)
†BMC Bioinformatics, 2013, 14, 106. https://doi.org/10.1186/1471-2105-14-106
Other approaches are available
21. OVER-SAMPLING APPROACHES
Assume class 1 is majority with N samples
Class 2 is minority with P samples
N >> P
• Duplicate all samples in class 2, N-P times
• Randomly select N samples from class 2 (sampling with replacement)
• Randomly select N-P samples from class 2 and append to original class 2
• Interpolate some class 2 members and append (example is SMOTE)
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
All data in the minority class are represented. Duplicates are drawn by ‘random sampling with replacement’ (Bootstrap)
26. OVER-SAMPLING TRAINING SETS
Data sets are balanced, but can become large
All cancer spectra are unique, but many NAT spectra are duplicates
Initial ratio | Num cancer | Over-sampled NAT   | Num NAT | Total
50:50         | 2500       | U U U U U          | 2500    | 5000
60:40         | 3000       | U U U U D D        | 3000    | 6000
70:30         | 3500       | U U U D D D D      | 3500    | 7000
80:20         | 4000       | U U D D D D D D    | 4000    | 8000
90:10         | 4500       | U D D D D D D D D  | 4500    | 9000
(Each cell represents a block of 500 NAT spectra: U = unique originals, D = random duplicates)
30. CONCLUSION
Both models correctly classify > 90% of samples
Models built with unbalanced classes can be misleading
AdaBoost slightly better at classification
Random Forests remains relatively stable until very small class sizes
AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high
31. You don't understand! I could’ve been a contender. I could've had class… Real class. — On the Waterfront
32. CONCLUSION
Both models correctly classify > 90% of samples
Models built with unbalanced classes can be misleading
AdaBoost slightly better at classification
Random Forests remains relatively stable until very small class sizes
AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high
Editor's Notes
Hello. I’d like to thank the organizers for giving me this opportunity to tell you about some work we’ve been doing in Manchester, using machine learning to look at unbalanced classes.
My name is Alex Henderson, and this presentation outlines work recently published in the Analyst, which is available Open Access.
Both the raw, and processed, data are available on Zenodo, and this video and slide deck will be made available from my and the group’s website, following the conference.
I think it’s only fair to point out that Jennie did all the work, and I only hope I can do a good job of representing her today!
So, what is the class imbalance problem?
Consider a piece of tissue, stained with H&E to highlight the cell morphology.
We can analyse this using infrared, [CLICK] and build a model to identify various cell types. Note, however, that there is a wide range in the composition of the tissue. Some cell types only appear in very low abundance.
And it’s this difference in the number of spectra in each class, that can present a problem when we come to build our chemometric models.
In this study we have explored adaptive boosting - or AdaBoost - and compared its performance against the Random Forests algorithm, now used by a number of groups, including ourselves.
Both AdaBoost and Random Forests fall into the category of Ensemble Methods.
An ‘ensemble’ is just another way of saying ‘a collection’, where the members of that collection are of the same type, but possibly different state.
Ensemble methods use collections of what are called - ‘weak learners’ - to attack the problem at hand.
These methods use many weak learners, rather than a single strong learner.
Strong learners can be difficult to build and may require a lot of data. They are tuned to the problem at hand, but can overfit if tuned too closely.
Weak learners on the other hand are relatively easy to build. The term ‘weak learner’ comes from the idea that they are not really very good at learning! A single weak learner has a success rate of barely over 50%; only just better than guessing, or tossing a coin.
However, when brought together en masse, they gel to form good models. Better than the sum of their parts, you could say!
So, while a strong learner will be useful for specific challenges, weak learners benefit from: ‘the wisdom of the crowds’.
The most common weak learner in ensemble learning is the decision tree, and these are used in both Random Forests and AdaBoost.
Here, the variable that best separates the training set data, becomes the ‘root node’. The data is then split into different branches. Each branch is considered separately, and the best variable for that branch becomes the decision point for the next split. The same variables can appear in different branches, in different orders, since the source data is changing after each split.
Eventually no further splits are required, and the outcome appears in leaf nodes.
Remember that these trees are not meant to be very good at making decisions! That’s the whole point!
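The split-selection step described above — finding the variable and threshold that best separate the training data — can be sketched in a few lines. This is an illustrative, stdlib-only Python sketch (the published work used MATLAB); the Gini impurity criterion and all names here are my own choices for illustration, not taken from the paper.

```python
def gini(labels):
    # Gini impurity of a set of class labels: 0 means a pure node.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    # Pick the (variable, threshold) pair that best separates the training
    # data; this becomes the root node, and each branch is then considered
    # separately with the same procedure.
    best = None
    n_vars = len(X[0])
    for var in range(n_vars):
        for thresh in sorted({row[var] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[var] < thresh]
            right = [yi for row, yi in zip(X, y) if row[var] >= thresh]
            if not left or not right:
                continue  # skip degenerate splits
            # Weighted average impurity of the two branches; lower is better.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, var, thresh)
    return best
```

For example, `best_split([[1], [2], [8], [9]], [0, 0, 1, 1])` finds the threshold 8 on variable 0, giving two pure branches.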
A random forest is a collection of decision trees, with each tree being given a different set of variables. This prevents any single variable from dominating in the resulting model.
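Two ingredients of the Random Forests recipe just mentioned — giving each tree a different variable subset, and combining the trees by majority vote — can be sketched as follows. This is a hedged, stdlib-only illustration; the function names and parameters are hypothetical, not from the paper's MATLAB code.

```python
import random
from collections import Counter

def feature_subsets(n_features, n_trees, subset_size, seed=0):
    # Each tree in the forest sees a different random subset of variables,
    # preventing any single variable from dominating the resulting model.
    rng = random.Random(seed)
    return [rng.sample(range(n_features), subset_size) for _ in range(n_trees)]

def majority_vote(tree_predictions):
    # Each tree votes for a class; the forest returns the majority choice.
    return Counter(tree_predictions).most_common(1)[0][0]
```

A usage sketch: with 100 spectral variables and 5 trees, `feature_subsets(100, 5, 10)` hands each tree its own 10 variables; at test time, `majority_vote(['cancer', 'NAT', 'cancer'])` returns `'cancer'`.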
For boosting approaches, AdaBoost being the first and most common, we make the decision trees even more ‘dumb’ by only allowing a single decision split. This produces what’s called a ‘decision tree stump’. The root node is still defined around the variable that is most ‘important’ in separating the data in the training set, but other variables don’t get a look in. Because there is only one split, the tree can’t ‘refine’ its decision, so it just has to go with what it’s got.
So, AdaBoost uses a collection of decision tree stumps, rather than full trees. Each tree gets different variables in the same way as Random Forests, but the trees only get to make a single choice.
The main difference between boosting techniques, such as AdaBoost, and a bagging approach like Random Forests, is that boosting is ‘iterative’.
So AdaBoost is effectively a forest of stumps…
[CLICK] …not to be confused with…
…a Forrest of Gumps!
Sorry, couldn’t resist!
The name AdaBoost is short for Adaptive Boosting. In this case the adaptive part is introduced by iteration and weighting.
[CLICK] To start with all samples are weighted equally. The decision tree (stump) then identifies a parameter that can split the data into class A or class B; in this case triangles and squares.
Any samples that were misclassified are then upweighted, with those correctly classified being downweighted. These modified data are then presented to a new decision tree. Since the weights on the previously misclassified samples are now higher, they are more likely to be correctly classified. Now, it is important to point out here that we’re not multiplying the spectral data points by this weighting; we’re changing their relative importance to the algorithm.
Next the misclassified samples from this second iteration are upweighted, with the correctly classified samples being downweighted, and we go for a third iteration.
After three iterations we stop, we combine the iterations and produce the ‘outcome’ of that tree ‘set’.
So, by iterating, and biasing each iteration in favour of samples that were wrongly classified in previous steps, we produce a stronger classifier. This might not be a VERY strong classifier, but it will be used in combination with others in the overall algorithm.
As with the Random Forests approach, when we introduce test data, each tree (or tree set) gets a vote for whichever class it thinks that test sample should fall into. There are various metrics that can be used here, but the majority vote is the easiest to think about and easiest to apply.
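The weight-update-and-vote loop described over the last few slides can be condensed into a short, pure-Python sketch using 1-D decision stumps. To be clear, this is my own minimal illustration with toy data, not the paper's implementation (the published code is MATLAB); the binary labels are coded as ±1, a common convention for AdaBoost.

```python
import math

def train_stump(X, y, w):
    # Find the threshold/polarity stump minimising the weighted error on 1-D data.
    best = None
    for thresh in sorted(set(X)):
        for polarity in (1, -1):
            pred = [polarity if x >= thresh else -polarity for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def adaboost_fit(X, y, n_rounds=5):
    n = len(X)
    w = [1.0 / n] * n          # to start with, all samples are weighted equally
    ensemble = []
    for _ in range(n_rounds):
        err, thresh, polarity = train_stump(X, y, w)
        err = max(err, 1e-12)  # guard against division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thresh, polarity))
        # Upweight misclassified samples, downweight correct ones, renormalise.
        # Note we change the samples' importance, not the spectral data itself.
        w = [wi * math.exp(-alpha * yi * (polarity if x >= thresh else -polarity))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    # Each stump votes, weighted by its alpha; the sign of the sum decides.
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On a trivially separable toy set (class −1 for x < 5, class +1 otherwise), three rounds are more than enough to classify every training point correctly.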
So, now we have our problem, and two potential algorithms to apply, how well do they work when presented with unbalanced data?
To assess this we used a tissue microarray containing breast cancer tissue from 208 patients. We selected 40 cores relating to cancer and 10 relating to normal-associated tissue. Normal-associated tissue comes from non-malignant cores, in regions adjacent to a tumour. You don’t usually get access to healthy tissue. After all, most people don’t want to have a biopsy unless there is some VERY GOOD underlying medical reason!
We manually annotated these tissues, according to W.H.O. guidelines, and identified regions corresponding to cancerous epithelium and normal associated epithelium. We also annotated normal and cancerous stroma, but those spectra were not included in this study.
So, the first sampling method we will take a look at is under-sampling.
In this method we identify the class with the fewest members and reduce all other classes to that number. This is simple to understand and to apply. The downside is that we tend to throw away lots of data. If the smallest class is much smaller than the others, we will end up discarding most of the data acquired. This has the knock-on effect of weakening the model because the data available for the training set will be a smaller sample of the acquired population.
The good thing about under-sampling is that all the spectra remain unique; there are no duplicates. The model will be unbiased, but will have a higher variance.
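The under-sampling procedure just described — find the smallest class, then randomly discard members of every other class down to that size — is simple enough to sketch directly. This is an illustrative stdlib-only Python version with hypothetical names; the samples could be spectra, and a fixed seed is used only to make the sketch reproducible.

```python
import random

def undersample(classes, seed=0):
    # classes: dict mapping class label -> list of samples (e.g. spectra).
    rng = random.Random(seed)
    n_min = min(len(v) for v in classes.values())  # size of the smallest class
    # Randomly keep n_min samples from each class; everything else is discarded.
    return {label: rng.sample(samples, n_min)
            for label, samples in classes.items()}
```

With 1,000 ‘cancer’ samples and 100 ‘NAT’ samples, both classes come out at 100 — balanced, unique, but with 900 acquired spectra thrown away.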
The opposite of under-sampling is, of course, over-sampling!
In this scenario we increase the numbers in each of the minority classes to match the class with the most members. This will increase the size of the training set, which could be problematic for the target algorithm or computational resource available.
The biggest problem, however, comes when we have to decide on where these increased numbers will come from.
There are lots of methods we can choose to over-sample our data. Here I’ve listed four.
The first simply takes a copy of the smaller class and appends it to itself. We can repeat this until we reach the size of the larger class. Of course we will never get an exact match, well pretty unlikely anyway, so we need a method of dealing with the over/under hang. We can simply ignore this and say our classes are now much more similar, or we can use some form of randomisation to get the exact number.
This has the benefit of each spectrum in the minority class being equally represented in the newly generated group; well without taking into account the randomness if that’s the way we want to go.
And, of course, there are other approaches we could take.
The second approach uses something like a Bootstrap sampling approach, which is ‘sampling with replacement’, to randomly re-generate the minority class. Bootstrap has low bias and variance, but there could be samples that never actually get selected. That means we are throwing away some original data.
Method three is similar to method two, except we ensure all the minority class are included and only Bootstrap the required difference.
Then there is the option of changing the data. The first three methods simply selected (or didn’t select) the spectra in the minority class. Another approach is to interpolate some of the spectra to generate data that was never actually acquired. One of these methods is called SMOTE and is discussed in a paper by Blagus and Lusa.
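The interpolation idea can be sketched as below. Please note this is a deliberately simplified, SMOTE-like illustration and NOT the full SMOTE algorithm (real SMOTE interpolates between a sample and one of its k nearest minority-class neighbours; see the Blagus and Lusa paper). All names here are my own.

```python
import random

def interpolate_minority(minority, n_new, seed=0):
    # Simplified SMOTE-style idea: create synthetic samples by linear
    # interpolation between randomly chosen pairs of minority-class samples.
    # Each sample is a list of intensities (e.g. a spectrum).
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # pick two distinct originals
        t = rng.random()                # random point along the line a -> b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Note that, unlike the selection-based methods, this generates data that was never actually acquired, which is part of the reason we did not take this route.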
However, in this work we decided to go with method 3. This has the advantage of ensuring all the data acquired, relating to the minority class, are actually included in the training set, and any duplication being handled by the well-respected Bootstrap method.
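Method 3 — keep every original minority sample and Bootstrap only the shortfall — can be sketched in a few lines. This is an illustrative stdlib-only Python version (the published code is MATLAB); function and variable names are my own.

```python
import random

def oversample_keep_all(majority, minority, seed=0):
    # Keep every original minority sample, then use the Bootstrap (random
    # sampling with replacement) for only the N - P extra samples needed
    # to match the majority class size.
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(minority) + extra
```

So for the 90:10 case in the table above: 4,500 majority (cancer) spectra and 500 unique minority (NAT) spectra give a balanced minority class of 4,500, in which every one of the 500 originals is guaranteed to appear.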
So how did we get on?
First I should mention that the same independent test set was used in all cases. In addition we tried as much as possible to create training sets that were built by either expanding or contracting existing training sets, rather than generating each one randomly. This has the advantage of showing the variation in having larger or smaller data sets, rather than new ones created randomly. If we were to create lots of random data sets, some trends might be hidden.
In all cases the exercise of generating training sets and testing them was repeated 5 times, with the same independent test set used in each case.
It’s useful to get some ground truth, so we know whether any changes we see as a function of sampling are actually due to the change in the size of the training sets themselves.
We created balanced sets of different sizes, from 2,500 per class down to 10. As you can see, both algorithms perform surprisingly well. It’s not until we get down to 100 samples per class that AdaBoost starts to fall over. At this point all samples are being classified as normal-associated. However, when we have large numbers per class it performs a little better than Random Forests. Although, you have to say that classification accuracies of 90% and over are really rather good. It is worth pointing out here that all these data are generated from the same TMA, so accuracies of this level will probably not be maintained across different samples, instruments, etc. However, using the same sample has the benefit of removing these additional sources of error, so we can concentrate on the performance of the algorithms themselves, and the sampling methods.
On the right, we can see that the Random Forests method stays pretty strong beyond 100 samples, and can even generate a reasonable result with only 10 samples per class!
So, taking a closer view of the left hand side of that plot, we generated some under-sampled training data. Each of these training sets has the same number of cancer and normal associated spectra, but as the size of the minority set gets smaller, you can see we end up throwing away lots of the majority class to match.
AdaBoost appears to out-perform Random Forests, with the normal-associated tissue being almost perfectly classified for all sample sizes. Although, to be fair, they both do pretty well. The cancer samples do not perform quite as well, so more are being misclassified as the training set gets smaller. The variability in the Random Forests data is slightly larger too.
Over-sampling is a bit more complicated. The red box in the table on the right indicates the spectra that are unique. That includes all the cancer spectra and normal-associated spectra originally in the samples. In order to over-sample we randomly duplicate more and more of the normal-associated, to keep up with the growing cancer data set. The dark blue squares labelled - D - represent duplicates, while the light blue squares labelled - U - represent the original spectra. As you can see, by the time we have a ratio of 9 to 1 we have 4,500 cancer spectra, each of which is unique, but only 500 unique normal-associated spectra. From these 500 we now need to randomly select another 4,000 spectra.
So, how does this duplication affect the outcome? Well, the AdaBoost method still seems to perform strongly. Note that the two lines cross over when our ratio is very large. This is probably due to the duplication in the normal-associated data leading to overfitting and that being reflected in its inability to correctly classify the test data.
The Random Forests method performs less well, and appears to be more influenced by the duplication than AdaBoost.
It’s worth taking a moment to compare the two sampling methods, using the same algorithm. With AdaBoost it looks like over-sampling works best and the level of classification accuracy remains fairly constant as the sample sizes change.
However, with Random Forests we get a different answer. Note how under-sampling improves the normal-associated accuracy, while the cancer samples become less well classified. However, with over-sampling we get the opposite effect. The cancer samples get better, but the normal-associated fall away.
This is worrying because it means we could get a different answer depending on the choice of algorithm AND the choice of sampling method.
So, what did we learn from doing this work?
Firstly, on this, admittedly, limited, data set, we can see that infrared does a good job of classifying cancer from non-cancer data. We have been discussing values in the 80-95% accuracy range, and, even allowing, for the use of a single instrument and a single TMA, this is an indication that IR is useful here.
However, we need to be careful in our choice of algorithm and sampling method, because our results could be misleading.
AdaBoost seems to be slightly better at classification, and both AdaBoost and Random Forests will give good accuracy down to about 100 spectra per class (under-sampled). And Random Forests remains relatively stable until we reach very small class sizes, in the 10s.
AdaBoost seems to be stable to over-sampling, while Random Forests is only stable for ratios that are relatively close; down to about 70:30.
Coming back to our original question, for unbalanced classes, will AdaBoost come to the rescue?
Well, I think the jury is still out. However, I think AdaBoost IS a contender, and we should do more work in this area to see how useful it can be.