THE CLASS IMBALANCE PROBLEM:
ADABOOST TO THE RESCUE?
BACKGROUND AND RESOURCES
Exploring AdaBoost and Random Forests machine
learning approaches for infrared pathology on
unbalanced data sets
Analyst, May 2021
Open access: https://doi.org/10.1039/D0AN02155E
Data and source code
Raw: https://doi.org/10.5281/zenodo.4986399
Processed: https://doi.org/10.5281/zenodo.4730312
Media
Video and slide deck: https://alexhenderson.info
Jiayi (Jennie) Tang, Alex Henderson, Peter Gardner
https://gardner-lab.com
https://alexhenderson.info
https://twitter.com/PeterGardnerUoM
https://twitter.com/AlexHenderson00
THE CLASS IMBALANCE PROBLEM
ALL THINGS BEING (UN)EQUAL
TISSUE PATHOLOGY
Epithelium 24.3%
Smooth Muscle 50.7%
Lymphocytes 2.5%
Blood 0.2%
Concretion 0.0%
Fibrous Stroma 12.3%
ECM 10.0%
Proc. SPIE 9041, Medical Imaging 2014:
Digital Pathology, 90410D; https://doi.org/10.1117/12.2043290
H&E stained prostate tissue
False colour histopathology classification
MACHINE LEARNING: ENSEMBLE METHODS
BOOSTING AND BAGGING; TREES AND STUMPS
ENSEMBLE METHODS IN MACHINE LEARNING
Machine learning: Collection (committee) of weak
learners
LEARNERS: THE WEAK VERSUS THE STRONG
One strong learner
• Difficult to build
• Needs lots of information
• Specialised to the problem
• Can overfit
Many weak learners
• Easy to build
• Each learner is barely better than guessing
• Generalises well
Images: The Incredible Hulk (Avengers: Endgame); V for Vendetta
DECISION TREE
• Most common weak learner
• Each node defines a question
• Variables can be Boolean, categories, or numeric ranges
• Most critical question first, less important questions follow
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
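The node-by-node questioning above can be sketched in a few lines of Python. Every feature name, threshold, and tissue label here is purely illustrative, not taken from the slides or the paper:

```python
# Toy hand-built decision tree: each node asks one question, with the
# most critical question first. The variables mix Booleans and numeric
# ranges. All names and thresholds are made up for illustration.

def classify_pixel(features):
    """Classify one tissue pixel from a dict of (hypothetical) features."""
    if features["amide_I"] > 0.5:             # most critical question first
        if features["is_epithelial_marker"]:  # Boolean variable
            return "epithelium"
        return "stroma"
    if features["lipid_band"] > 0.2:          # numeric-range variable
        return "smooth muscle"
    return "other"

print(classify_pixel({"amide_I": 0.7, "is_epithelial_marker": True,
                      "lipid_band": 0.1}))   # epithelium
```

Each call walks at most two questions deep; less important questions are only reached when the first one does not settle the answer.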
RANDOM FORESTS™
• Ensemble (collection) of decision trees
• Each tree gets different variables
• Many branches
• Many leaves
• Trees built in parallel
• Example of ‘bagging’ (bootstrap aggregation)
Trademark of Leo Breiman & Adele Cutler
https://www.flickr.com/photos/125012285@N07/14478851169/in/photostream/
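A minimal sketch of the bagging idea, assuming a toy one-threshold learner stands in for a full decision tree (Random Forests also randomises which variables each tree sees, which this sketch omits):

```python
# Bagging (bootstrap aggregation) sketch: train each learner on a
# bootstrap resample of the data, then combine by majority vote.
# The "tree" is the simplest possible stand-in, a single threshold
# placed midway between the class means, so the sketch stays tiny.
import random
from collections import Counter

def fit_rule(pairs):
    """Toy learner: threshold at the midpoint between class means."""
    pos = [x for x, y in pairs if y == 1]
    neg = [x for x, y in pairs if y == 0]
    if not pos or not neg:          # degenerate bootstrap sample
        return 0.5
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagging(pairs, n_trees=25, seed=0):
    rng = random.Random(seed)
    # each learner sees a different bootstrap sample (with replacement)
    return [fit_rule([rng.choice(pairs) for _ in pairs])
            for _ in range(n_trees)]

def predict(thresholds, x):
    votes = Counter(1 if x >= t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]

pairs = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
model = bagging(pairs)
print(predict(model, 0.15), predict(model, 0.85))  # 0 1
```

Because the learners are trained in parallel on independent resamples, the ensemble's vote is much more stable than any single learner.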
DECISION STUMP
• Very weak learner: accuracy only slightly better than chance (~51%)
• Only the most critical question is considered
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
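A decision stump can be fitted by exhaustively trying every threshold on a single feature and keeping the one with the lowest training error. A pure-Python sketch on made-up data:

```python
# Decision stump: a one-question tree. Try every candidate threshold and
# both polarities, keep whichever minimises training error. On realistic
# noisy data its accuracy is typically only just above chance.

def fit_stump(xs, ys):
    """xs: feature values; ys: labels in {+1, -1}.
    Returns (threshold, polarity) of the best stump."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= t else -polarity for x in xs]
            err = sum(p != y for p, y in zip(preds, ys))
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best[1], best[2]

xs = [0.1, 0.4, 0.35, 0.8]
ys = [-1, -1, -1, 1]
t, pol = fit_stump(xs, ys)
print(t, pol)  # 0.8 1
```

On this toy data the stump happens to separate the classes perfectly; on overlapping classes the same search simply returns the least-bad single cut.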
ADABOOST
• Ensemble of decision tree stumps
• Each tree gets different variables
• One decision
• Two leaves
• Iterative
• Example of ‘boosting’
Effectively a forest of stumps
https://www.conserve-energy-future.com/causes-effects-solutions-of-deforestation.php
Forrest Gump …a Forrest of Gumps!
ADAPTIVE BOOSTING: ADABOOST
MOST COMMON BOOSTING APPROACH
ADABOOST ITERATIONS
Iteration 1: weak classifier
Iteration 2: weak classifier
Iteration 3: weak classifier
After each iteration, misclassified samples are upweighted and correctly classified samples are downweighted.
Combining the iterations, with weightings, gives the model: a stronger classifier.
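The iteration scheme above, upweight mistakes, downweight correct answers, then combine the rounds with weights, can be written out from scratch. This is a generic textbook AdaBoost sketch over decision stumps on toy 1-D data, not the implementation used in the paper:

```python
# Minimal AdaBoost sketch for binary labels in {+1, -1}.
import math

def fit_weighted_stump(xs, ys, w):
    """Best stump under sample weights w. Returns (threshold, polarity, error)."""
    best = None
    for t in sorted(set(xs)):
        for pol in (1, -1):
            preds = [pol if x >= t else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[2]:
                best = (t, pol, err)
    return best

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n                 # start with uniform sample weights
    model = []                        # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        t, pol, err = fit_weighted_stump(xs, ys, w)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # this round's vote weight
        preds = [pol if x >= t else -pol for x in xs]
        # upweight mistakes, downweight correct answers, then renormalise
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
        model.append((alpha, t, pol))
    return model

def predict(model, x):
    """Weighted vote over all weak classifiers."""
    score = sum(a * (pol if x >= t else -pol) for a, t, pol in model)
    return 1 if score >= 0 else -1

xs = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
ys = [-1, -1, 1, -1, 1, 1]
model = adaboost(xs, ys, rounds=5)
print([predict(model, x) for x in xs])
```

No single stump can classify this data correctly, but the weighted committee of stumps can, which is exactly the point of boosting.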
METHODS OF MANAGING CLASS IMBALANCE
UNDER-SAMPLING, OVER-SAMPLING
TISSUE DATA
Breast cancer TMA
Biomax BR20832
40 cores stage II breast cancer
10 cores normal-associated tissue
Top: H&E images
A = cancer
B = normal associated tissue
Bottom: FT-IR images
Red = cancerous epithelium
Purple = cancerous stroma
Green = NAT epithelium
Orange = NAT stroma
https://www.biomax.us/tissue-arrays/Breast/BR20832
UNDER-SAMPLING
• Easiest method to understand
• Determine the class with the fewest members
• Randomly delete members of the other classes until all have the same number
• Discards much of the data; the training set is reduced
• The resulting model is weaker
• It remains unbiased, but with higher variance
[Bar chart "Under-sampling": classes 1-4, up to 1000 members each; larger classes are cut to the smallest class size (data retained vs data discarded)]
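A sketch of random under-sampling, assuming the data arrive as a dict mapping class label to a list of samples (a hypothetical layout chosen for the example):

```python
# Random under-sampling: shrink every class to the size of the smallest
# one by deleting members at random. Simple, but much data is discarded.
import random

def undersample(data, seed=0):
    """data: dict of class label -> list of samples."""
    n_min = min(len(samples) for samples in data.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {label: rng.sample(samples, n_min)
            for label, samples in data.items()}

data = {"cancer": list(range(1000)), "NAT": list(range(100))}
balanced = undersample(data)
print({k: len(v) for k, v in balanced.items()})  # both classes now 100
```

Here 900 of the 1000 majority-class samples are thrown away, which is why the resulting model tends to have higher variance.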
OVER-SAMPLING
• Determine the class with the most members
• Duplicate members of the other classes to reach this number
• Increases training data size
• Many approaches
[Bar chart "Over-sampling": classes 1-4, each grown to about 1000 members (original data plus duplicates)]
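The mirror-image sketch for random over-sampling by duplication (sampling with replacement), under the same hypothetical dict-of-lists layout:

```python
# Random over-sampling: grow every class to the size of the largest one
# by duplicating members, i.e. sampling with replacement.
import random

def oversample(data, seed=0):
    """data: dict of class label -> list of samples."""
    n_max = max(len(samples) for samples in data.values())
    rng = random.Random(seed)
    out = {}
    for label, samples in data.items():
        extra = rng.choices(samples, k=n_max - len(samples))  # duplicates
        out[label] = samples + extra
    return out

data = {"cancer": list(range(1000)), "NAT": list(range(100))}
balanced = oversample(data)
print({k: len(v) for k, v in balanced.items()})  # both classes now 1000
```

No information is discarded, but the minority class now contains many repeated samples, so the training set is larger without being more diverse.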
OVER-SAMPLING APPROACHES
Class 1 – majority – N samples
Class 2 – minority – P samples
N >> P
• Duplicate all samples in class 2, N-P times
• Randomly select N samples from class 2
(sampling with replacement)
• Randomly select N-P samples from
class 2 and append to original class 2
• Interpolate some class 2 members and append (an example is SMOTE†)
†BMC Bioinformatics, 2013, 14, 106. https://doi.org/10.1186/1471-2105-14-106
Other approaches are available
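The interpolation idea can be sketched as a simplified SMOTE. Real SMOTE interpolates towards one of the k nearest neighbours in feature space; this toy version uses only the single nearest neighbour to stay short:

```python
# Simplified SMOTE-style over-sampling: synthesise new minority samples
# at a random point on the line between a minority sample and its
# nearest minority neighbour. (Real SMOTE chooses among k neighbours.)
import random

def nearest(p, others):
    """Nearest neighbour by squared Euclidean distance."""
    return min(others, key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))

def smote_like(minority, n_new, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        q = nearest(p, [m for m in minority if m != p])
        t = rng.random()  # random point on the segment p -> q
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new = smote_like(minority, n_new=5)
print(len(new))  # 5 synthetic samples inside the minority region
```

Unlike plain duplication, every synthetic sample is a genuinely new point, which can help the classifier generalise around the minority class rather than memorise its few members.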
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
All data in the minority class are represented. Duplicates are ‘random sampling with replacement’ (bootstrap).
RESULTS
SAME INDEPENDENT TEST SET USED THROUGHOUT
DIFFERENT CLASS SIZES: BALANCED DATA
AdaBoost Random Forests
UNDER-SAMPLING TRAINING SETS
Ratio   Num cancer (under-sampled)   Num NAT   Total
50:50   2500                         2500      5000
50:50   2000                         2000      4000
50:50   1500                         1500      3000
50:50   1000                         1000      2000
50:50    500                          500      1000
Data sets are balanced, but can become small
All spectra are unique
UNDER-SAMPLE
AdaBoost Random Forests
OVER-SAMPLING TRAINING SETS
Data sets are balanced, but can become large
All cancer spectra are unique, but many NAT spectra are duplicates
Initial ratio   Num cancer   Over-sampled NAT composition   Num NAT   Total
50:50           2500         U U U U U                      2500      5000
60:40           3000         U U U U D D                    3000      6000
70:30           3500         U U U D D D D                  3500      7000
80:20           4000         U U D D D D D D                4000      8000
90:10           4500         U D D D D D D D D              4500      9000
(U = block of unique spectra, D = block of duplicates)
OVER-SAMPLE
AdaBoost Random Forests
ADABOOST: UNDER AND OVER
Under-sample Over-sample
RANDOM FORESTS
Under-sample Over-sample
CONCLUSION
• Both models correctly classify > 90% of samples
• Models built with unbalanced classes can be misleading
• AdaBoost is slightly better at classification
• Random Forests remains relatively stable until class sizes become very small
• AdaBoost with over-sampling could be a good combination, particularly when the class imbalance is high
“You don’t understand! I could’ve been a contender. I could’ve had class… Real class.” On the Waterfront

More Related Content

Similar to The Class Imbalance Problem: AdaBoost to the Rescue?

Genomica - Microarreglos de DNA
Genomica - Microarreglos de DNAGenomica - Microarreglos de DNA
Genomica - Microarreglos de DNAUlises Urzua
 
2016 Presentation at the University of Hawaii Cancer Center
2016 Presentation at the University of Hawaii Cancer Center2016 Presentation at the University of Hawaii Cancer Center
2016 Presentation at the University of Hawaii Cancer CenterCasey Greene
 
Essential Biology 04.4 Genetic Engineering & Biotechnology
Essential Biology 04.4 Genetic Engineering & BiotechnologyEssential Biology 04.4 Genetic Engineering & Biotechnology
Essential Biology 04.4 Genetic Engineering & BiotechnologyStephen Taylor
 
Essential Biology 04.1 Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1   Chromosomes, Genes, Alleles, MutationsEssential Biology 04.1   Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1 Chromosomes, Genes, Alleles, MutationsStephen Taylor
 
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Integrative analysis of transcriptomics and proteomics data with ArrayMining ...
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Natalio Krasnogor
 
BES/SfE talk 2014
BES/SfE talk 2014BES/SfE talk 2014
BES/SfE talk 2014Bob O'Hara
 
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...nooriasukmaningtyas
 
Breast cancer detection using Artificial Neural Network
Breast cancer detection using Artificial Neural NetworkBreast cancer detection using Artificial Neural Network
Breast cancer detection using Artificial Neural NetworkSubroto Biswas
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsElena Sügis
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1Double Check ĆŐNSULTING
 
Introduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisIntroduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisJosh Neufeld
 
Identification of toxicants and metabolites
Identification of toxicants and metabolitesIdentification of toxicants and metabolites
Identification of toxicants and metabolitesSteffen Neumann
 
Essential Biology 4.3 Theoretical Genetics
Essential Biology 4.3 Theoretical GeneticsEssential Biology 4.3 Theoretical Genetics
Essential Biology 4.3 Theoretical GeneticsStephen Taylor
 
2013 machine learning_choih
2013 machine learning_choih2013 machine learning_choih
2013 machine learning_choihHongyoon Choi
 
Slides for st judes
Slides for st judesSlides for st judes
Slides for st judesSean Ekins
 
Research Methodology 1
 Research Methodology 1 Research Methodology 1
Research Methodology 1Tamer Hifnawy
 
Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...
Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...
Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...Goh Mei Ying
 
acs talk open source drug discovery
acs talk open source drug discoveryacs talk open source drug discovery
acs talk open source drug discoverySean Ekins
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 

Similar to The Class Imbalance Problem: AdaBoost to the Rescue? (20)

Genomica - Microarreglos de DNA
Genomica - Microarreglos de DNAGenomica - Microarreglos de DNA
Genomica - Microarreglos de DNA
 
2016 Presentation at the University of Hawaii Cancer Center
2016 Presentation at the University of Hawaii Cancer Center2016 Presentation at the University of Hawaii Cancer Center
2016 Presentation at the University of Hawaii Cancer Center
 
PhD midterm report
PhD midterm reportPhD midterm report
PhD midterm report
 
Essential Biology 04.4 Genetic Engineering & Biotechnology
Essential Biology 04.4 Genetic Engineering & BiotechnologyEssential Biology 04.4 Genetic Engineering & Biotechnology
Essential Biology 04.4 Genetic Engineering & Biotechnology
 
Essential Biology 04.1 Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1   Chromosomes, Genes, Alleles, MutationsEssential Biology 04.1   Chromosomes, Genes, Alleles, Mutations
Essential Biology 04.1 Chromosomes, Genes, Alleles, Mutations
 
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...Integrative analysis of transcriptomics and proteomics data with ArrayMining ...
Integrative analysis of transcriptomics and proteomics data with ArrayMining ...
 
BES/SfE talk 2014
BES/SfE talk 2014BES/SfE talk 2014
BES/SfE talk 2014
 
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...
A class skew-insensitive ACO-based decision tree algorithm for imbalanced dat...
 
Breast cancer detection using Artificial Neural Network
Breast cancer detection using Artificial Neural NetworkBreast cancer detection using Artificial Neural Network
Breast cancer detection using Artificial Neural Network
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
Introduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysisIntroduction to 16S rRNA gene multivariate analysis
Introduction to 16S rRNA gene multivariate analysis
 
Identification of toxicants and metabolites
Identification of toxicants and metabolitesIdentification of toxicants and metabolites
Identification of toxicants and metabolites
 
Essential Biology 4.3 Theoretical Genetics
Essential Biology 4.3 Theoretical GeneticsEssential Biology 4.3 Theoretical Genetics
Essential Biology 4.3 Theoretical Genetics
 
2013 machine learning_choih
2013 machine learning_choih2013 machine learning_choih
2013 machine learning_choih
 
Slides for st judes
Slides for st judesSlides for st judes
Slides for st judes
 
Research Methodology 1
 Research Methodology 1 Research Methodology 1
Research Methodology 1
 
Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...
Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...
Application of Fluorescence Activated-Cell Sorting (FACS) in separation of di...
 
acs talk open source drug discovery
acs talk open source drug discoveryacs talk open source drug discovery
acs talk open source drug discovery
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 

More from Alex Henderson

FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Hyperspectral Data Issues
Hyperspectral Data IssuesHyperspectral Data Issues
Hyperspectral Data IssuesAlex Henderson
 
Getting started with chemometric classification
Getting started with chemometric classificationGetting started with chemometric classification
Getting started with chemometric classificationAlex Henderson
 
Too good to be true? How validate your data
Too good to be true? How validate your dataToo good to be true? How validate your data
Too good to be true? How validate your dataAlex Henderson
 
2020 Vision (Dubious Design Decisions)
2020 Vision (Dubious Design Decisions)2020 Vision (Dubious Design Decisions)
2020 Vision (Dubious Design Decisions)Alex Henderson
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balanceAlex Henderson
 
Digging into Data: Analysis and Visualisation in 3D
Digging into Data: Analysis and Visualisation in 3DDigging into Data: Analysis and Visualisation in 3D
Digging into Data: Analysis and Visualisation in 3DAlex Henderson
 
Rise of the Machines: The Use of Machine Learning in SIMS Data Analysis
Rise of the Machines: The Use of Machine Learning in SIMS Data AnalysisRise of the Machines: The Use of Machine Learning in SIMS Data Analysis
Rise of the Machines: The Use of Machine Learning in SIMS Data AnalysisAlex Henderson
 
What's mine is yours (and vice versa) Data sharing in vibrational spectroscopy
What's mine is yours (and vice versa) Data sharing in vibrational spectroscopyWhat's mine is yours (and vice versa) Data sharing in vibrational spectroscopy
What's mine is yours (and vice versa) Data sharing in vibrational spectroscopyAlex Henderson
 
How to validate your model
How to validate your modelHow to validate your model
How to validate your modelAlex Henderson
 
Interpretation of Static SIMS Spectra
Interpretation of Static SIMS SpectraInterpretation of Static SIMS Spectra
Interpretation of Static SIMS SpectraAlex Henderson
 
Secondary Ion Mass Spectrometry
Secondary Ion Mass SpectrometrySecondary Ion Mass Spectrometry
Secondary Ion Mass SpectrometryAlex Henderson
 

More from Alex Henderson (13)

FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Hyperspectral Data Issues
Hyperspectral Data IssuesHyperspectral Data Issues
Hyperspectral Data Issues
 
Getting started with chemometric classification
Getting started with chemometric classificationGetting started with chemometric classification
Getting started with chemometric classification
 
Too good to be true? How validate your data
Too good to be true? How validate your dataToo good to be true? How validate your data
Too good to be true? How validate your data
 
2020 Vision (Dubious Design Decisions)
2020 Vision (Dubious Design Decisions)2020 Vision (Dubious Design Decisions)
2020 Vision (Dubious Design Decisions)
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balance
 
Digging into Data: Analysis and Visualisation in 3D
Digging into Data: Analysis and Visualisation in 3DDigging into Data: Analysis and Visualisation in 3D
Digging into Data: Analysis and Visualisation in 3D
 
Rise of the Machines: The Use of Machine Learning in SIMS Data Analysis
Rise of the Machines: The Use of Machine Learning in SIMS Data AnalysisRise of the Machines: The Use of Machine Learning in SIMS Data Analysis
Rise of the Machines: The Use of Machine Learning in SIMS Data Analysis
 
What's mine is yours (and vice versa) Data sharing in vibrational spectroscopy
What's mine is yours (and vice versa) Data sharing in vibrational spectroscopyWhat's mine is yours (and vice versa) Data sharing in vibrational spectroscopy
What's mine is yours (and vice versa) Data sharing in vibrational spectroscopy
 
How to validate your model
How to validate your modelHow to validate your model
How to validate your model
 
Interpretation of Static SIMS Spectra
Interpretation of Static SIMS SpectraInterpretation of Static SIMS Spectra
Interpretation of Static SIMS Spectra
 
Secondary Ion Mass Spectrometry
Secondary Ion Mass SpectrometrySecondary Ion Mass Spectrometry
Secondary Ion Mass Spectrometry
 

Recently uploaded

Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptxCherry
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry Areesha Ahmad
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxseri bangash
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptxCherry
 
GBSN - Microbiology (Unit 5) Concept of isolation
GBSN - Microbiology (Unit 5) Concept of isolationGBSN - Microbiology (Unit 5) Concept of isolation
GBSN - Microbiology (Unit 5) Concept of isolationAreesha Ahmad
 
Method of Quantifying interactions and its types
Method of Quantifying interactions and its typesMethod of Quantifying interactions and its types
Method of Quantifying interactions and its typesNISHIKANTKRISHAN
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....muralinath2
 
Understanding Partial Differential Equations: Types and Solution Methods
Understanding Partial Differential Equations: Types and Solution MethodsUnderstanding Partial Differential Equations: Types and Solution Methods
Understanding Partial Differential Equations: Types and Solution Methodsimroshankoirala
 
Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Cherry
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptxMuhammadRazzaq31
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center ChimneyX-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center ChimneySérgio Sacani
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Cherry
 
ONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for voteONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for voteRaunakRastogi4
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsSérgio Sacani
 
Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Nistarini College, Purulia (W.B) India
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
Daily Lesson Log in Science 9 Fourth Quarter Physics
Daily Lesson Log in Science 9 Fourth Quarter PhysicsDaily Lesson Log in Science 9 Fourth Quarter Physics
Daily Lesson Log in Science 9 Fourth Quarter PhysicsWILSONROMA4
 

Recently uploaded (20)

Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
PODOCARPUS...........................pptx
PODOCARPUS...........................pptxPODOCARPUS...........................pptx
PODOCARPUS...........................pptx
 
GBSN - Microbiology (Unit 5) Concept of isolation
GBSN - Microbiology (Unit 5) Concept of isolationGBSN - Microbiology (Unit 5) Concept of isolation
GBSN - Microbiology (Unit 5) Concept of isolation
 
Method of Quantifying interactions and its types
Method of Quantifying interactions and its typesMethod of Quantifying interactions and its types
Method of Quantifying interactions and its types
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Understanding Partial Differential Equations: Types and Solution Methods
Understanding Partial Differential Equations: Types and Solution MethodsUnderstanding Partial Differential Equations: Types and Solution Methods
Understanding Partial Differential Equations: Types and Solution Methods
 
Lipids: types, structure and important functions.
Lipids: types, structure and important functions.Lipids: types, structure and important functions.
Lipids: types, structure and important functions.
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptx
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center ChimneyX-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.
 
ONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for voteONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for vote
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...Energy is the beat of life irrespective of the domains. ATP- the energy curre...
Energy is the beat of life irrespective of the domains. ATP- the energy curre...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Daily Lesson Log in Science 9 Fourth Quarter Physics
Daily Lesson Log in Science 9 Fourth Quarter PhysicsDaily Lesson Log in Science 9 Fourth Quarter Physics
Daily Lesson Log in Science 9 Fourth Quarter Physics
 
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
 

The Class Imbalance Problem: AdaBoost to the Rescue?

  • 1. THE CLASS IMBALANCE PROBLEM: ADABOOST TO THE RESCUE?
  • 2. BACKGROUND AND RESOURCES Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets Analyst, May 2021 Open access: https://doi.org/10.1039/D0AN02155E Data and source code Raw: https://doi.org/10.5281/zenodo.4986399 Processed: https://doi.org/10.5281/zenodo.4730312 Media Video and slide deck: https://alexhenderson.info Jiayi (Jennie) Tang Alex Henderson Peter Gardner https://gardner-lab.com https://alexhenderson.info https://twitter.com/PeterGardnerUoM https://twitter.com/AlexHenderson00
  • 3. THE CLASS IMBALANCE PROBLEM ALL THINGS BEING (UN)EQUAL
  • 4. TISSUE PATHOLOGY Epithelium 24.3% Smooth Muscle 50.7% Lymphocytes 2.5% Blood 0.2% Concretion 0.0% Fibrous Stroma 12.3% ECM 10.0% Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D; https://doi.org/10.1117/12.2043290 H&E stained prostate tissue False colour histopatholog y classification
  • 5. MACHINE LEARNING: ENSEMBLE METHODS BOOSTING AND BAGGING; TREES AND STUMPS
  • 6. ENSEMBLE METHODS IN MACHINE LEARNING Machine learning: Collection (committee) of weak learners
  • 7. LEARNERS: THE WEAK VERSUS THE STRONG One strong learner  Difficult to build  Need lots of information  Specialised to problem  Can overfit Many weak learners  Easy to build  Each learner is barely better than guessing  Generality
  • 8. LEARNERS: THE WEAK VERSUS THE STRONG One strong learner  Difficult to build  Need lots of information  Specialised to problem  Can overfit Many weak learners  Easy to build  Each learner is barely better than guessing  Generality The Incredible Hulk. Avengers: Endgame V For Vendetta
  • 9. DECISION TREE  Most common weak learner  Each node defines a question  Variables can be Boolean, categories, or numeric ranges  Most critical question first, less important questions follow https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
  • 10. RANDOM FORESTS™  Ensemble (collection) of decision trees  Each tree gets different variables  Many branches  Many leaves  Trees built in parallel  Example of ‘bagging’ (bootstrap aggregation) Trademark Leo Breiman & Adele Cutler https://www.flickr.com/photos/125012285@N07/14478851169/in/photostream/
  • 11. DECISION STUMP  Very weak learner (~51%)  Only most critical question considered https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
  • 12. ADABOOST  Ensemble of decision tree stumps  Each tree gets different variables  One decision  Two leaves  Iterative  Example of ‘boosting’ Effectively a forest of stumps https://www.conserve-energy-future.com/causes-effects-solutions-of-deforestation.php
  • 14. ADAPTIVE BOOSTING: ADABOOST MOST COMMON BOOSTING APPROACH
  • 15. ADABOOST ITERATIONS Iteration 1: weak classifier → misclassified samples upweighted, correctly classified downweighted → Iteration 2: weak classifier → misclassified samples upweighted, correctly classified downweighted → Iteration 3: weak classifier → combine iterations, with weightings → stronger classifier model
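One round of the reweighting loop on this slide can be written out in a few lines of NumPy. This is a schematic of the discrete AdaBoost update, assuming labels in {-1, +1}; `stump_predict` stands in for any weak learner and is a trivial threshold invented for illustration:

```python
# Schematic of one AdaBoost sample-weight update (labels y in {-1, +1}).
import numpy as np

def stump_predict(X):
    # A hypothetical stump: a single threshold on the first variable.
    return np.where(X[:, 0] > 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

w = np.full(len(y), 1.0 / len(y))      # start with equal weights
h = stump_predict(X)
err = np.sum(w[h != y])                # weighted error of this weak learner
alpha = 0.5 * np.log((1 - err) / err)  # this learner's vote in the final committee

# Misclassified samples are upweighted, correct ones downweighted.
# Note we change the samples' importance, not the spectral data itself:
w = w * np.exp(-alpha * y * h)
w = w / w.sum()                        # renormalise to a distribution
```

The next stump is then trained against the reweighted samples, so it is more likely to get the previously misclassified ones right.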
  • 16. METHODS OF MANAGING CLASS IMBALANCE UNDER-SAMPLING, OVER-SAMPLING
  • 17. TISSUE DATA Breast cancer TMA Biomax BR20832 40 cores stage II breast cancer 10 cores normal-associated tissue Top: H&E images A = cancer B = normal associated tissue Bottom: FT-IR images Red = cancerous epithelium Purple = cancerous stroma Green = NAT epithelium Orange = NAT stroma https://www.biomax.us/tissue-arrays/Breast/BR20832
  • 18. UNDER-SAMPLING – Easiest method to understand – Determine class with the fewest members – Randomly delete members of other classes until all have the same number – Discards much of the data, training set reduced – Resulting model is weaker – Remains unbiased, but with higher variance [Bar chart: data retained vs data discarded for Classes 1–4 under under-sampling]
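The under-sampling step above can be sketched in NumPy. The class labels and sizes are made up for illustration:

```python
# Sketch: random under-sampling — reduce every class to the size of the
# smallest class by deleting members at random. No duplicates are created.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 300 + [2] * 50)  # unbalanced toy labels

smallest = min(np.sum(y == c) for c in np.unique(y))
keep = np.concatenate([
    rng.choice(np.where(y == c)[0], size=smallest, replace=False)
    for c in np.unique(y)
])
print(np.bincount(y[keep]))  # every class now has `smallest` members
```

All retained spectra remain unique, but here 1100 of the 1250 acquired spectra are discarded, which is why the resulting model has higher variance.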
  • 19. OVER-SAMPLING – Determine class with the most members – Duplicate members of other classes to reach this number – Increases training data size – Many approaches [Bar chart: original data vs duplicates for Classes 1–4 under over-sampling]
  • 20. OVER-SAMPLING APPROACHES Class 1 – majority – N samples Class 2 – minority – P samples N >> P • Duplicate all samples in class 2, N-P times • Randomly select N samples from class 2 (sampling with replacement) • Randomly select N-P samples from class 2 and append to original class 2 • Interpolate some class 2 members and append (example is SMOTE† ) †BMC Bioinformatics, 2013, 14, 106. https://doi.org/10.1186/1471-2105-14-106 Other approaches are available
  • 21. OVER-SAMPLING APPROACHES Assume class 1 is majority with N samples Class 2 is minority with P samples N >> P • Duplicate all samples in class 2, N-P times • Randomly select N samples from class 2 (sampling with replacement) • Randomly select N-P samples from class 2 and append to original class 2 • Interpolate some class 2 members and append (example is SMOTE) https://en.wikipedia.org/wiki/Bootstrapping_(statistics) All data in minority class is represented. Duplicates are ‘random sampling with replacement’ (Bootstrap)
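Method three above (keep every minority sample once, then bootstrap only the shortfall) can be sketched as follows, using made-up class sizes matching the 90:10 case:

```python
# Sketch: over-sampling by appending N - P bootstrap draws (sampling with
# replacement) to the P original minority-class samples.
import numpy as np

rng = np.random.default_rng(0)
N, P = 4500, 500             # majority and minority class sizes
minority_idx = np.arange(P)  # indices of the minority-class spectra

extra = rng.choice(minority_idx, size=N - P, replace=True)  # bootstrap the shortfall
oversampled = np.concatenate([minority_idx, extra])

# Every original minority spectrum appears at least once; the rest are duplicates.
print(len(oversampled), len(np.unique(oversampled)))
```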
  • 22. RESULTS SAME INDEPENDENT TEST SET USED THROUGHOUT
  • 23. DIFFERENT CLASS SIZES: BALANCED DATA AdaBoost Random Forests
  • 24. UNDER-SAMPLING TRAINING SETS
    Ratio | Num cancer | Num NAT | Total
    50:50 | 2500 | 2500 | 5000
    50:50 | 2000 | 2000 | 4000
    50:50 | 1500 | 1500 | 3000
    50:50 | 1000 | 1000 | 2000
    50:50 | 500 | 500 | 1000
    Data sets are balanced, but can become small. All spectra are unique.
  • 26. OVER-SAMPLING TRAINING SETS
    Initial ratio | Num cancer | NAT composition (U = unique, D = duplicate) | Num NAT | Total
    50:50 | 2500 | U U U U U | 2500 | 5000
    60:40 | 3000 | U U U U D D | 3000 | 6000
    70:30 | 3500 | U U U D D D D | 3500 | 7000
    80:20 | 4000 | U U D D D D D D | 4000 | 8000
    90:10 | 4500 | U D D D D D D D D | 4500 | 9000
    Data sets are balanced, but can become large. All cancer spectra are unique, but many NAT spectra are duplicates.
  • 28. ADABOOST: UNDER AND OVER Under-sample Over-sample
  • 30. CONCLUSION – Both models correctly classify > 90% of samples – Models built with unbalanced classes can be misleading – AdaBoost slightly better at classification – Random Forests remains relatively stable until very small class sizes – AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high
  • 31. You don't understand! I could’ve been a contender. I could've had class… Real class. On the Waterfront
  • 32. CONCLUSION – Both models correctly classify > 90% of samples – Models built with unbalanced classes can be misleading – AdaBoost slightly better at classification – Random Forests remains relatively stable until very small class sizes – AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high

Editor's Notes

  1. Hello. I’d like to thank the organizers for giving me this opportunity to tell you about some work we’ve been doing in Manchester, using machine learning to look at unbalanced classes.
  2. My name is Alex Henderson, and this presentation outlines work recently published in the Analyst, which is available Open Access. Both the raw, and processed, data are available on Zenodo, and this video and slide deck will be made available from my and the group’s website, following the conference. I think it’s only fair to point out that Jennie did all the work, and I only hope I can do a good job of representing her today!
  3. So, what is the class imbalance problem?
  4. Consider a piece of tissue, stained with H&E to highlight the cell morphology. We can analyse this using infrared, [CLICK] and build a model to identify various cell types. Note, however, that there is a wide range in the composition of the tissue. Some cell types only appear in very low abundance. And it’s this difference in the number of spectra in each class, that can present a problem when we come to build our chemometric models.
  5. In this study we have explored adaptive boosting - or AdaBoost - and compared its performance against the Random Forests algorithm, now used by a number of groups, including ourselves.
  6. Both AdaBoost and Random Forests fall into the category of Ensemble Methods. An ‘ensemble’ is just another way of saying ‘a collection’, where the members of that collection are of the same type, but possibly different state. Ensemble methods use collections of what are called - ‘weak learners’ - to attack the problem at hand.
  7. These methods use many weak learners, rather than a single strong learner. Strong learners can be difficult to build and may require a lot of data. They are tuned to the problem at hand, but can overfit if tuned too closely. Weak learners on the other hand are relatively easy to build. The term ‘weak learner’ comes from the idea that they are not really very good at learning! A single weak learner has a success rate of barely over 50%; only just better than guessing, or tossing a coin. However, when brought together en masse, they gel to form good models. Better than the sum of their parts, you could say!
  8. So, while a strong learner will be useful for specific challenges, weak learners benefit from: ‘the wisdom of the crowds’.
  9. The most common weak learner in ensemble learning is the decision tree, and these are used in both Random Forests and AdaBoost. Here, the variable that best separates the training set data, becomes the ‘root node’. The data is then split into different branches. Each branch is considered separately, and the best variable for that branch becomes the decision point for the next split. The same variables can appear in different branches, in different orders, since the source data is changing after each split. Eventually no further splits are required, and the outcome appears in leaf nodes. Remember that these trees are not meant to be very good at making decisions! That’s the whole point!
  10. A random forest is a collection of decision trees, with each tree being given a different set of variables. This prevents any single variable from dominating in the resulting model.
  11. For boosting approaches, AdaBoost being the first and most common, we make the decision trees even more ‘dumb’, by only allowing a single decision split. This produces, what’s called a ‘decision tree stump’. The root node is still defined around the variable that is most ‘important’ in separating the data in the training set, but other variables don’t get a look in. Because there is only one split, the tree can’t ‘refine’ its decision, so it just has to go, with what it’s got.
  12. So, AdaBoost uses a collection of decision tree stumps, rather than full trees. Each tree gets different variables in the same way as Random Forests, but the trees only get to make a single choice. The main difference between boosting techniques, such as AdaBoost, and a bagging approach like Random Forests, is that boosting is ‘iterative’. So AdaBoost is effectively a forest of stumps…
  13. [CLICK] …not to be confused with… …a Forrest of Gumps! Sorry, couldn’t resist!
  14. The name AdaBoost is short for Adaptive Boosting. In this case the adaptive part is introduced by iteration and weighting.
  15. [CLICK] To start with all samples are weighted equally. The decision tree (stump) then identifies a parameter that can split the data into class A or class B; in this case triangles and squares. Any samples that were misclassified are then upweighted, with those correctly classified being downweighted. These modified data are then presented to a new decision tree. Since the weights on the previously misclassified samples are now higher, they are more likely to be correctly classified. Now, it is important to point out here that we’re not multiplying the spectral data points by this weighting; we’re changing their relative importance to the algorithm. Next the misclassified samples from this second iteration are upweighted, with the correctly classified samples being downweighted, and we go for a third iteration. After three iterations we stop, we combine the iterations and produce the ‘outcome’ of that tree ‘set’. So, by iterating, and biasing each iteration in favour of samples that were wrongly classified in previous steps, we produce a stronger classifier. This might not be a VERY strong classifier, but it will be used in combination with others in the overall algorithm. As with the Random Forests approach, when we introduce test data, each tree (or tree set) gets a vote for whichever class it thinks that test sample should fall into. There are various metrics that can be used here, but the majority vote is the easiest to think about and easiest to apply.
  16. So, now we have our problem, and two potential algorithms to apply, how well do they work when presented with unbalanced data?
  17. To assess this we used a tissue microarray containing breast cancer tissue from 208 patients. We selected 40 cores relating to cancer and 10 relating to normal associated tissue. Normal associated tissue is tissue from regions adjacent to a tumour from non-malignant cores. You don’t usually get access to healthy tissue. After all, most people don’t want to have a biopsy unless there is some VERY GOOD underlying medical reason! We manually annotated these tissues, according to W.H.O. guidelines, and identified regions corresponding to cancerous epithelium and normal associated epithelium. We also annotated normal and cancerous stroma, but those spectra were not included in this study.
  18. So, the first sampling method we will take a look at is under-sampling. In this method we identify the class with the fewest members and reduce all other classes to that number. This is simple to understand and to apply. The downside is that we tend to throw away lots of data. If the smallest class is much smaller than the others, we will end up discarding most of the data acquired. This has the knock-on effect of weakening the model because the data available for the training set will be a smaller sample of the acquired population. The good thing about under-sampling is that all the spectra remain unique, there are no duplicates. The model will be unbiased, but will have a higher variance.
  19. The opposite of under-sampling is, of course, over-sampling! In this scenario we increase the numbers in each of the minority classes to match the class with the most members. This will increase the size of the training set, which could be problematic for the target algorithm or computational resource available. The biggest problem, however, comes when we have to decide on where these increased numbers will come from.
  20. There are lots of methods we can choose to over-sample our data. Here I’ve listed four. The first simply takes a copy of the smaller class and appends it to itself. We can repeat this until we reach the size of the larger class. Of course we will never get an exact match, well pretty unlikely anyway, so we need a method of dealing with the over/under hang. We can simply ignore this and say our classes are now much more similar, or we can use some form of randomisation to get the exact number. This has the benefit of each spectrum in the minority class being equally represented in the newly generated group; well without taking into account the randomness if that’s the way we want to go. And, of course, there are other approaches we could take. The second approach uses something like a Bootstrap sampling approach, which is ‘sampling with replacement’, to randomly re-generate the minority class. Bootstrap has low bias and variance, but there could be samples, that never actually get selected. That means we are throwing away some original data. Method three is similar to method two, except we ensure all the minority class are included and only Bootstrap the required difference. Then there is the option of changing the data. The first three methods simply selected (or didn’t select) the spectra in the minority class. Another approach is to interpolate some of the spectra to generate data that was never actually acquired. One of these methods is called SMOTE and is discussed in a paper by Blagus and Lusa.
  21. However, in this work we decided to go with method 3. This has the advantage of ensuring all the data acquired, relating to the minority class, are actually included in the training set, and any duplication being handled by the well-respected Bootstrap method.
  22. So how did we get on? First I should mention that the same independent test set was used in all cases. In addition we tried as much as possible to create training sets that were built by either expanding or contracting existing training sets, rather than generating each one randomly. This has the advantage of showing the variation in having larger or smaller data sets, rather than new ones created randomly. If we were to create lots of random data sets, some trends might be hidden. In all cases the exercise of generating training sets and testing them was repeated 5 times. But with the same independent test set used in each case.
  23. So, it’s useful to get some ground truth, so we know whether any changes we see as a function of sampling, are actually due to the change in the size of the training sets themselves. We created balanced sets of different size from 2,500 per class, down to 10. As you can see both algorithms perform surprisingly well. It’s not until we get down to 100 samples per class that AdaBoost starts to fall over. At this point all samples are being classified as normal-associated. However, when we have large numbers per class it performs a little better than Random Forests. Although, you have to say that classification accuracies of 90% and over are really rather good: it is worth pointing out here that all these data are generated from the same TMA, so accuracies of this level will probably not be maintained across different samples, instruments etc. However, using the same sample has the benefit of removing these additional sources of error, so we can concentrate on the performance of the algorithms themselves, and the sampling methods. On the right, we can see that the Random Forests method stays pretty strong beyond 100 samples, and can even generate a reasonable result with only 10 samples per class!
  24. So, taking a closer view of the left hand side of that plot, we generated some under-sampled training data. Each of these training sets has the same number of cancer and normal associated spectra, but as the size of the minority set gets smaller, you can see we end up throwing away lots of the majority class to match.
  25. AdaBoost appears to out-perform Random Forests with the normal-associated tissue being almost perfectly classified for all samples sizes. Although to be fair, they both do pretty well. The cancer samples do not perform quite as well, so more are being misclassified as the training set gets smaller. The variability in the Random Forests data is slightly larger too.
  26. Over-sampling is a bit more complicated. The red box in the table on the right indicates the spectra that are unique. That includes all the cancer spectra and normal-associated spectra originally in the samples. In order to over-sample we randomly duplicate more and more of the normal-associated, to keep up with the growing cancer data set. The dark blue squares labelled - D - represent duplicates, while the light blue squares labelled - U - represent the original spectra. As you can see, by the time we have a ratio of 9 to 1 we have 4,500 cancer spectra, each of which is unique, but only 500 unique normal-associated spectra. From these 500 we now need to randomly select another 4,000 spectra.
  27. So, how does this duplication affect the outcome? Well, the AdaBoost method still seems to perform strongly. Note that the two lines cross over when our ratio is very large. This is probably due to the duplication in the normal-associated data leading to overfitting and that being reflected in its inability to correctly classify the test data. The Random Forests method performs less well, and appears to be more influenced by the duplication than AdaBoost.
  28. It’s worth taking a moment to compare the two sampling methods, using the same algorithm. With AdaBoost it looks like over-sampling works best and the level of classification accuracy remains fairly constant as the sample sizes change.
  29. However, with Random Forests we get a different answer. Note how under-sampling improves the normal-associated accuracy, while the cancer samples become less well classified. However, with over-sampling we get the opposite effect. The cancer samples get better, but the normal-associated fall away. This is worrying because it means we could get a different answer depending on the choice of algorithm AND the choice of sampling method.
  30. So, what did we learn from doing this work? Firstly, on this, admittedly, limited, data set, we can see that infrared does a good job of classifying cancer from non-cancer data. We have been discussing values in the 80-95% accuracy range, and, even allowing, for the use of a single instrument and a single TMA, this is an indication that IR is useful here. However, we need to be careful in our choice of algorithm and sampling method because our results could be misleading. AdaBoost seems to be slightly better at classification, and both AdaBoost and Random Forests will give good accuracy down to about 100 spectra per class (under-sampled). And, Random Forests remains relatively stable until we reach very small class sizes; in the 10s. AdaBoost seems to be stable to over-sampling, while Random Forests is only stable for ranges that are relatively close; down to about 70:30. Coming back to our original question, for unbalanced classes, will AdaBoost come to the rescue?
  31. Well, I think the jury is still out. However, I think AdaBoost IS a contender, and we should do more work in this area to see how useful it can be.
  32. Thanks for listening