Here we discuss the issues that arise when applying the Random Forests and AdaBoost machine learning methods to infrared spectroscopy data sets in which the number of samples in each class varies.
Invited presentation at the 11th International Conference on Advanced Vibrational Spectroscopy (ICAVS-11), 23-26 August 2021. This was a virtual conference.
This presentation relates to our paper in Analyst "Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets" by Jiayi Tang, Alex Henderson and Peter Gardner.
Paper: https://doi.org/10.1039/D0AN02155E (available open access, CC-BY).
Raw data: https://doi.org/10.5281/zenodo.4986399 (CC-BY)
Processed data, and MATLAB source code: https://doi.org/10.5281/zenodo.4730312 (CC-BY)
Abstract
The use of infrared spectroscopy to augment decision-making in histopathology is a promising direction for the diagnosis of many disease types. Hyperspectral images of healthy and diseased tissue, generated by infrared spectroscopy, are used to build chemometric models that can provide objective metrics of disease state. It is important to build robust and stable models to provide confidence to the end user. The data used to develop such models can have a variety of characteristics which can pose problems to many model-building approaches. Here we have compared the performance of two machine learning algorithms – AdaBoost and Random Forests – on a variety of non-uniform data sets. Using samples of breast cancer tissue, we devised a range of training data capable of describing the problem space. Models were constructed from these training sets and their characteristics compared. In terms of separating infrared spectra of cancerous epithelium tissue from normal-associated tissue on the tissue microarray, both AdaBoost and Random Forests algorithms were shown to give excellent classification performance (over 95% accuracy) in this study. AdaBoost models were more robust when datasets with large imbalance were provided. The outcomes of this work are a measure of classification accuracy as a function of training data available, and a clear recommendation for choice of machine learning approach.
2. BACKGROUND AND RESOURCES
Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets
Analyst, May 2021
Open access: https://doi.org/10.1039/D0AN02155E
Data and source code
Raw: https://doi.org/10.5281/zenodo.4986399
Processed: https://doi.org/10.5281/zenodo.4730312
Media
Video and slide deck: https://alexhenderson.info
Jiayi (Jennie) Tang, Alex Henderson, Peter Gardner
https://gardner-lab.com
https://alexhenderson.info
https://twitter.com/PeterGardnerUoM
https://twitter.com/AlexHenderson00
6. ENSEMBLE METHODS IN MACHINE LEARNING
Machine learning: Collection (committee) of weak learners
7. LEARNERS: THE WEAK VERSUS THE STRONG
One strong learner
Difficult to build
Need lots of information
Specialised to problem
Can overfit
Many weak learners
Easy to build
Each learner is barely better than guessing
Generality
8. LEARNERS: THE WEAK VERSUS THE STRONG
[Repeat of slide 7, with film stills: The Incredible Hulk (Avengers: Endgame); V for Vendetta]
9. DECISION TREE
Most common weak learner
Each node defines a question
Variables can be Boolean, categories, or numeric ranges
Most critical question first, less important questions follow
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
10. RANDOM FORESTS™
Ensemble (collection) of decision trees
Each tree gets different variables
Many branches
Many leaves
Trees built in parallel
Example of ‘bagging’ (bootstrap aggregation)
Trademark of Leo Breiman & Adele Cutler
https://www.flickr.com/photos/125012285@N07/14478851169/in/photostream/
11. DECISION STUMP
Very weak learner (~51% accuracy)
Only the most critical question considered
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
12. ADABOOST
Ensemble of decision tree stumps
Each tree gets different variables
One decision
Two leaves
Iterative
Example of ‘boosting’
Effectively a forest of stumps
https://www.conserve-energy-future.com/causes-effects-solutions-of-deforestation.php
17. TISSUE DATA
Breast cancer TMA
Biomax BR20832
40 cores stage II breast cancer
10 cores normal-associated tissue
Top: H&E images
A = cancer
B = normal associated tissue
Bottom: FT-IR images
Red = cancerous epithelium
Purple = cancerous stroma
Green = NAT epithelium
Orange = NAT stroma
https://www.biomax.us/tissue-arrays/Breast/BR20832
18. UNDER-SAMPLING
Easiest method to understand
Determine class with the fewest members
Randomly delete members of other classes until all have the same number
Discards much of the data, training set reduced
Resulting model is weaker
Remains unbiased, but with higher variance
[Bar chart: under-sampling of four classes; y-axis 0–1000 members; legend: data retained vs data discarded]
19. OVER-SAMPLING
Determine class with the most members
Duplicate members of other classes to reach this number
Increases training data size
Many approaches
[Bar chart: over-sampling of four classes; y-axis 0–1000 members; legend: original data vs duplicates]
20. OVER-SAMPLING APPROACHES
Class 1 – majority – N samples
Class 2 – minority – P samples
N >> P
• Duplicate all samples in class 2, N−P times
• Randomly select N samples from class 2 (sampling with replacement)
• Randomly select N−P samples from class 2 and append to original class 2
• Interpolate some class 2 members and append (example is SMOTE†)
†BMC Bioinformatics, 2013, 14, 106. https://doi.org/10.1186/1471-2105-14-106
Other approaches are available
21. OVER-SAMPLING APPROACHES
Assume class 1 is the majority with N samples
Class 2 is the minority with P samples
N >> P
• Duplicate all samples in class 2, N−P times
• Randomly select N samples from class 2 (sampling with replacement)
• Randomly select N−P samples from class 2 and append to original class 2
• Interpolate some class 2 members and append (example is SMOTE)
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
All data in the minority class are represented. Duplicates are ‘random sampling with replacement’ (Bootstrap).
26. OVER-SAMPLING TRAINING SETS
Data sets are balanced, but can become large
All cancer spectra are unique, but many NAT spectra are duplicates
Initial ratio   Num cancer   Over-sampled NAT (U = unique, D = duplicate)   Num NAT   Total
50:50           2500         U U U U U                                      2500      5000
60:40           3000         U U U U D D                                    3000      6000
70:30           3500         U U U D D D D                                  3500      7000
80:20           4000         U U D D D D D D                                4000      8000
90:10           4500         U D D D D D D D D                              4500      9000
30. CONCLUSION
Both models correctly classify > 90% of samples
Models built with unbalanced classes can be misleading
AdaBoost slightly better at classification
Random Forests remains relatively stable until very small class sizes
AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high
31. “You don’t understand! I could’ve been a contender. I could’ve had class… Real class.” — On the Waterfront
32. CONCLUSION
[Repeat of slide 30]
Editor's Notes
Hello. I’d like to thank the organizers for giving me this opportunity to tell you about some work we’ve been doing in Manchester, using machine learning to look at unbalanced classes.
My name is Alex Henderson, and this presentation outlines work recently published in the Analyst, which is available Open Access.
Both the raw, and processed, data are available on Zenodo, and this video and slide deck will be made available from my and the group’s website, following the conference.
I think it’s only fair to point out that Jennie did all the work, and I only hope I can do a good job of representing her today!
So, what is the class imbalance problem?
Consider a piece of tissue, stained with H&E to highlight the cell morphology.
We can analyse this using infrared, [CLICK] and build a model to identify various cell types. Note, however, that there is a wide range in the composition of the tissue. Some cell types only appear in very low abundance.
And it’s this difference in the number of spectra in each class, that can present a problem when we come to build our chemometric models.
In this study we have explored adaptive boosting - or AdaBoost - and compared its performance against the Random Forests algorithm, now used by a number of groups, including ourselves.
Both AdaBoost and Random Forests fall into the category of Ensemble Methods.
An ‘ensemble’ is just another way of saying ‘a collection’, where the members of that collection are of the same type, but possibly different state.
Ensemble methods use collections of what are called - ‘weak learners’ - to attack the problem at hand.
These methods use many weak learners, rather than a single strong learner.
Strong learners can be difficult to build and may require a lot of data. They are tuned to the problem at hand, but can overfit if tuned too closely.
Weak learners on the other hand are relatively easy to build. The term ‘weak learner’ comes from the idea that they are not really very good at learning! A single weak learner has a success rate of barely over 50%; only just better than guessing, or tossing a coin.
However, when brought together en masse, they gel to form good models. Better than the sum of their parts, you could say!
So, while a strong learner will be useful for specific challenges, weak learners benefit from: ‘the wisdom of the crowds’.
The most common weak learner in ensemble learning is the decision tree, and these are used in both Random Forests and AdaBoost.
Here, the variable that best separates the training set data, becomes the ‘root node’. The data is then split into different branches. Each branch is considered separately, and the best variable for that branch becomes the decision point for the next split. The same variables can appear in different branches, in different orders, since the source data is changing after each split.
Eventually no further splits are required, and the outcome appears in leaf nodes.
Remember that these trees are not meant to be very good at making decisions! That’s the whole point!
A random forest is a collection of decision trees, with each tree being given a different set of variables. This prevents any single variable from dominating in the resulting model.
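The row and feature sampling just described can be sketched in a few lines of Python. This is purely illustrative (the paper's own code is MATLAB), and it simplifies one detail: a true Random Forest re-draws the candidate feature subset at every split, whereas this sketch draws one subset per tree.

```python
import random

def bootstrap_and_features(X, n_features_per_tree, seed=0):
    """Bagging sketch: for one tree, draw a bootstrap sample of the rows
    (sampling with replacement) and a random subset of the feature indices.
    Simplification: Random Forests re-draws the feature subset at every
    split; here we draw a single subset per tree for clarity."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    rows = [rng.randrange(n) for _ in range(n)]        # bootstrap row indices
    feats = rng.sample(range(d), n_features_per_tree)  # feature subset
    return rows, feats
```

Because each tree sees a different bootstrap sample and a different feature subset, no single variable can dominate the whole forest.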
For boosting approaches, AdaBoost being the first and most common, we make the decision trees even more ‘dumb’ by only allowing a single decision split. This produces what’s called a ‘decision tree stump’. The root node is still defined around the variable that is most ‘important’ in separating the data in the training set, but other variables don’t get a look in. Because there is only one split, the tree can’t ‘refine’ its decision, so it just has to go with what it’s got.
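A decision stump of this kind can be sketched as an exhaustive search for the single best split. This is a toy Python illustration (not the authors' implementation); real libraries scan candidate thresholds far more efficiently.

```python
def fit_stump(X, y, weights):
    """Fit a one-split 'decision stump': choose the feature, threshold and
    polarity that minimise the weighted misclassification error.
    X: list of feature vectors; y: labels in {-1, +1}; weights: sample weights.
    Returns (weighted error, feature index, threshold, polarity)."""
    best = None
    n_features = len(X[0])
    for f in range(n_features):
        for thresh in sorted({row[f] for row in X}):
            for polarity in (+1, -1):
                # predict +1 when feature > threshold (or the reverse polarity)
                preds = [polarity if row[f] > thresh else -polarity for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thresh, polarity)
    return best
```

On linearly separable toy data the stump finds a perfect split; on real spectra it is deliberately only slightly better than a coin toss.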
So, AdaBoost uses a collection of decision tree stumps, rather than full trees. Each tree gets different variables in the same way as Random Forests, but the trees only get to make a single choice.
The main difference between boosting techniques, such as AdaBoost, and a bagging approach like Random Forests, is that boosting is ‘iterative’.
So AdaBoost is effectively a forest of stumps…
[CLICK] …not to be confused with…
…a Forrest of Gumps!
Sorry, couldn’t resist!
The name AdaBoost is short for Adaptive Boosting. In this case the adaptive part is introduced by iteration and weighting.
[CLICK] To start with all samples are weighted equally. The decision tree (stump) then identifies a parameter that can split the data into class A or class B; in this case triangles and squares.
Any samples that were misclassified are then upweighted, with those correctly classified being downweighted. These modified data are then presented to a new decision tree. Since the weights on the previously misclassified samples are now higher, they are more likely to be correctly classified. Now, it is important to point out here that we’re not multiplying the spectral data points by this weighting; we’re changing their relative importance to the algorithm.
Next the misclassified samples from this second iteration are upweighted, with the correctly classified samples being downweighted, and we go for a third iteration.
After three iterations we stop, we combine the iterations and produce the ‘outcome’ of that tree ‘set’.
So, by iterating, and biasing each iteration in favour of samples that were wrongly classified in previous steps, we produce a stronger classifier. This might not be a VERY strong classifier, but it will be used in combination with others in the overall algorithm.
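The iterate-and-reweight loop described above can be sketched in Python on toy one-dimensional data with threshold stumps. This is a minimal sketch of the standard AdaBoost update, not the authors' MATLAB implementation:

```python
import math

def adaboost_1d(x, y, n_rounds=3):
    """Tiny AdaBoost sketch: 1-D data, threshold stumps, n_rounds iterations.
    x: list of floats; y: labels in {-1, +1}. Illustrative only."""
    n = len(x)
    w = [1.0 / n] * n                      # start with equal sample weights
    ensemble = []                          # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # pick the threshold/polarity pair with the smallest weighted error
        err, t, s = min(
            ((sum(wi for wi, xi, yi in zip(w, x, y)
                  if (s if xi > t else -s) != yi), t, s)
             for t in sorted(set(x)) for s in (+1, -1)),
            key=lambda e: e[0])
        err = max(err, 1e-10)              # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this weak learner
        ensemble.append((alpha, t, s))
        # upweight misclassified samples, downweight correct ones
        w = [wi * math.exp(-alpha * yi * (s if xi > t else -s))
             for wi, xi, yi in zip(w, x, y)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalise to a distribution
    def predict(xi):
        # weighted vote of all stumps
        score = sum(a * (s if xi > t else -s) for a, t, s in ensemble)
        return 1 if score > 0 else -1
    return predict
```

Note that, as in the talk, the weights change each sample's importance to the learner; the data values themselves are never multiplied.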
As with the Random Forests approach, when we introduce test data, each tree (or tree set) gets a vote for whichever class it thinks that test sample should fall into. There are various metrics that can be used here, but the majority vote is the easiest to think about and easiest to apply.
So, now we have our problem, and two potential algorithms to apply, how well do they work when presented with unbalanced data?
To assess this we used a tissue microarray containing breast cancer tissue from 208 patients. We selected 40 cores relating to cancer and 10 relating to normal-associated tissue. Normal-associated tissue is non-malignant tissue taken from regions adjacent to a tumour. You don’t usually get access to healthy tissue. After all, most people don’t want to have a biopsy unless there is some VERY GOOD underlying medical reason!
We manually annotated these tissues, according to W.H.O. guidelines, and identified regions corresponding to cancerous epithelium and normal associated epithelium. We also annotated normal and cancerous stroma, but those spectra were not included in this study.
So, the first sampling method we will take a look at is under-sampling.
In this method we identify the class with the fewest members and reduce all other classes to that number. This is simple to understand and to apply. The downside is that we tend to throw away lots of data. If the smallest class is much smaller than the others, we will end up discarding most of the data acquired. This has the knock-on effect of weakening the model because the data available for the training set will be a smaller sample of the acquired population.
The good thing about under-sampling is that all the spectra remain unique; there are no duplicates. The model will be unbiased, but will have a higher variance.
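Random under-sampling as described here can be sketched in a few lines of Python (an illustrative sketch, not the authors' code):

```python
import random

def undersample(classes, seed=0):
    """Random under-sampling sketch: shrink every class to the size of the
    smallest one by sampling WITHOUT replacement, so no duplicates appear
    and the surplus data are simply discarded.
    classes: dict mapping class label -> list of samples."""
    rng = random.Random(seed)
    n_min = min(len(members) for members in classes.values())
    return {label: rng.sample(members, n_min)
            for label, members in classes.items()}
```

With a 40:10 core imbalance like the one in this study, most of the majority-class spectra would be thrown away, which is exactly the weakness discussed above.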
The opposite of under-sampling is, of course, over-sampling!
In this scenario we increase the numbers in each of the minority classes to match the class with the most members. This will increase the size of the training set, which could be problematic for the target algorithm or computational resource available.
The biggest problem, however, comes when we have to decide on where these increased numbers will come from.
There are lots of methods we can choose to over-sample our data. Here I’ve listed four.
The first simply takes a copy of the smaller class and appends it to itself. We can repeat this until we reach the size of the larger class. Of course we will never get an exact match, well pretty unlikely anyway, so we need a method of dealing with the overshoot or shortfall. We can simply ignore this and say our classes are now much more similar, or we can use some form of randomisation to get the exact number.
This has the benefit of each spectrum in the minority class being equally represented in the newly generated group; well without taking into account the randomness if that’s the way we want to go.
And, of course, there are other approaches we could take.
The second approach uses something like a Bootstrap sampling approach, which is ‘sampling with replacement’, to randomly re-generate the minority class. Bootstrap has low bias and variance, but there could be samples that never actually get selected. That means we are throwing away some original data.
Method three is similar to method two, except we ensure all the minority class are included and only Bootstrap the required difference.
Then there is the option of changing the data. The first three methods simply selected (or didn’t select) the spectra in the minority class. Another approach is to interpolate some of the spectra to generate data that was never actually acquired. One of these methods is called SMOTE and is discussed in a paper by Blagus and Lusa.
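The interpolation idea can be sketched as follows. This is a SMOTE-style sketch only: the real SMOTE algorithm interpolates between a sample and one of its k nearest neighbours, whereas this toy version interpolates between two randomly chosen minority samples.

```python
import random

def smote_like(minority, n_new, seed=0):
    """SMOTE-style interpolation sketch (NOT the full algorithm, which uses
    k-nearest neighbours): create synthetic points on the line segment
    between two randomly chosen minority-class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct minority samples
        lam = rng.random()               # position along the segment
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

The synthetic spectra were never acquired, which is precisely why this family of methods needs care.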
However, in this work we decided to go with method 3. This has the advantage of ensuring all the data acquired, relating to the minority class, are actually included in the training set, and any duplication being handled by the well-respected Bootstrap method.
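Method 3 can be sketched in a few lines of Python (an illustrative sketch of the approach, not the authors' MATLAB code):

```python
import random

def oversample_keep_all(minority, n_target, seed=0):
    """Over-sampling sketch for 'method 3': keep every original minority
    sample exactly once, then bootstrap (sample WITH replacement) only the
    n_target - len(minority) extra copies needed to balance the classes."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return list(minority) + extra
```

For example, at a 90:10 ratio the 500 unique normal-associated spectra would all be kept and a further 4,000 would be drawn from them with replacement.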
So how did we get on?
First I should mention that the same independent test set was used in all cases. In addition we tried as much as possible to create training sets that were built by either expanding or contracting existing training sets, rather than generating each one randomly. This has the advantage of showing the variation in having larger or smaller data sets, rather than new ones created randomly. If we were to create lots of random data sets, some trends might be hidden.
In all cases the exercise of generating training sets and testing them was repeated five times, with the same independent test set used in each case.
So, it’s useful to get some ground truth, so we know whether any changes we see as a function of sampling, are actually due to the change in the size of the training sets themselves.
We created balanced sets of different sizes, from 2,500 per class down to 10. As you can see, both algorithms perform surprisingly well. It’s not until we get down to 100 samples per class that AdaBoost starts to fall over. At this point all samples are being classified as normal-associated. However, when we have large numbers per class it performs a little better than Random Forests. Although, you have to say that classification accuracies of 90% and over are really rather good. It is worth pointing out here that all these data are generated from the same TMA, so accuracies of this level will probably not be maintained across different samples, instruments etc. However, using the same sample has the benefit of removing these additional sources of error, so we can concentrate on the performance of the algorithms themselves, and the sampling methods.
On the right, we can see that the Random Forests method stays pretty strong beyond 100 samples, and can even generate a reasonable result with only 10 samples per class!
So, taking a closer view of the left hand side of that plot, we generated some under-sampled training data. Each of these training sets has the same number of cancer and normal associated spectra, but as the size of the minority set gets smaller, you can see we end up throwing away lots of the majority class to match.
AdaBoost appears to out-perform Random Forests, with the normal-associated tissue being almost perfectly classified for all sample sizes. Although to be fair, they both do pretty well. The cancer samples do not perform quite as well, so more are being misclassified as the training set gets smaller. The variability in the Random Forests data is slightly larger too.
Over-sampling is a bit more complicated. The red box in the table on the right indicates the spectra that are unique. That includes all the cancer spectra and normal-associated spectra originally in the samples. In order to over-sample we randomly duplicate more and more of the normal-associated, to keep up with the growing cancer data set. The dark blue squares labelled - D - represent duplicates, while the light blue squares labelled - U - represent the original spectra. As you can see, by the time we have a ratio of 9 to 1 we have 4,500 cancer spectra, each of which is unique, but only 500 unique normal-associated spectra. From these 500 we now need to randomly select another 4,000 spectra.
So, how does this duplication affect the outcome? Well, the AdaBoost method still seems to perform strongly. Note that the two lines cross over when our ratio is very large. This is probably due to the duplication in the normal-associated data leading to overfitting and that being reflected in its inability to correctly classify the test data.
The Random Forests method performs less well, and appears to be more influenced by the duplication than AdaBoost.
It’s worth taking a moment to compare the two sampling methods, using the same algorithm. With AdaBoost it looks like over-sampling works best and the level of classification accuracy remains fairly constant as the sample sizes change.
However, with Random Forests we get a different answer. Note how under-sampling improves the normal-associated accuracy, while the cancer samples become less well classified. However, with over-sampling we get the opposite effect. The cancer samples get better, but the normal-associated fall away.
This is worrying because it means we could get a different answer depending on the choice of algorithm AND the choice of sampling method.
So, what did we learn from doing this work?
Firstly, on this, admittedly, limited, data set, we can see that infrared does a good job of classifying cancer from non-cancer data. We have been discussing values in the 80-95% accuracy range, and, even allowing, for the use of a single instrument and a single TMA, this is an indication that IR is useful here.
However, we need to be careful in our choice of algorithm and sampling method because our results could be misleading.
AdaBoost seems to be slightly better at classification, and both AdaBoost and Random Forests will give good accuracy down to about 100 spectra per class (under-sampled). And Random Forests remains relatively stable until we reach very small class sizes; in the tens.
AdaBoost seems to be stable to over-sampling, while Random Forests is only stable for ratios that are relatively close; down to about 70:30.
Coming back to our original question, for unbalanced classes, will AdaBoost come to the rescue?
Well, I think the jury is still out. However, I think AdaBoost IS a contender, and we should do more work in this area to see how useful it can be.