Here we discuss the issues that arise when applying Random Forests and AdaBoost data analysis methods to infrared spectroscopy data sets where the number of samples in each class varies.
Invited presentation at the 11th International Conference on Advanced Vibrational Spectroscopy (ICAVS-11), 23-26 August 2021. This was a virtual conference.
This presentation relates to our paper in Analyst "Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets" by Jiayi Tang, Alex Henderson and Peter Gardner.
Paper: https://doi.org/10.1039/D0AN02155E (available open access, CC-BY).
Raw data: https://doi.org/10.5281/zenodo.4986399 (CC-BY)
Processed data, and MATLAB source code: https://doi.org/10.5281/zenodo.4730312 (CC-BY)
Abstract
The use of infrared spectroscopy to augment decision-making in histopathology is a promising direction for the diagnosis of many disease types. Hyperspectral images of healthy and diseased tissue, generated by infrared spectroscopy, are used to build chemometric models that can provide objective metrics of disease state. It is important to build robust and stable models to provide confidence to the end user. The data used to develop such models can have a variety of characteristics which can pose problems to many model-building approaches. Here we have compared the performance of two machine learning algorithms – AdaBoost and Random Forests – on a variety of non-uniform data sets. Using samples of breast cancer tissue, we devised a range of training data capable of describing the problem space. Models were constructed from these training sets and their characteristics compared. In terms of separating infrared spectra of cancerous epithelium tissue from normal-associated tissue on the tissue microarray, both AdaBoost and Random Forests algorithms were shown to give excellent classification performance (over 95% accuracy) in this study. AdaBoost models were more robust when datasets with large imbalance were provided. The outcomes of this work are a measure of classification accuracy as a function of training data available, and a clear recommendation for choice of machine learning approach.
2. BACKGROUND AND RESOURCES
Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets
Analyst, May 2021
Open access: https://doi.org/10.1039/D0AN02155E
Data and source code
Raw: https://doi.org/10.5281/zenodo.4986399
Processed: https://doi.org/10.5281/zenodo.4730312
Media
Video and slide deck: https://alexhenderson.info
Jiayi (Jennie) Tang, Alex Henderson, Peter Gardner
https://gardner-lab.com
https://alexhenderson.info
https://twitter.com/PeterGardnerUoM
https://twitter.com/AlexHenderson00
6. ENSEMBLE METHODS IN MACHINE LEARNING
Machine learning: collection (committee) of weak learners
7. LEARNERS: THE WEAK VERSUS THE STRONG
One strong learner
Difficult to build
Need lots of information
Specialised to problem
Can overfit
Many weak learners
Easy to build
Each learner is barely better than guessing
Generality
8. LEARNERS: THE WEAK VERSUS THE STRONG
One strong learner
Difficult to build
Need lots of information
Specialised to problem
Can overfit
Many weak learners
Easy to build
Each learner is barely better than guessing
Generality
[Images: The Incredible Hulk, Avengers: Endgame; V for Vendetta]
9. DECISION TREE
Most common weak learner
Each node defines a question
Variables can be Boolean, categorical, or numeric ranges
Most critical question first, less important questions follow
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
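The node-as-question idea can be sketched in a few lines. The paper's released code is MATLAB; this is an illustrative Python sketch, and the feature names (intensity, tissue) are invented for the example, not taken from the study:

```python
# A decision tree is a cascade of questions, most informative first.
# Feature names here are hypothetical, purely for illustration.
def classify(sample):
    if sample["intensity"] > 0.6:             # numeric-range question first
        if sample["tissue"] == "epithelium":  # categorical question next
            return "cancer"
        return "stroma"
    return "normal"                           # fall-through for low intensity

print(classify({"intensity": 0.8, "tissue": "epithelium"}))  # prints: cancer
```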
10. RANDOM FORESTS™
Ensemble (collection) of decision trees
Each tree gets different variables
Many branches
Many leaves
Trees built in parallel
Example of ‘bagging’ (bootstrap aggregation)
Trademark of Leo Breiman & Adele Cutler
https://www.flickr.com/photos/125012285@N07/14478851169/in/photostream/
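The data-preparation side of the Random Forests recipe, where each tree gets a bootstrap sample of the rows and its own random subset of the variables, can be sketched as follows (a minimal stand-alone Python illustration, not the paper's MATLAB implementation):

```python
import random

def bagged_training_sets(data, n_features, n_trees, m_features, rng):
    """Bagging (bootstrap aggregation): for each tree, draw rows with
    replacement and pick a different random subset of the variables."""
    sets = []
    for _ in range(n_trees):
        rows = [rng.choice(data) for _ in data]            # with replacement
        feats = rng.sample(range(n_features), m_features)  # each tree's variables
        sets.append((rows, feats))
    return sets

rng = random.Random(0)
data = [([0.1 * i] * 5, i % 2) for i in range(10)]  # toy (features, label) pairs
sets = bagged_training_sets(data, n_features=5, n_trees=3, m_features=2, rng=rng)
```

Because each training set is independent, the trees can indeed be built in parallel, as the slide notes.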
11. DECISION STUMP
Very weak learner (~51%)
Only most critical question
considered
https://medium.datadriveninvestor.com/decision-trees-lesson-101-f00dad6cba21
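A stump that asks only the single most critical question can be written as an exhaustive search over one-feature splits. This is a toy Python sketch (the paper's code is MATLAB):

```python
def fit_stump(X, y):
    # Scan every (feature, threshold, direction) and keep the single
    # split with the fewest misclassified training points.
    best = (len(y) + 1, None)
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            for sign in (1, -1):
                preds = [1 if sign * (row[f] - t) > 0 else 0 for row in X]
                err = sum(p != yi for p, yi in zip(preds, y))
                if err < best[0]:
                    best = (err, (f, t, sign))
    return best[1]

def stump_predict(stump, x):
    f, t, sign = stump
    return 1 if sign * (x[f] - t) > 0 else 0

# One numeric feature, two classes; the stump finds the separating threshold.
stump = fit_stump([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1])
```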
12. ADABOOST
Ensemble of decision tree stumps
Each tree gets different variables
One decision
Two leaves
Iterative
Example of ‘boosting’
Effectively a forest of stumps
https://www.conserve-energy-future.com/causes-effects-solutions-of-deforestation.php
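The iterative boosting loop, where each round fits a stump to re-weighted data and up-weights the points the stump got wrong, can be sketched in plain Python (a minimal illustration of the classic AdaBoost.M1 scheme, not the paper's MATLAB code):

```python
import math

def weighted_stump(X, y, w):
    # Weak learner: the one-question split with the lowest *weighted* error.
    best = (float("inf"), None)
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (1 if sign * (row[f] - t) > 0 else -1) != yi)
                if err < best[0]:
                    best = (err, (f, t, sign))
    return best

def adaboost(X, y, rounds=5):
    # y must be in {-1, +1}. Builds a weighted committee of stumps.
    n = len(y)
    w = [1.0 / n] * n
    committee = []
    for _ in range(rounds):
        err, (f, t, sign) = weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)   # this stump's vote weight
        preds = [1 if sign * (row[f] - t) > 0 else -1 for row in X]
        # Boosting step: up-weight the points this stump misclassified.
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
        committee.append((alpha, (f, t, sign)))
    return committee

def predict(committee, x):
    score = sum(a * (1 if s * (x[f] - t) > 0 else -1)
                for a, (f, t, s) in committee)
    return 1 if score > 0 else -1

model = adaboost([[0.0], [1.0], [2.0], [3.0]], [-1, -1, 1, 1], rounds=3)
```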
17. TISSUE DATA
Breast cancer TMA
Biomax BR20832
40 cores stage II breast cancer
10 cores normal-associated tissue
Top: H&E images
A = cancer
B = normal associated tissue
Bottom: FT-IR images
Red = cancerous epithelium
Purple = cancerous stroma
Green = NAT epithelium
Orange = NAT stroma
https://www.biomax.us/tissue-arrays/Breast/BR20832
18. UNDER-SAMPLING
Easiest method to understand
Determine class with the fewest members
Randomly delete members of other classes until all have the same number
Discards much of the data, training set reduced
Resulting model is weaker
Remains unbiased, but with higher variance
[Bar chart: under-sampling of four classes, showing data retained vs data discarded]
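The under-sampling recipe above, shrinking every class down to the size of the smallest one by random deletion, fits in a few lines of Python (an illustrative sketch, not the study's MATLAB code):

```python
import random
from collections import Counter

def undersample(X, y, rng):
    """Randomly delete members of the larger classes until every
    class has as many samples as the smallest one."""
    counts = Counter(y)
    n_min = min(counts.values())
    kept_X, kept_y = [], []
    for label in counts:
        idx = [i for i, yi in enumerate(y) if yi == label]
        for i in rng.sample(idx, n_min):   # sample without replacement
            kept_X.append(X[i])
            kept_y.append(label)
    return kept_X, kept_y

rng = random.Random(0)
X = list(range(8))
y = ["cancer"] * 6 + ["NAT"] * 2       # unbalanced: 6 vs 2
Xb, yb = undersample(X, y, rng)        # balanced: 2 vs 2
```

Note how much data is simply thrown away, which is why the resulting model is weaker, unbiased but with higher variance.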
19. OVER-SAMPLING
Determine class with the most members
Duplicate members of other classes to reach this number
Increases training data size
Many approaches
[Bar chart: over-sampling of four classes, showing original data vs duplicates]
20. OVER-SAMPLING APPROACHES
Class 1 – majority – N samples
Class 2 – minority – P samples
N >> P
• Duplicate all samples in class 2, N-P times
• Randomly select N samples from class 2 (sampling with replacement)
• Randomly select N-P samples from class 2 and append to original class 2
• Interpolate some class 2 members and append (example is SMOTE†)
†BMC Bioinformatics, 2013, 14, 106. https://doi.org/10.1186/1471-2105-14-106
Other approaches are available
21. OVER-SAMPLING APPROACHES
Assume class 1 is majority with N samples
Class 2 is minority with P samples
N >> P
• Duplicate all samples in class 2, N-P times
• Randomly select N samples from class 2 (sampling with replacement)
• Randomly select N-P samples from class 2 and append to original class 2
• Interpolate some class 2 members and append (example is SMOTE)
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
All data in the minority class are represented. Duplicates are drawn by ‘random sampling with replacement’ (Bootstrap)
26. OVER-SAMPLING TRAINING SETS
Data sets are balanced, but can become large
All cancer spectra are unique, but many NAT spectra are duplicates
Initial ratio | Num cancer | Over-sampled NAT   | Num NAT | Total
50:50         | 2500       | U U U U U          | 2500    | 5000
60:40         | 3000       | U U U U D D        | 3000    | 6000
70:30         | 3500       | U U U D D D D      | 3500    | 7000
80:20         | 4000       | U U D D D D D D    | 4000    | 8000
90:10         | 4500       | U D D D D D D D D  | 4500    | 9000
(Each cell represents a block of 500 NAT spectra: U = unique originals, D = random duplicates)
30. CONCLUSION
Both models correctly classify > 90% of samples
Models built with unbalanced classes can be misleading
AdaBoost slightly better at classification
Random Forests remains relatively stable until very small class sizes
AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high
31. You don't understand! I could’ve been a contender. I could've had class… Real class. — On the Waterfront
32. CONCLUSION
Both models correctly classify > 90% of samples
Models built with unbalanced classes can be misleading
AdaBoost slightly better at classification
Random Forests remains relatively stable until very small class sizes
AdaBoost with over-sampling could be a good combination, particularly when our class imbalance is high
Editor's Notes
Hello. I’d like to thank the organizers for giving me this opportunity to tell you about some work we’ve been doing in Manchester, using machine learning to look at unbalanced classes.
My name is Alex Henderson, and this presentation outlines work recently published in the Analyst, which is available Open Access.
Both the raw, and processed, data are available on Zenodo, and this video and slide deck will be made available from my and the group’s website, following the conference.
I think it’s only fair to point out that Jennie did all the work, and I only hope I can do a good job of representing her today!
So, what is the class imbalance problem?
Consider a piece of tissue, stained with H&E to highlight the cell morphology.
We can analyse this using infrared, [CLICK] and build a model to identify various cell types. Note, however, that there is a wide range in the composition of the tissue. Some cell types only appear in very low abundance.
And it’s this difference in the number of spectra in each class, that can present a problem when we come to build our chemometric models.
In this study we have explored adaptive boosting - or AdaBoost - and compared its performance against the Random Forests algorithm, now used by a number of groups, including ourselves.
Both AdaBoost and Random Forests fall into the category of Ensemble Methods.
An ‘ensemble’ is just another way of saying ‘a collection’, where the members of that collection are of the same type, but possibly different state.
Ensemble methods use collections of what are called - ‘weak learners’ - to attack the problem at hand.
These methods use many weak learners, rather than a single strong learner.
Strong learners can be difficult to build and may require a lot of data. They are tuned to the problem at hand, but can overfit if tuned too closely.
Weak learners on the other hand are relatively easy to build. The term ‘weak learner’ comes from the idea that they are not really very good at learning! A single weak learner has a success rate of barely over 50%; only just better than guessing, or tossing a coin.
However, when brought together en masse, they gel to form good models. Better than the sum of their parts, you could say!
So, while a strong learner will be useful for specific challenges, weak learners benefit from: ‘the wisdom of the crowds’.
The most common weak learner in ensemble learning is the decision tree, and these are used in both Random Forests and AdaBoost.
Here, the variable that best separates the training set data, becomes the ‘root node’. The data is then split into different branches. Each branch is considered separately, and the best variable for that branch becomes the decision point for the next split. The same variables can appear in different branches, in different orders, since the source data is changing after each split.
Eventually no further splits are required, and the outcome appears in leaf nodes.
Remember that these trees are not meant to be very good at making decisions! That’s the whole point!
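The split-selection step described above — finding the variable and threshold that best separate the training data — can be sketched in a few lines. This is an illustrative, stdlib-only Python sketch (the published work used MATLAB); the Gini impurity criterion and all names here are my own choices for illustration, not taken from the paper.

```python
def gini(labels):
    # Gini impurity of a set of class labels: 0 means a pure node.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    # Pick the (variable, threshold) pair that best separates the training
    # data; this becomes the root node, and each branch is then considered
    # separately with the same procedure.
    best = None
    n_vars = len(X[0])
    for var in range(n_vars):
        for thresh in sorted({row[var] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[var] < thresh]
            right = [yi for row, yi in zip(X, y) if row[var] >= thresh]
            if not left or not right:
                continue  # skip degenerate splits
            # Weighted average impurity of the two branches; lower is better.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, var, thresh)
    return best
```

For example, `best_split([[1], [2], [8], [9]], [0, 0, 1, 1])` finds the threshold 8 on variable 0, giving two pure branches.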
A random forest is a collection of decision trees, with each tree being given a different set of variables. This prevents any single variable from dominating in the resulting model.
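Two ingredients of the Random Forests recipe just mentioned — giving each tree a different variable subset, and combining the trees by majority vote — can be sketched as follows. This is a hedged, stdlib-only illustration; the function names and parameters are hypothetical, not from the paper's MATLAB code.

```python
import random
from collections import Counter

def feature_subsets(n_features, n_trees, subset_size, seed=0):
    # Each tree in the forest sees a different random subset of variables,
    # preventing any single variable from dominating the resulting model.
    rng = random.Random(seed)
    return [rng.sample(range(n_features), subset_size) for _ in range(n_trees)]

def majority_vote(tree_predictions):
    # Each tree votes for a class; the forest returns the majority choice.
    return Counter(tree_predictions).most_common(1)[0][0]
```

A usage sketch: with 100 spectral variables and 5 trees, `feature_subsets(100, 5, 10)` hands each tree its own 10 variables; at test time, `majority_vote(['cancer', 'NAT', 'cancer'])` returns `'cancer'`.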
For boosting approaches, AdaBoost being the first and most common, we make the decision trees even more ‘dumb’ by only allowing a single decision split. This produces what’s called a ‘decision tree stump’. The root node is still defined around the variable that is most ‘important’ in separating the data in the training set, but other variables don’t get a look in. Because there is only one split, the tree can’t ‘refine’ its decision, so it just has to go with what it’s got.
So, AdaBoost uses a collection of decision tree stumps, rather than full trees. Each tree gets different variables in the same way as Random Forests, but the trees only get to make a single choice.
The main difference between boosting techniques, such as AdaBoost, and a bagging approach like Random Forests, is that boosting is ‘iterative’.
So AdaBoost is effectively a forest of stumps…
[CLICK] …not to be confused with…
…a Forrest of Gumps!
Sorry, couldn’t resist!
The name AdaBoost is short for Adaptive Boosting. In this case the adaptive part is introduced by iteration and weighting.
[CLICK] To start with all samples are weighted equally. The decision tree (stump) then identifies a parameter that can split the data into class A or class B; in this case triangles and squares.
Any samples that were misclassified are then upweighted, with those correctly classified being downweighted. These modified data are then presented to a new decision tree. Since the weights on the previously misclassified samples are now higher, they are more likely to be correctly classified. Now, it is important to point out here that we’re not multiplying the spectral data points by this weighting; we’re changing their relative importance to the algorithm.
Next the misclassified samples from this second iteration are upweighted, with the correctly classified samples being downweighted, and we go for a third iteration.
After three iterations we stop, we combine the iterations and produce the ‘outcome’ of that tree ‘set’.
So, by iterating, and biasing each iteration in favour of samples that were wrongly classified in previous steps, we produce a stronger classifier. This might not be a VERY strong classifier, but it will be used in combination with others in the overall algorithm.
As with the Random Forests approach, when we introduce test data, each tree (or tree set) gets a vote for whichever class it thinks that test sample should fall into. There are various metrics that can be used here, but the majority vote is the easiest to think about and easiest to apply.
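The weight-update-and-vote loop described over the last few slides can be condensed into a short, pure-Python sketch using 1-D decision stumps. To be clear, this is my own minimal illustration with toy data, not the paper's implementation (the published code is MATLAB); the binary labels are coded as ±1, a common convention for AdaBoost.

```python
import math

def train_stump(X, y, w):
    # Find the threshold/polarity stump minimising the weighted error on 1-D data.
    best = None
    for thresh in sorted(set(X)):
        for polarity in (1, -1):
            pred = [polarity if x >= thresh else -polarity for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def adaboost_fit(X, y, n_rounds=5):
    n = len(X)
    w = [1.0 / n] * n          # to start with, all samples are weighted equally
    ensemble = []
    for _ in range(n_rounds):
        err, thresh, polarity = train_stump(X, y, w)
        err = max(err, 1e-12)  # guard against division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thresh, polarity))
        # Upweight misclassified samples, downweight correct ones, renormalise.
        # Note we change the samples' importance, not the spectral data itself.
        w = [wi * math.exp(-alpha * yi * (polarity if x >= thresh else -polarity))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    # Each stump votes, weighted by its alpha; the sign of the sum decides.
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

On a trivially separable toy set (class −1 for x < 5, class +1 otherwise), three rounds are more than enough to classify every training point correctly.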
So, now we have our problem, and two potential algorithms to apply, how well do they work when presented with unbalanced data?
To assess this we used a tissue microarray containing breast cancer tissue from 208 patients. We selected 40 cores relating to cancer and 10 relating to normal-associated tissue. Normal-associated tissue comes from non-malignant cores, in regions adjacent to a tumour. You don’t usually get access to healthy tissue. After all, most people don’t want to have a biopsy unless there is some VERY GOOD underlying medical reason!
We manually annotated these tissues, according to W.H.O. guidelines, and identified regions corresponding to cancerous epithelium and normal associated epithelium. We also annotated normal and cancerous stroma, but those spectra were not included in this study.
So, the first sampling method we will take a look at is under-sampling.
In this method we identify the class with the fewest members and reduce all other classes to that number. This is simple to understand and to apply. The downside is that we tend to throw away lots of data. If the smallest class is much smaller than the others, we will end up discarding most of the data acquired. This has the knock-on effect of weakening the model because the data available for the training set will be a smaller sample of the acquired population.
The good thing about under-sampling is that all the spectra remain unique; there are no duplicates. The model will be unbiased, but will have a higher variance.
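The under-sampling procedure just described — find the smallest class, then randomly discard members of every other class down to that size — is simple enough to sketch directly. This is an illustrative stdlib-only Python version with hypothetical names; the samples could be spectra, and a fixed seed is used only to make the sketch reproducible.

```python
import random

def undersample(classes, seed=0):
    # classes: dict mapping class label -> list of samples (e.g. spectra).
    rng = random.Random(seed)
    n_min = min(len(v) for v in classes.values())  # size of the smallest class
    # Randomly keep n_min samples from each class; everything else is discarded.
    return {label: rng.sample(samples, n_min)
            for label, samples in classes.items()}
```

With 1,000 ‘cancer’ samples and 100 ‘NAT’ samples, both classes come out at 100 — balanced, unique, but with 900 acquired spectra thrown away.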
The opposite of under-sampling is, of course, over-sampling!
In this scenario we increase the numbers in each of the minority classes to match the class with the most members. This will increase the size of the training set, which could be problematic for the target algorithm or computational resource available.
The biggest problem, however, comes when we have to decide on where these increased numbers will come from.
There are lots of methods we can choose to over-sample our data. Here I’ve listed four.
The first simply takes a copy of the smaller class and appends it to itself. We can repeat this until we reach the size of the larger class. Of course we will never get an exact match, well pretty unlikely anyway, so we need a method of dealing with the over/under hang. We can simply ignore this and say our classes are now much more similar, or we can use some form of randomisation to get the exact number.
This has the benefit of each spectrum in the minority class being equally represented in the newly generated group; well without taking into account the randomness if that’s the way we want to go.
And, of course, there are other approaches we could take.
The second approach uses something like a Bootstrap sampling approach, which is ‘sampling with replacement’, to randomly re-generate the minority class. Bootstrap has low bias and variance, but there could be samples that never actually get selected. That means we are throwing away some original data.
Method three is similar to method two, except we ensure all the minority class are included and only Bootstrap the required difference.
Then there is the option of changing the data. The first three methods simply selected (or didn’t select) the spectra in the minority class. Another approach is to interpolate some of the spectra to generate data that was never actually acquired. One of these methods is called SMOTE and is discussed in a paper by Blagus and Lusa.
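The interpolation idea can be sketched as below. Please note this is a deliberately simplified, SMOTE-like illustration and NOT the full SMOTE algorithm (real SMOTE interpolates between a sample and one of its k nearest minority-class neighbours; see the Blagus and Lusa paper). All names here are my own.

```python
import random

def interpolate_minority(minority, n_new, seed=0):
    # Simplified SMOTE-style idea: create synthetic samples by linear
    # interpolation between randomly chosen pairs of minority-class samples.
    # Each sample is a list of intensities (e.g. a spectrum).
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # pick two distinct originals
        t = rng.random()                # random point along the line a -> b
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

Note that, unlike the selection-based methods, this generates data that was never actually acquired, which is part of the reason we did not take this route.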
However, in this work we decided to go with method 3. This has the advantage of ensuring all the data acquired, relating to the minority class, are actually included in the training set, and any duplication being handled by the well-respected Bootstrap method.
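Method 3 — keep every original minority sample and Bootstrap only the shortfall — can be sketched in a few lines. This is an illustrative stdlib-only Python version (the published code is MATLAB); function and variable names are my own.

```python
import random

def oversample_keep_all(majority, minority, seed=0):
    # Keep every original minority sample, then use the Bootstrap (random
    # sampling with replacement) for only the N - P extra samples needed
    # to match the majority class size.
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(minority) + extra
```

So for the 90:10 case in the table above: 4,500 majority (cancer) spectra and 500 unique minority (NAT) spectra give a balanced minority class of 4,500, in which every one of the 500 originals is guaranteed to appear.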
So how did we get on?
First I should mention that the same independent test set was used in all cases. In addition we tried as much as possible to create training sets that were built by either expanding or contracting existing training sets, rather than generating each one randomly. This has the advantage of showing the variation in having larger or smaller data sets, rather than new ones created randomly. If we were to create lots of random data sets, some trends might be hidden.
In all cases the exercise of generating training sets and testing them was repeated 5 times, with the same independent test set used in each case.
It’s useful to get some ground truth, so we know whether any changes we see as a function of sampling are actually due to the change in the size of the training sets themselves.
We created balanced sets of different sizes, from 2,500 per class down to 10. As you can see, both algorithms perform surprisingly well. It’s not until we get down to 100 samples per class that AdaBoost starts to fall over. At this point all samples are being classified as normal-associated. However, when we have large numbers per class it performs a little better than Random Forests. Although, you have to say that classification accuracies of 90% and over are really rather good. It is worth pointing out here that all these data are generated from the same TMA, so accuracies of this level will probably not be maintained across different samples, instruments, etc. However, using the same sample has the benefit of removing these additional sources of error, so we can concentrate on the performance of the algorithms themselves, and the sampling methods.
On the right, we can see that the Random Forests method stays pretty strong beyond 100 samples, and can even generate a reasonable result with only 10 samples per class!
So, taking a closer view of the left hand side of that plot, we generated some under-sampled training data. Each of these training sets has the same number of cancer and normal associated spectra, but as the size of the minority set gets smaller, you can see we end up throwing away lots of the majority class to match.
AdaBoost appears to out-perform Random Forests, with the normal-associated tissue being almost perfectly classified for all sample sizes. Although, to be fair, they both do pretty well. The cancer samples do not perform quite as well, so more are being misclassified as the training set gets smaller. The variability in the Random Forests data is slightly larger too.
Over-sampling is a bit more complicated. The red box in the table on the right indicates the spectra that are unique. That includes all the cancer spectra and normal-associated spectra originally in the samples. In order to over-sample we randomly duplicate more and more of the normal-associated, to keep up with the growing cancer data set. The dark blue squares labelled - D - represent duplicates, while the light blue squares labelled - U - represent the original spectra. As you can see, by the time we have a ratio of 9 to 1 we have 4,500 cancer spectra, each of which is unique, but only 500 unique normal-associated spectra. From these 500 we now need to randomly select another 4,000 spectra.
So, how does this duplication affect the outcome? Well, the AdaBoost method still seems to perform strongly. Note that the two lines cross over when our ratio is very large. This is probably due to the duplication in the normal-associated data leading to overfitting and that being reflected in its inability to correctly classify the test data.
The Random Forests method performs less well, and appears to be more influenced by the duplication than AdaBoost.
It’s worth taking a moment to compare the two sampling methods, using the same algorithm. With AdaBoost it looks like over-sampling works best and the level of classification accuracy remains fairly constant as the sample sizes change.
However, with Random Forests we get a different answer. Note how under-sampling improves the normal-associated accuracy, while the cancer samples become less well classified. However, with over-sampling we get the opposite effect. The cancer samples get better, but the normal-associated fall away.
This is worrying because it means we could get a different answer depending on the choice of algorithm AND the choice of sampling method.
So, what did we learn from doing this work?
Firstly, on this, admittedly, limited, data set, we can see that infrared does a good job of classifying cancer from non-cancer data. We have been discussing values in the 80-95% accuracy range, and, even allowing, for the use of a single instrument and a single TMA, this is an indication that IR is useful here.
However, we need to be careful in our choice of algorithm and sampling method, because our results could be misleading.
AdaBoost seems to be slightly better at classification, and both AdaBoost and Random Forests will give good accuracy down to about 100 spectra per class (under-sampled). And Random Forests remains relatively stable until we reach very small class sizes, in the 10s.
AdaBoost seems to be stable to over-sampling, while Random Forests is only stable for ratios that are relatively close; down to about 70:30.
Coming back to our original question, for unbalanced classes, will AdaBoost come to the rescue?
Well, I think the jury is still out. However, I think AdaBoost IS a contender, and we should do more work in this area to see how useful it can be.