The document discusses machine learning methods for analyzing spectroscopic data and classifying samples. It describes how machine learning techniques like random forests are data-driven and do not require assumptions about the distribution of the data, unlike classical statistical analyses. The document provides examples of how random forest classification was able to correctly classify over 88% of test spectra from SIMS images of amino acid-coated beads and how a random forest model classified a large FTIR image of cancer tissue in under 60 seconds. In conclusion, machine learning methods seem useful for spectroscopic data analysis and increased computing power will allow their broader application.
Extending the SuperLearner framework to survival analysis. Includes boosted regression, random forests, decision trees, Bayesian model averaging, and Morse-Smale regression.
Machine learning session 6 (decision trees, random forests) - Abhimanyu Dwivedi
Concepts include decision tree with its examples. Measures used for splitting in decision tree like gini index, entropy, information gain, pros and cons, validation. Basics of random forests with its example and uses.
The presentation begins with background on big data, mainly metagenomic data, and the hurdles in analysing it with conventional approaches. It then gives a brief introduction to machine learning approaches, with a biological example of each. Finally, it presents work focused on the implementation of one machine learning approach, Random Forest, for the functional annotation and taxonomic classification of metagenomic data.
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ... - Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at the population level, rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project, covering 2,504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and relatively few variables, and either fail or are inefficient in the WGAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for highly dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and related challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
Prote-OMIC Data Analysis and Visualization - Dmitry Grapov
Introductory lecture to multivariate analysis of proteomic data.
Material from the UC Davis 2014 Proteomics Workshop.
See more at: http://sourceforge.net/projects/teachingdemos/files/2014%20UC%20Davis%20Proteomics%20Workshop/
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most state-of-the-art regression and classification techniques are unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose novel types of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC) respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC) to build accurate and interpretable PXR and PXC models.
We have conducted a set of comprehensive performance studies to evaluate the performance of CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high dimensional datasets. Besides being new types of modeling, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to handle classifying imbalanced datasets and introduced a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we applied a weighting method to boost minority instances as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in those three applications.
FAIRSpectra - Towards a common data file format for SIMS images - Alex Henderson
Presentation from the 101st IUVSTA Workshop on High performance SIMS instrumentation and machine learning / artificial intelligence methods for complex data.
This presentation describes the issues relating to storing and sharing data from Secondary Ion Mass Spectrometry experiments, and some potential solutions.
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry - Alex Henderson
Presentation from the "Opening up Research" conference organised by the University of Manchester's Office for Open Research.
24 April 2024
https://www.openresearch.manchester.ac.uk/
https://fairspectra.net
https://alexhenderson.info
The Class Imbalance Problem: AdaBoost to the Rescue? - Alex Henderson
Here we discuss the issues when applying Random Forests and AdaBoost data analysis methods to infrared spectroscopy data sets, where the numbers in each class vary.
Invited presentation at the 11th International Conference on Advanced Vibrational Spectroscopy (ICAVS-11), 23-26 August 2021. This was a virtual conference.
This presentation relates to our paper in Analyst "Exploring AdaBoost and Random Forests machine learning approaches for infrared pathology on unbalanced data sets" by Jiayi Tang, Alex Henderson and Peter Gardner.
Paper: https://doi.org/10.1039/D0AN02155E (available open access, CC-BY).
Raw data: https://doi.org/10.5281/zenodo.4986399 (CC-BY)
Processed data, and MATLAB source code: https://doi.org/10.5281/zenodo.4730312 (CC-BY)
Abstract
The use of infrared spectroscopy to augment decision-making in histopathology is a promising direction for the diagnosis of many disease types. Hyperspectral images of healthy and diseased tissue, generated by infrared spectroscopy, are used to build chemometric models that can provide objective metrics of disease state. It is important to build robust and stable models to provide confidence to the end user. The data used to develop such models can have a variety of characteristics which can pose problems to many model-building approaches. Here we have compared the performance of two machine learning algorithms – AdaBoost and Random Forests – on a variety of non-uniform data sets. Using samples of breast cancer tissue, we devised a range of training data capable of describing the problem space. Models were constructed from these training sets and their characteristics compared. In terms of separating infrared spectra of cancerous epithelium tissue from normal-associated tissue on the tissue microarray, both AdaBoost and Random Forests algorithms were shown to give excellent classification performance (over 95% accuracy) in this study. AdaBoost models were more robust when datasets with large imbalance were provided. The outcomes of this work are a measure of classification accuracy as a function of training data available, and a clear recommendation for choice of machine learning approach.
Design decisions relating to ChiToolbox, presented at the Kick-off Meeting for OpenVibSpec, 3 February 2020, in Bochum, Germany.
ChiToolbox is an open source MATLAB toolbox for handling data from hyperspectral imaging experiments.
https://bitbucket.org/AlexHenderson/ChiToolbox/
https://openvibspec.org/
What's mine is yours (and vice versa): Data sharing in vibrational spectroscopy - Alex Henderson
Presentation given at SPEC 2014, Krakow, Poland. 17-22 August 2014
[some slides do not display correctly, download the pdf for better quality]
In our day-to-day practice we collect data, convert this to information, hopefully extract knowledge, and then pass this on to our peers, thereby advancing the global understanding of our field. This is a very linear process. What if we were to share our data? Have others take our information and combine it with their own? Such a branched process would likely result in more rapid discoveries and, potentially, a greater understanding. In order to facilitate data sharing we must define at least two interfaces with our peers:
1. A mechanism of them understanding the language of our data
2. A mechanism of passing on the context of our experiment
Of course, both of these must work in reverse; we must understand their data and also their experimental context. These are separate yet related ideas. Our data are meaningless without context, but because we are ‘close to the action’ we do not explicitly document them.
Recording the nature of our experiments can have benefits closer to home. Too often we find ourselves searching for results that we know we recorded, but have difficulty locating. Then there is the issue of recalling the exact experimental procedure involved in the sample preparation or data reduction. Documentation of these will lead to better laboratory practice all round.
Earlier this year, a network of academic, clinical and industrial groups was constituted in the UK, with some international partners, to consider how best to push forward the use of infrared and Raman spectroscopies in the clinical arena: CLIRSPEC [1]. One of the work packages of the CLIRSPEC network is the development of standard protocols for data sharing. The work package falls, initially, into two parts;
1. How to easily and uniformly transfer our data between research teams and, by association, into an accessible archive.
2. How to record the provenance of our samples, the treatments they undergo, the experiments performed on them and the manner the resulting data was manipulated: the metadata.
In this presentation we will outline the current position of the CLIRSPEC work package, both in terms of the performance of various candidate data formats (JCAMP-DX, SPC, netCDF, …), and the options for the recording of the metadata associated with the experimental procedure (controlled vocabularies, XML, RDF, ISA-TAB, …). Included here is the concept of a minimum reporting requirement for IR and Raman, particularly in the clinical arena, that we can all try to meet.
None of this can happen without the buy-in of the community. We seek to engage everyone in a dialogue that will result in more consistent, and hopefully better, practice across all laboratories to further our understanding of clinical vibrational spectroscopy.
[1] http://clirspec.org
This presentation covers factors that influence the form of a Static SIMS spectrum and various issues that may arise in its interpretation.
Presented at the Joint IAEA-SPIRIT-Japan Technical Meeting on Development and Utilization of MeV-SIMS. Inter-University Centre, Dubrovnik, Croatia. 21-25 May 2012
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... - Wasswaderrick3
In this book, we use conservation-of-energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous (friction) effects. We derive the general equation of flow/velocity, and from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. Where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our energy-conservation techniques to a sphere falling in a viscous medium under gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium, and at the general equation of terminal velocity.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... - Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
What are greenhouse gases, and how many affect the Earth? - moosaasad1975
What greenhouse gases are, how they affect the Earth and its environment, and what the future holds for the Earth's environment, weather and climate.
Richard's adventures in two entangled wonderlands - Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... - Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. 
To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Rise of the Machines: The Use of Machine Learning in SIMS Data Analysis
1. RISE OF THE MACHINES
THE USE OF MACHINE LEARNING IN SIMS DATA ANALYSIS
Alex Henderson
University of Manchester
SurfaceSpectra Ltd
http://about.me/henderson.alex
Twitter: @AlexHenderson00
4. QUESTIONS WE MIGHT ASK
• Exploratory data analysis
  • What can we find out about these samples?
  • No prior knowledge required
• Differences in chemical or physical state between groups of samples
  • Highlights spectral changes as a function of group membership
  • Need to know which group each spectrum belongs to
• Trend analysis
  • Spectral changes as a function of a dependent variable: time, concentration, disease state, etc.
• Classification of samples
  • Spectral characteristics of groups
  • Prediction of unseen samples into known groups
5. DATA ANALYSIS APPROACHES
Classical analysis: hypothesis driven; assumes a distribution of spectral response
Machine learning: data driven; interrogation of the data leads to the hypothesis
Validation is always required when building a predictive model
6. CLASSICAL ANALYSIS
Assumes data obey the Central Limit Theorem:
data are Normally distributed (Gaussian, or bell-shaped, curve)
Mathematically we can derive 4 ‘moments’
• Mean (average)
• Variance (the square of the standard deviation)
• Skewness (asymmetry)
• Kurtosis (peakedness)
Other descriptions follow from these parameters,
eg Student’s t-test, ANOVA, MANOVA
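As a concrete illustration of the four moments listed above, here is a minimal sketch using NumPy and SciPy; the intensity values are invented stand-ins for a spectral response, not data from the talk:

```python
import numpy as np
from scipy import stats

# Invented stand-in for a set of spectral responses
rng = np.random.default_rng(0)
intensities = rng.normal(loc=100.0, scale=15.0, size=1000)

mean = np.mean(intensities)              # 1st moment: the average
variance = np.var(intensities)           # 2nd central moment: spread (std dev squared)
skewness = stats.skew(intensities)       # 3rd: asymmetry of the distribution
kurtosis = stats.kurtosis(intensities)   # 4th: peakedness (excess kurtosis here)

print(mean, variance, skewness, kurtosis)
```

For a Normal distribution the skewness and excess kurtosis are both close to zero, which is exactly what tests such as Student's t-test rely on.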
8. HISTORY OF MVA
Classical multivariate analysis dates from the 1930s:
Harold Hotelling, Ronald Fisher, Herman Wold and others
• Principal components analysis (PCA)
• Partial least squares (PLS)
• Fisher’s discriminant analysis
• Linear discriminant analysis (LDA), etc.
Slide rule is King!
9. HISTORY OF MVA CONTINUED
Computers become generally available in the 1950s
Calculations become faster and more reproducible
New approaches are developed
The term ‘machine learning’ was coined by Arthur Samuel in 1959,
and now describes a branch of computer science
13. WHAT FITS WHERE?
Task                                  Classical Analysis                      Machine Learning
Exploratory analysis                  PCA                                     K-means, HCA
Differences in state between groups   Discriminant analysis (LDA, QDA, CVA)   Random Forest classification
Trend analysis                        Regression analysis, MCR                Random Forest regression
Classification of samples             LDA, QDA                                Random Forest classification, SVM
15. RANDOM FOREST
Ensemble method:
combine lots of weak classifiers to build one strong one
A collection of Decision Trees
Computationally intensive
Developed 1995 – 2001
MATLAB: TreeBagger
Python: sklearn.ensemble.RandomForestClassifier (scikit-learn)
16. DECISION TREE
An expression of an algorithm
Weak classifier
Move through each step in turn
[Diagram: a toy decision tree deciding between ‘Beer’ and ‘Work’, with decision nodes ‘Boss around?’ (Yes/No), ‘Pay day?’ (Recent/Long ago) and ‘Weather?’ (Sunny/Rainy/Windy)]
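The ‘weak classifier’ idea can be sketched with a single depth-limited tree in scikit-learn; the animal-style features and labels below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented toy features: [number of legs, neck length relative to body]
X = [[4, 3.0], [4, 0.5], [2, 0.3], [4, 2.8], [2, 0.2]]
y = ["giraffe", "dog", "bird", "giraffe", "bird"]

# Limiting the depth makes this a deliberately weak classifier:
# it can only ask a couple of questions before it must decide
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Four legs and a long neck: the tree follows its questions to a leaf
print(tree.predict([[4, 2.9]]))
```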
17. ENSEMBLE OF TREES
Randomly select subsets of variables (m/z intensities)
Train multiple (a few hundred) decision trees, each with different variables
Each tree does the best it can with only a portion of the data
See which trees are best and weight them higher
18. VARIABLE ZOO
Ratio measurements taken for many animals
For example:
• Length of leg
• Number of legs
• Number of wings
• Has horns/antlers?
• Length of neck
• Length of tail
Many examples of each animal used
No tree gets all measurements
19. TOO EASY?
The giraffe is easily recognised by the number of legs and length of neck. Oh, and it’s not a bird…
If any tree has those variables it would always identify the animal as a giraffe. No need for anything else.
20. WRONG!
A Gerenuk is a four-legged mammal with a long neck.
The decision tree was good, but not good enough.
It needed to be tamed by other trees.
The Random Forest model prevents some trees from dominating the overall result.
22. CLASSIFICATION EXAMPLE
Polystyrene beads
Each bead coated with a different amino acid
SIMS image using Biotof
256 × 256 pixels
1000 amu, bin-summed to 1 amu
Data courtesy of Nick Winograd, Penn State University, USA, ~1999
23. TRAINING AND TEST REGIONS
Two regions on each bead, and also the substrate, selected
One region to train, the other to test
Each region 400 pixels
Square root taken
Vector normalised
[Image: SIMS image with the three test regions labelled ‘Test’]
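The preprocessing described above (square root, then vector normalisation) can be sketched in a few lines of NumPy; the Poisson matrix below merely stands in for a 400-pixel training region of unit-mass-binned spectra:

```python
import numpy as np

# Stand-in for one training region: 400 spectra x 1000 unit-mass bins
rng = np.random.default_rng(42)
spectra = rng.poisson(lam=5.0, size=(400, 1000)).astype(float)

# Square root stabilises the variance of Poisson-like ion counts
spectra = np.sqrt(spectra)

# Vector (L2) normalisation: scale each spectrum to unit length
norms = np.linalg.norm(spectra, axis=1, keepdims=True)
spectra = spectra / norms

# Every row now has unit Euclidean length
print(np.allclose(np.linalg.norm(spectra, axis=1), 1.0))
```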
24. RANDOM FOREST MODEL
Training data (3 × 400 spectra)
Each spectrum labelled: bead 1, bead 2, or substrate
Random Forest model constructed using scikit-learn’s RandomForestClassifier in Python 3.5
300 trees selected; other parameters left as default
Code executed in PyCharm 2017.1.2
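A minimal sketch of the model construction described here, using RandomForestClassifier from sklearn.ensemble with 300 trees and default parameters; the training arrays are simulated stand-ins (smaller than the real 3 × 400 spectra), not the actual SIMS data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_bins = 100  # stands in for the 1000 unit-mass bins

# Simulated stand-in: 100 spectra per class, with slightly shifted means
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, n_bins))
               for m in (0.0, 0.3, 0.6)])
y = np.repeat(["bead 1", "bead 2", "substrate"], 100)

# 300 trees, other parameters left as default (as on the slide)
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # training accuracy; a held-out test set is needed in practice
```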
25. CONFUSION MATRIX
Percentage of correctly predicted values (rows: truth; columns: prediction)

Truth \ Prediction   Bead 1   Bead 2   Substrate
Bead 1                97.5      1.5       1.0
Bead 2                 3.0     96.0       1.0
Substrate              8.3      3.3      88.5

Diagonal (trace) indicates > 88% of test spectra correctly classified
Caution: result should be verified by cross-validation or bootstrap
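A confusion matrix of this kind can be computed from held-out predictions with scikit-learn; the truth/prediction labels below are invented to show the mechanics, not the values from the slide:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented truth/prediction labels for illustration
labels = ["bead1", "bead2", "substrate"]
y_true = ["bead1", "bead1", "bead2", "bead2", "substrate", "substrate"]
y_pred = ["bead1", "bead1", "bead2", "substrate", "substrate", "substrate"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Convert counts to row percentages (rows = truth, columns = prediction)
cm_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)
print(cm_pct)
```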
26. VARIABLE IMPORTANCE
Each decision tree uses a different combination of mass values
Determine which m/z values were used by the most accurate trees
This is a measure of the importance of those variables: m/z
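In scikit-learn, this notion of variable importance is exposed on a fitted forest as feature_importances_; a toy sketch with invented data in which only one ‘m/z bin’ carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))    # 20 'm/z bins', mostly noise
y = (X[:, 3] > 0).astype(int)     # class depends only on bin 3

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Rank the m/z bins by their importance to the forest
ranked = np.argsort(model.feature_importances_)[::-1]
print(ranked[:3])  # bin 3 should dominate
```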
27. PREDICTION OF ENTIRE IMAGE
Trained Random Forest model used to predict the class of each pixel in the original image
Render the image using the result of the classification
Total time: 15 sec
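Per-pixel prediction of a whole image amounts to flattening the hyperspectral cube to a matrix of spectra, predicting, and reshaping the labels back to image dimensions; a sketch with a simulated cube and stand-in model (not the actual data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
h, w, n_bins = 32, 32, 50  # small stand-in for a 256 x 256 x 1000 cube

cube = rng.normal(size=(h, w, n_bins))

# Stand-in trained model (in practice: the forest trained on labelled regions)
X_train = rng.normal(size=(100, n_bins))
y_train = rng.integers(0, 3, size=100)
model = RandomForestClassifier(n_estimators=50, random_state=2).fit(X_train, y_train)

# Flatten pixels to rows, predict a class per pixel, reshape to an image
class_map = model.predict(cube.reshape(-1, n_bins)).reshape(h, w)
print(class_map.shape)
```

The resulting class map can then be rendered directly, one colour per predicted class.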
28. FTIR – CANCER TISSUE
Epithelium 24.3%
Smooth muscle 50.7%
Lymphocytes 2.5%
Blood 0.2%
Concretion 0.0%
Fibrous stroma 12.3%
ECM 10.0%
Random forest classifier
Trained on exemplars identified with a pathologist
6 hour data acquisition
2.5 million spectra classified in < 60 sec
No staining or de-waxing of the sample required
Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D doi:10.1117/12.2043290
29. SUMMARY
Machine learning methods appear to be useful tools that we should consider for adoption
Unsupervised, supervised classification and supervised regression options are all available
Increased computer power may be required, but Moore’s Law is on our side here
30. IMAGE CREDITS
Mechagodzilla: http://list25.com/25-famous-fictional-robots-history/
Simply explained: http://geekandpoke.typepad.com/geekandpoke/2012/01/simply-explained-dp.html
Slide rule: https://commons.wikimedia.org/wiki/File:Slide_rule_scales_back.jpg
Brave new world: https://commons.wikimedia.org/wiki/File:IBM_150_Extra_Engineers_1951.jpg
Mechanical Turk 1: https://commons.wikimedia.org/wiki/File:Tuerkischer_schachspieler_windisch4.jpg
Mechanical Turk 2: https://commons.wikimedia.org/wiki/File:Tuerkischer_schachspieler_racknitz3.jpg
Scikit-learn cheat sheet: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Forest: https://commons.wikimedia.org/wiki/File:Forest_Osaka_Japan.jpg
Animal silhouettes: https://clipartfest.com/categories/view/4c03d8ea8a4bc1ffca947c8b8dab48af25908403/african-animal-silhouettes-clipart.html
Giraffe: https://img.clipartfest.com/4294d3fb2739e14cec3845ef668dcdc0_life-size-african-animal-wall-african-animal-silhouettes-clipart_221-203.gif
Gerenuk 1: https://500px.com/cindy_wheeler
Gerenuk 2: http://wordwomanpartialellipsisofthesun.blogspot.co.uk/2015/01/gerenuk-giraffe-necked-gazelle-with.html
Many Gerenuks: http://wordwomanpartialellipsisofthesun.blogspot.co.uk/2015/01/gerenuk-giraffe-necked-gazelle-with.html
Beads: Nick Winograd, Penn State University, USA
Cancer tissue: Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D doi:10.1117/12.2043290
Tetsujin 28: http://goldenani.blogspot.co.uk/2013/01/1963-part-1-on-outside-looking-in.html