This presentation discusses how logistic regression can be improved by incorporating geometric and topological information. It covers several algorithms that do this, including differential geometric LARS (DGLARS) and homotopy-based LASSO. Simulation results show these methods outperform traditional logistic regression and other algorithms on data with noise, overlap between groups, and nonlinear relationships. When applied to a breast cancer dataset, DGLARS and homotopy LASSO performed comparably to complex machine learning models while producing interpretable linear models.
2. PROBLEM OVERVIEW
• Logistic regression is ubiquitous in medical research today.
  • Recovery from a traumatic brain injury
  • Psychiatric outcomes/relapse
  • Development of resistance to HIV medications
• Logistic regression struggles under several types of data conditions:
  • Small sample size relative to number of predictors (n<p)
  • Sparsity (few predictors are related to outcome)
  • Collinearity (variables share lots of variance)
• Machine learning offers several promising solutions:
  • Penalized regression
  • Tree-based methods
  • Boosted regression
  • Neural networks
• Some return an interpretable model with regression coefficients (odds ratios); some do not.
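To ground these failure modes, here is a minimal sketch (with purely illustrative data and penalty settings, not taken from the talk) showing how an unpenalized logistic fit degrades when n is small relative to p with collinear predictors, while an L1-penalized fit stays sparse:

```python
# Minimal sketch of the failure modes above: n close to p, sparse truth, collinearity.
# Assumptions: illustrative data and penalty settings, not taken from the study.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 60, 50
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)      # near-duplicate column (collinearity)
beta = np.zeros(p)
beta[[0, 2, 4]] = [2.0, -2.0, 1.5]                 # sparse truth: only 3 real effects
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

# Effectively unpenalized fit (very weak L2 penalty): coefficients inflate badly
# when the sample size is small relative to the number of predictors.
mle = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
# L1-penalized (LASSO-style) fit: shrinks most noise coefficients exactly to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

print("max |coef|, (almost) unpenalized:", round(float(np.abs(mle.coef_).max()), 1))
print("nonzero coefficients, L1-penalized:", int((lasso.coef_ != 0).sum()))
```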
3. MACHINE LEARNING EXTENSIONS
• Despite the success of machine learning in recent years, algorithm extensions utilizing topological or geometric information have outperformed their base algorithm counterparts.
  • Network analytics (Forman curvature) and network matching via topologically-based metrics
  • Hodge theory extensions of graph-based ranking algorithms
  • Psychometric measure validation via Hausdorff statistics
  • Morse-Smale-based regression
  • Mapper algorithm for clustering/subgroup mining
  • Persistent homology to explore group differences/shape matching problems
  • Conformal mapping for image analytics
• This suggests that machine learning algorithms, such as penalized regression, can benefit from the addition of geometric and topological information.
4. LASSO, RIDGE REGRESSION, AND ELASTIC NET
• Penalized extensions of generalized linear models such as logistic regression:
  • LASSO imposes sparsity by adding a penalty (L1 norm) that shrinks near-zero coefficients to 0.
    • Similar to a cowboy at the origin roping coefficients that get too close
    • Can handle p>n situations well
  • Ridge regression adds a different penalty (L2 norm) to create a robust model.
    • Handles messy geometry (local minima/maxima...) to yield consistent estimators
    • May not impose sparsity on solutions
  • Elastic net combines these penalties to impose sparsity on solutions and yield a robust model.
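For reference (not shown on the slide), the three penalties can be written as penalized logistic log-likelihood objectives; the elastic-net mixing below follows the common glmnet-style parameterization, which is an assumption rather than something stated in the talk:

\begin{align}
\hat{\beta}_{\text{lasso}} &= \arg\min_{\beta}\; \Big\{ -\ell(\beta) + \lambda \lVert \beta \rVert_1 \Big\}, \\
\hat{\beta}_{\text{ridge}} &= \arg\min_{\beta}\; \Big\{ -\ell(\beta) + \lambda \lVert \beta \rVert_2^2 \Big\}, \\
\hat{\beta}_{\text{enet}} &= \arg\min_{\beta}\; \Big\{ -\ell(\beta) + \lambda \big[ \alpha \lVert \beta \rVert_1 + \tfrac{1-\alpha}{2} \lVert \beta \rVert_2^2 \big] \Big\},
\end{align}

where \(\ell(\beta) = \sum_i \big[ y_i x_i^{\top}\beta - \log(1 + e^{x_i^{\top}\beta}) \big]\) is the logistic log-likelihood, \(\lambda \ge 0\) controls shrinkage strength, and \(\alpha \in [0,1]\) mixes the L1 and L2 penalties.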
5. HOMOTOPY-BASED LASSO
• Homotopy arrow example:
  • The red and blue arrows can be deformed into each other by wiggling and stretching the line path, with anchors at the start and finish of the line.
  • The yellow arrow crosses holes and would need to backtrack or break through the surface to freely wiggle into the blue or red line.
• The homotopy method in LASSO wiggles an easy regression path into an optimal regression path.
  • Avoids obstacles that can trap other regression estimators (peaks, valleys, saddles...)
  • Akin to removing obstacles that might hinder the cowboy's ability to rope variables near the origin
• Homotopy as path equivalence
  • An intrinsic property of topological spaces (such as data manifolds)
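To make the path idea concrete, here is a minimal sketch using scikit-learn's lars_path, which traces the Gaussian-response LASSO solution path by homotopy/LARS continuation. This is a stand-in under stated assumptions: the talk's homotopy LASSO operates on logistic regression and was presumably fit with a dedicated solver, and the data below are purely illustrative.

```python
# Minimal sketch: trace a LASSO solution path by homotopy/LARS continuation.
# Assumptions: Gaussian-response LASSO via scikit-learn (the talk's method is a
# homotopy solver for logistic LASSO); data are illustrative, not from the study.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 200, 13
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [1.5, -2.0, 1.0, 0.5]                 # 4 true predictors, 9 noise predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

# The path starts at the fully penalized model (all coefficients zero) and is
# continuously deformed toward the unpenalized fit, with variables entering or
# leaving the active set at each knot of the path.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("active set at the end of the path:", active)
print("coefficients at the least-penalized end:", np.round(coefs[:, -1], 2))
```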
6. DIFFERENTIAL GEOMETRY LARS EXTENSIONS (DGLARS)
• Instead of fitting the model to the data space, fit the model to the model-error tangent space:
  • Deals with collinearity, as parallel vectors share a tangent space (only one of a collinear group is selected)
  • Separates predictors into 3 groups:
    • Selected predictors (small angles relative to the error tangent space)
    • Redundant predictors (share a tangent space with selected predictors)
    • Non-selected predictors (large angles with the tangent space)
• Leverages Rao scoring and Fisher information
  • Important in assessing generalized linear models
  • Yields model fit statistics (BIC, AIC...)
  • Forward selection over a series of all possible models
  • Choose the best model by model fit statistics
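The reference implementation is the R dglars package cited in the editor's notes; as a rough illustration only, the sketch below computes the per-predictor Rao score statistic at the intercept-only logistic fit, which is the angle-like signal dgLARS uses to decide which predictor enters the path first. The function name and the simplified Fisher information are assumptions made for this sketch, not the package's internals.

```python
# Illustrative sketch (assumption: not the dglars R package itself, just the
# per-predictor Rao score statistic that dgLARS uses as its "angle" signal
# under the intercept-only logistic model).
import numpy as np

def rao_score_signals(X, y):
    """Signed Rao score statistic per predictor at the intercept-only logistic fit."""
    p0 = y.mean()                       # fitted probability under the null model
    score = X.T @ (y - p0)              # d(log-likelihood)/d(beta_j) at the null
    w = p0 * (1.0 - p0)                 # logistic variance weight (constant at the null)
    info = w * np.sum(X**2, axis=0)     # diagonal Fisher information (intercept
                                        # cross-terms ignored for simplicity)
    return score / np.sqrt(info)

rng = np.random.default_rng(1)
n, p = 500, 13
X = rng.normal(size=(n, p))
eta = 1.2 * X[:, 0] - 1.0 * X[:, 1]                    # two true effects
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

r = rao_score_signals(X, y)
print("first predictor to enter (largest |Rao score|):", int(np.argmax(np.abs(r))))
```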
7. TESTING SET-UP: ALGORITHMS
• Set 1: main effects models that yield regression coefficients for predictors
  • Logistic regression
  • Elastic net regression
  • Homotopy LASSO
  • DGLARS
  • Boosted regression with linear base learners
  • Multivariate adaptive regression splines (MARS)
  • Bayesian model averaging (BMA)
• Set 2: main effects plus interaction term models that yield regression coefficients
  • Logistic regression
  • Homotopy LASSO
  • Boosted regression with linear base learners
  • MARS
• Set 3: other machine learning models that do not yield regression coefficients or an interpretable model
  • Random forest
  • Extreme gradient boosting (XGBoost)
  • Conditional inference tree
  • Neural network
  • K-nearest neighbors regression
8. SIMULATIONS AND REAL DATA
• Simulations:
  • Comparison of sets 1 and 2
  • 13 predictors (4 true predictors) and binary outcome (0, 1)
    • Linear relationships (4 main effects terms)
    • Nonlinear relationships (2 interaction terms)
    • Mixed relationships (2 main effects, 1 interaction term)
  • Added Gaussian noise and group overlap levels:
    • Low noise (0, 0.25) and no group overlap
    • Medium noise (0, 0.5) and 5-10% group overlap
    • High noise (0, 0.75) and 15-20% group overlap
  • Yielded a total of 9 simulation conditions, each replicated 10 times across sample sizes of 500, 1000, 2500, 5000, and 10000
  • Train/test splits of 70/30 for each trial
• Real dataset:
  • UCI Machine Learning Repository Wisconsin Breast Cancer Dataset (WBCD)
    • 569 individuals with 30 tumor attributes and a binary indicator of malignancy (outcome)
  • Sets 1, 2, and 3 compared
  • 70/30 train/test split
  • Comparison of selected model coefficients across set 1:
    • Reduction of model size
    • Odds ratio comparison of selected terms
    • Overlap of selected terms between models
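As a rough reconstruction of the simulation design, one "main effects, low noise" condition might be generated as follows; the slide does not give the data-generating coefficients, the exact noise convention, or the group-overlap mechanism, so every specific number beyond those stated above is an illustrative assumption.

```python
# Rough sketch of one "main effects, low noise" simulation condition.
# Assumptions: coefficient values, treating (0, 0.25) as N(mean=0, sd=0.25) noise on
# the linear predictor, and omitting an explicit group-overlap mechanism are all
# illustrative choices not specified on the slide.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, p = 1000, 13
X = rng.normal(size=(n, p))

beta = np.zeros(p)
beta[:4] = [1.0, -1.0, 0.8, -0.8]                  # 4 true main-effects predictors
eta = X @ beta + rng.normal(scale=0.25, size=n)    # low-noise condition
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))    # binary outcome (0, 1)

# 70/30 train/test split, as used for every trial in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape, round(float(y_train.mean()), 2))
```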
9. SIMULATION RESULTS
• Main effects trials (left column)
  • Most algorithms perform well.
  • Set 2 homotopy LASSO (stars) is optimal among the algorithms tested.
  • DGLARS (triangles) performs well with low noise/overlap or high noise/overlap.
• Interaction trials (middle column)
  • Set 2 homotopy LASSO performs well across conditions (especially n>2500).
  • DGLARS performs well with low noise/overlap.
  • DGLARS retains an advantage over many main effects models with added noise/overlap.
• Mixed trials (right column)
  • DGLARS outperforms all other set 1/set 2 methods at low noise/overlap conditions.
  • DGLARS retains an advantage over set 1/set 2 algorithms, except set 2 homotopy LASSO.
  • Set 2 homotopy LASSO emerges as the best algorithm with increasing noise/overlap.
• This suggests that incorporating geometry/topology into machine learning algorithms can improve performance on data with:
  • Group overlap
  • Noisy measurements
10. WBCD RESULTS: OVERVIEW OF PERFORMANCE
• Set 1 main effects models perform well.
  • Machine learning methods improve on logistic regression.
  • Elastic net and homotopy LASSO show the best performance overall.
• Set 2 models suggest logistic regression struggles with the large number of predictors relative to sample size.
• Set 3 models demonstrate that DGLARS and homotopy LASSO perform comparably well to nonparametric machine learning models, yielding lower and more balanced error.
11. WBCD ODDS RATIO COMPARISON: SET 1 ALGORITHMS
• Most algorithms reduced the predictor set by more than half.
• Many algorithms struggled with the data geometry (singularities, local optima...), yielding odds ratios of >1000 (set to 10 in the graph).
• Homotopy LASSO offered a solution with finite odds ratio estimates/coefficients.
  • Suggests its potential for solving multivariate regression on messy datasets
  • Yields interpretable, bounded odds ratios when other algorithms fail
12. CONCLUSIONS
• This study suggests the potential for new logistic regression algorithms that incorporate geometric and topological information.
  • DGLARS and homotopy LASSO perform well on simulated data, particularly on messier problems with main effects and interaction terms plus some noise/group overlap.
  • Homotopy LASSO and DGLARS perform well on the WBCD compared to nonparametric machine learning algorithms and produce interpretable linear models.
  • Homotopy LASSO yields finite odds ratios where other regression algorithms fail.
• More work should be done to incorporate geometric/topological methods into existing machine learning algorithms (particularly those based on generalized linear regression with interpretable models).
• Further empirical testing could include:
  • Multinomial regression (3+ category outcomes)
  • Tweedie/Poisson regression (count outcomes)
Editor's Notes
Grant BF, Dawson DA. Age at onset of alcohol use and its association with DSM-IV alcohol abuse and dependence: results from the National Longitudinal Alcohol Epidemiologic Survey. Journal of substance abuse. 1997 Dec 31;9:103-10.
Andrews PJ, Sleeman DH, Statham PF, McQuatt A, Corruble V, Jones PA, Howells TP, Macmillan CS. Predicting recovery in patients suffering from traumatic brain injury by using admission variables and physiological data: a comparison between decision tree analysis and logistic regression. Journal of neurosurgery. 2002 Aug;97(2):326-36.
Pflueger MO, Franke I, Graf M, Hachtel H. Predicting general criminal recidivism in mentally disordered offenders using a random forest approach. BMC psychiatry. 2015 Mar 29;15(1):62.
Sinisi SE, Polley EC, Petersen ML, Rhee SY, van der Laan MJ. Super learning: an application to the prediction of HIV-1 drug resistance. Statistical applications in genetics and molecular biology. 2007 Jan 1;6(1).
Heidema AG, Boer JM, Nagelkerke N, Mariman EC, Feskens EJ. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC genetics. 2006 Apr 21;7(1):23.
Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010 Jan;20(1):101.
Weber M, Saucan E, Jost J. Characterizing Complex Networks with Forman-Ricci curvature and associated geometric flows. arXiv preprint arXiv:1607.08654. 2016 Jul 28.
Lee H, Ma Z, Wang Y, Chung MK. Topological Distances between Networks and Its Application to Brain Imaging. arXiv preprint arXiv:1701.04171. 2017 Jan 16.
Xu Q, Jiang T, Yao Y, Huang Q, Yan B, Lin W. Random partial paired comparison for subjective video quality assessment via HodgeRank. InProceedings of the 19th ACM international conference on Multimedia 2011 Nov 28 (pp. 393-402). ACM.
Wang Y, Shi J, Yin X, Gu X, Chan TF, Yau ST, Toga AW, Thompson PM. Brain surface conformal parameterization with the Ricci flow. IEEE transactions on medical imaging. 2012 Feb;31(2):251-64.
Gerber S, Rübel O, Bremer PT, Pascucci V, Whitaker RT. Morse–smale regression. Journal of Computational and Graphical Statistics. 2013 Jan 1;22(1):193-214.
Lum PY, Singh G, Lehman A, Ishkanov T, Vejdemo-Johansson M, Alagappan M, Carlsson J, Carlsson G. Extracting insights from the shape of complex data using topology. Scientific reports. 2013 Feb 7;3:1236.
Moon C, Giansiracusa N, Lazar N. Persistence Terrace for Topological Inference of Point Cloud Data. arXiv preprint arXiv:1705.02037. 2017 May 4.
Lee H, Ma Z, Wang Y, Chung MK. Topological Distances between Networks and Its Application to Brain Imaging. arXiv preprint arXiv:1701.04171. 2017 Jan 16.
Bendich P, Gasparovic E, Tralie CJ, Harer J. Scaffoldings and Spines: Organizing High-Dimensional Data Using Cover Trees, Local Principal Component Analysis, and Persistent Homology. arXiv preprint arXiv:1602.06245. 2016 Feb 19.
Farrelly CM, Schwartz SJ, Amodeo AL, Feaster DJ, Steinley DL, Meca A, Picariello S. The Analysis of Bridging Constructs with Hierarchical Clustering Methods: An application to identity. Journal of Research in Personality. 2017 Jun 29.
Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301-20.
Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least squares problems. IMA journal of numerical analysis. 2000 Jul 1;20(3):389-403.
Augugliaro L, Mineo AM, Wit EC. Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75(3):471-98.
Augugliaro L, Mineo AM. Using the dglars package to estimate a sparse generalized linear model. In: Advances in Statistical Models for Data Analysis. Springer International Publishing; 2015. p. 1-8.
Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. Journal of the American Statistical Association. 1997 Mar 1;92(437):179-91.
Breiman L. Random forests. Machine learning. 2001 Oct 1;45(1):5-32.
Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical statistics. 2006 Sep 1;15(3):651-74.
Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician. 1992 Aug 1;46(3):175-85.
Bebis G, Georgiopoulos M. Feed-forward neural networks. IEEE Potentials. 1994 Oct;13(4):27-31.
Chen T, Guestrin C. Xgboost: A scalable tree boosting system. InProceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016 Aug 13 (pp. 785-794). ACM.
Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics. 2000;28(2):337-407.
Friedman JH. Multivariate adaptive regression splines. The annals of statistics. 1991 Mar 1:1-67.