LOGISTIC REGRESSION:
GEOMETRIC AND
TOPOLOGICAL
CONSIDERATIONS
By Colleen M. Farrelly
PROBLEM OVERVIEW
•Logistic regression is ubiquitous in medical research today.
• Recovery from a traumatic brain injury
• Psychiatric outcomes/relapse
• Development of resistance to HIV medications
•Logistic regression struggles under several types of data conditions:
• Small sample size relative to number of predictors (n<p)
• Sparsity (few predictors are related to outcome)
• Collinearity (variables share lots of variance)
•Machine learning offers several promising solutions:
• Penalized regression
• Tree-based methods
• Boosted regression
• Neural networks
• Some return an interpretable model with regression coefficients (odds ratios); some do not.
MACHINE LEARNING EXTENSIONS
•Despite the success of machine learning in
recent years, algorithm extensions utilizing
topological or geometric information have
outperformed their base algorithm
counterparts.
• Network analytics (Forman curvature) and network
matching via topologically-based metrics
• Hodge theory extensions of graph-based ranking
algorithms
• Psychometric measure validation via Hausdorff
statistics
• Morse-Smale-based regression
• Mapper algorithm for clustering/subgroup mining
• Persistent homology to explore group
differences/shape matching problems
• Conformal mapping for image analytics
•This suggests that machine learning
algorithms, such as penalized regression, can
benefit from the addition of geometric and
topological information.
LASSO, RIDGE REGRESSION, AND ELASTIC NET
•Penalized extensions of generalized linear models such as logistic regression:
• LASSO imposes sparsity by adding a penalty (L-1 norm) that shrinks near-zero coefficients to 0.
• Similar to a cowboy at the origin roping coefficients that get too close
• Can handle p>n situations well
• Ridge regression adds a different penalty (L-2 norm) to create a robust model.
• Handles messy geometry (local minima/maxima…) to yield consistent estimators
• May not impose sparsity on solutions
• Elastic net combines these penalties to impose sparsity on solutions and yield a robust model (all three penalties are sketched below).
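As a hedged illustration of the three penalties, the sketch below fits L1-, L2-, and elastic-net-penalized logistic regressions with scikit-learn; the toy data, solver, and regularization settings are assumptions for demonstration, not the configurations used in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for a sparse, partly noisy design (illustrative assumption only).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# LASSO-style penalty (L-1 norm): shrinks many coefficients exactly to 0.
lasso = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000).fit(X, y)

# Ridge-style penalty (L-2 norm): shrinks coefficients but rarely zeroes them out.
ridge = LogisticRegression(penalty="l2", solver="saga", C=0.5, max_iter=5000).fit(X, y)

# Elastic net: mixes both penalties via l1_ratio.
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                          C=0.5, max_iter=5000).fit(X, y)

for name, model in [("LASSO", lasso), ("ridge", ridge), ("elastic net", enet)]:
    print(name, "nonzero coefficients:", int(np.sum(model.coef_ != 0)))
```

Counting the nonzero coefficients makes the practical difference visible: the L1 and elastic-net fits typically drop most predictors, while the ridge fit keeps them all with shrunken weights.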
HOMOTOPY-BASED LASSO
• Homotopy arrow example
• Red and blue arrows can be deformed
into each other by wiggling and
stretching the line path with anchors
at start and finish of line.
• Yellow arrow crosses holes and would
need to backtrack or break to the
surface to freely wiggle into the blue
or red line.
• Homotopy method in LASSO wiggles an easy regression path into an optimal regression path (see the path sketch below).
• Avoids obstacles that can trap other
regression estimators (peaks, valleys,
saddles…)
• Akin to removing obstacles that might
hinder the cowboy’s ability to rope
variables near the origin
• Homotopy as path
equivalence
• Intrinsic property of
topological spaces (such
as data manifolds)
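A minimal sketch of the path idea, assuming scikit-learn's lars_path as a stand-in: it traces the full LASSO coefficient path with a LARS/homotopy-style algorithm on a least-squares loss, so it illustrates the continuous deformation of the solution path rather than the exact homotopy LASSO implementation used in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import lars_path

# Illustrative data; the path idea is clearest on the least-squares LASSO path.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=1)

# method="lasso" follows the LASSO solution path: the fully penalized all-zero solution
# is continuously deformed toward the unpenalized fit as the penalty relaxes.
alphas, active, coefs = lars_path(X, y.astype(float), method="lasso")

print("number of knots along the path:", len(alphas))
print("nonzero coefficients at the least-penalized end:", int(np.sum(coefs[:, -1] != 0)))
```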
DIFFERENTIAL GEOMETRY LARS EXTENSIONS (DGLARS)
•Instead of fitting the model to the data space, fit the model to the model error tangent space:
• Deals with collinearity, as parallel vectors share a tangent space (only one member of a collinear group is selected)
• Separates predictors into 3 groups:
• Set of selected predictors (small angles relative to the error tangent space)
• Set of redundant predictors (share a tangent space with selected predictors)
• Set of non-selected predictors (large angles with the tangent space)
•Leverages Rao scoring and Fisher information
• Important in assessing generalized linear models
• Yields model fit statistics (BIC, AIC…)
• Forward selection builds a series of candidate models (see the sketch below)
• Choose the best model by model fit statistics
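DGLARS itself is available in the R dglars package; the Python sketch below is only a stand-in for the generic "forward selection scored by a fit statistic" step described above, using statsmodels' BIC for logistic models rather than the differential-geometric angle criterion.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_classification

# Illustrative data; DGLARS uses a tangent-space angle criterion (R package dglars),
# so this sketch only shows forward selection scored by BIC.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=2)

selected, remaining = [], list(range(X.shape[1]))
best_bic = np.inf

while remaining:
    # Score every one-variable extension of the current model by BIC.
    scores = []
    for j in remaining:
        design = sm.add_constant(X[:, selected + [j]])
        result = sm.Logit(y, design).fit(disp=0)
        scores.append((result.bic, j))
    bic, j = min(scores)
    if bic >= best_bic:   # stop when no addition improves the fit statistic
        break
    best_bic = bic
    selected.append(j)
    remaining.remove(j)

print("selected predictors:", selected, "BIC:", round(best_bic, 1))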
TESTING SET-UP: ALGORITHMS
•Set 1
• Main effects models that yield regression
coefficients for predictors
• Logistic regression
• Elastic net regression
• Homotopy LASSO
• DGLARS
• Boosted regression with linear base learners
• Multivariate adaptive regression splines (MARS)
• Bayesian model averaging (BMA)
•Set 2
• Main effects plus interaction term models
that yield regression coefficients
• Logistic regression
• Homotopy LASSO
• Boosted regression with linear base learners
• MARS
•Set 3
• Other machine learning models which do not yield regression coefficients or an interpretable model (several are sketched below)
• Random forest
• Extreme gradient boosting (XGBoost)
• Conditional inference tree
• Neural network
• K-nearest neighbors regression
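For reference, a hedged sketch of several Set 3 models in scikit-learn, fit to the Wisconsin breast cancer data described on the next slide; gradient boosting stands in for XGBoost, conditional inference trees are omitted (no standard scikit-learn implementation), and all hyperparameters are illustrative assumptions rather than the study's settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# scikit-learn ships the 569-case, 30-feature Wisconsin breast cancer data; 70/30 split as in the study.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "gradient boosting (XGBoost stand-in)": GradientBoostingClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "neural network": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(accuracy_score(y_te, model.predict(X_te)), 3))
```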
SIMULATIONS AND REAL DATA
•Simulations:
• Comparison of sets 1 and 2
• 13 predictors (4 true predictors) and
binary outcome (0,1)
• Linear relationships (4 main effects terms)
• Nonlinear relationships (2 interaction terms)
• Mixed relationships (2 main effects, 1
interaction term)
• Added Gaussian noise and group
overlap levels:
• Low noise (0,0.25) and no group overlap
• Medium noise (0,0.5) and 5-10% group overlap
• High noise (0,0.75) and 15-20% group overlap
• Yielded a total of 9 simulation conditions, each replicated 10 times at sample sizes of 500, 1000, 2500, 5000, and 10000 (one condition is sketched below)
• Train/test splits of 70/30 for each trial
•Real dataset:
• UCI Machine Learning Repository
Wisconsin Breast Cancer Dataset
(WBCD)
• 569 individuals with 30 tumor attributes and
binary indicator of malignancy (outcome)
• Sets 1, 2, and 3 compared
• 70/30 train/test split
• Comparison of selected model
coefficients across set 1:
• Reduction of model size
• Odds ratio comparison of selected terms
• Overlap of selected terms between models
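A minimal sketch of one simulated condition (main effects only, medium noise), assuming specific coefficient values and a threshold-based binary outcome, since the exact generating model is not spelled out on the slide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, p = 2500, 13                       # one of the study's sample sizes; 13 predictors, 4 true

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [1.0, -1.0, 0.75, -0.75]   # illustrative main-effect coefficients (assumed values)

# Medium-noise condition: Gaussian noise added to the linear predictor before thresholding.
linear_predictor = X @ beta + rng.normal(0.0, 0.5, size=n)
y = (linear_predictor > 0).astype(int)   # binary outcome (0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("train size:", len(y_tr), "test size:", len(y_te), "event rate:", round(y.mean(), 3))
```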
SIMULATION RESULTS
•Main effects trials (left column)
• Most algorithms perform well.
• Set 2 homotopy LASSO (stars) is optimal
among algorithms tested.
• DGLARS (triangles) performs well at both low and high noise/overlap levels.
•Interaction trials (middle column)
• Set 2 Homotopy LASSO performs well across
conditions (especially n>2500).
• DGLARS performs well with low
noise/overlap.
• DGLARS retains advantage over many main
effects models with added noise/overlap.
•Mixed trials (right column)
• DGLARS outperforms all other set 1/set 2
methods at low noise/overlap conditions
• DGLARS retains advantage over set 1/set 2
algorithms, except set 2 homotopy LASSO.
• Set 2 homotopy LASSO emerges as best
algorithm with increasing noise/overlap.
•This suggests that incorporating
geometry/topology into machine
learning algorithms can improve
performance on data with:
• Group overlap
• Noisy measurements
WBCD RESULTS: OVERVIEW OF PERFORMANCE
•Set 1 main effects models perform well.
• Machine learning methods
improve logistic regression.
• Elastic net and homotopy
LASSO show the best
performance overall.
•Set 2 models suggest
logistic regression struggles
with the large number of
predictors relative to
sample size.
•Set 3 models demonstrate
that DGLARS and homotopy
LASSO perform comparably
well to nonparametric
machine learning models,
yielding lower and more
balanced error.
WBCD ODDS RATIO COMPARISON: SET 1 ALGORITHMS
•Most algorithms reduced the predictor set by more than half.
•Many algorithms
struggled with data
geometry (singularities,
local optima…), yielding
odds ratios of >1000
(set to 10 in graph).
•Homotopy LASSO
offered a solution with
finite odds ratio
estimates/coefficients.
• Suggests its potential for
solving multivariate
regression on messy
datasets
• Yields interpretable,
bounded odds ratios when
other algorithms fail
CONCLUSIONS
•This study suggests the potential for new logistic regression
algorithms that incorporate geometric and topological information.
• DGLARS and homotopy LASSO perform well on simulated data, particularly on
messier problems with main effects and interaction terms with some noise/group
overlap.
• Homotopy LASSO and DGLARS perform well on the WBCD compared to nonparametric machine learning algorithms and produce interpretable linear models.
• Homotopy LASSO yields finite odds ratios where other regression algorithms fail.
•More work should be done to incorporate geometric/topological
methods into existing machine learning algorithms (particularly those
based on generalized linear regression with interpretable models).
•Further empirical testing could include:
• Multinomial regression (3+ category outcomes)
• Tweedie/Poisson regression (count outcomes)

Editor's Notes

  1. Grant BF, Dawson DA. Age at onset of alcohol use and its association with DSM-IV alcohol abuse and dependence: results from the National Longitudinal Alcohol Epidemiologic Survey. Journal of substance abuse. 1997 Dec 31;9:103-10. Andrews PJ, Sleeman DH, Statham PF, McQuatt A, Corruble V, Jones PA, Howells TP, Macmillan CS. Predicting recovery in patients suffering from traumatic brain injury by using admission variables and physiological data: a comparison between decision tree analysis and logistic regression. Journal of neurosurgery. 2002 Aug;97(2):326-36. Pflueger MO, Franke I, Graf M, Hachtel H. Predicting general criminal recidivism in mentally disordered offenders using a random forest approach. BMC psychiatry. 2015 Mar 29;15(1):62. Sinisi SE, Polley EC, Petersen ML, Rhee SY, van der Laan MJ. Super learning: an application to the prediction of HIV-1 drug resistance. Statistical applications in genetics and molecular biology. 2007 Jan 1;6(1). Heidema AG, Boer JM, Nagelkerke N, Mariman EC, Feskens EJ. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC genetics. 2006 Apr 21;7(1):23. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010 Jan;20(1):101.
  2. Weber M, Saucan E, Jost J. Characterizing Complex Networks with Forman-Ricci curvature and associated geometric flows. arXiv preprint arXiv:1607.08654. 2016 Jul 28. Lee H, Ma Z, Wang Y, Chung MK. Topological Distances between Networks and Its Application to Brain Imaging. arXiv preprint arXiv:1701.04171. 2017 Jan 16. Xu Q, Jiang T, Yao Y, Huang Q, Yan B, Lin W. Random partial paired comparison for subjective video quality assessment via HodgeRank. InProceedings of the 19th ACM international conference on Multimedia 2011 Nov 28 (pp. 393-402). ACM. Wang Y, Shi J, Yin X, Gu X, Chan TF, Yau ST, Toga AW, Thompson PM. Brain surface conformal parameterization with the Ricci flow. IEEE transactions on medical imaging. 2012 Feb;31(2):251-64. Gerber S, Rübel O, Bremer PT, Pascucci V, Whitaker RT. Morse–smale regression. Journal of Computational and Graphical Statistics. 2013 Jan 1;22(1):193-214. Lum PY, Singh G, Lehman A, Ishkanov T, Vejdemo-Johansson M, Alagappan M, Carlsson J, Carlsson G. Extracting insights from the shape of complex data using topology. Scientific reports. 2013 Feb 7;3:1236. Moon C, Giansiracusa N, Lazar N. Persistence Terrace for Topological Inference of Point Cloud Data. arXiv preprint arXiv:1705.02037. 2017 May 4. Lee H, Ma Z, Wang Y, Chung MK. Topological Distances between Networks and Its Application to Brain Imaging. arXiv preprint arXiv:1701.04171. 2017 Jan 16. Bendich P, Gasparovic E, Tralie CJ, Harer J. Scaffoldings and Spines: Organizing High-Dimensional Data Using Cover Trees, Local Principal Component Analysis, and Persistent Homology. arXiv preprint arXiv:1602.06245. 2016 Feb 19. Farrelly CM, Schwartz SJ, Amodeo AL, Feaster DJ, Steinley DL, Meca A, Picariello S. The Analysis of Bridging Constructs with Hierarchical Clustering Methods: An application to identity. Journal of Research in Personality. 2017 Jun 29.
  3. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.
  4. Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least squares problems. IMA journal of numerical analysis. 2000 Jul 1;20(3):389-403.
  5. Augugliaro, L., Mineo, A. M., & Wit, E. C. (2013). Differential geometric least angle regression: a differential geometric approach to sparse generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 471-498. Augugliaro, L., & Mineo, A. M. (2015). Using the dglars Package to Estimate a Sparse Generalized Linear Model. In Advances in Statistical Models for Data Analysis (pp. 1-8). Springer International Publishing.
  6. Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. Journal of the American Statistical Association. 1997 Mar 1;92(437):179-91. Breiman L. Random forests. Machine learning. 2001 Oct 1;45(1):5-32. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical statistics. 2006 Sep 1;15(3):651-74. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician. 1992 Aug 1;46(3):175-85. Bebis G, Georgiopoulos M. Feed-forward neural networks. IEEE Potentials. 1994 Oct;13(4):27-31. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. InProceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016 Aug 13 (pp. 785-794). ACM. Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics. 2000;28(2):337-407. Friedman JH. Multivariate adaptive regression splines. The annals of statistics. 1991 Mar 1:1-67.
  7. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)