Extending the superlearner framework to survival analysis. Includes boosted regression, random forests, decision trees, Bayesian model averaging, and Morse-Smale regression.
2. SCOPE OF PROBLEM
Longitudinal data in educational and medical research
• Many outcomes of interest occur over time in educational and medical settings.
− Students dropping out of a program
− Deaths after treatment for a disease
• It is of interest to model these outcomes and understand how various factors and interventions impact them over time.
• Several parametric and machine learning models exist for these tasks.
− Survival analysis for persistence (Kaplan-Meier, Cox regression, random forests)
− Generalized linear mixed models and generalized estimating equations for non-survival longitudinal modeling (various tree methods, as well as parametric models)
3. EXTENDING SURVIVAL ANALYSIS
Cox regression with machine learning
• Random forest
− Based on bootstrapped trees averaged over an ensemble
− Allows for survival/censored outcomes (Cox regression)
− Nonlinear model fitting that can capture complex relationships
− Provides variable selection and importance scores
• Regression trees
− Single-tree method
− Allows for survival/censored outcomes (Cox regression)
− Some nonlinear model fitting capability
− Provides the actual model (splits) as output
• Boosted regression
− Linear base-learners grown on the previous model's error terms
− Allows for survival/censored outcomes (Cox regression)
− Provides model fit parameters
• Others
− Bayesian model averaging gives good accuracy, but its interpretability can suffer.
− Parametric models are useful but cannot forecast.
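A minimal sketch of fitting the three component learners to right-censored data in R, assuming the randomForestSRC, rpart, and mboost packages; train is a hypothetical data frame holding the predictors plus time and status columns.

```r
library(survival)
library(randomForestSRC)  # random survival forest
library(rpart)            # single survival tree (method = "exp" is chosen automatically for Surv outcomes)
library(mboost)           # gradient boosting with linear base-learners

# 'train' with columns time, status, and predictors is assumed to exist.
rf_fit    <- rfsrc(Surv(time, status) ~ ., data = train, ntree = 1000)
tree_fit  <- rpart(Surv(time, status) ~ ., data = train)
boost_fit <- glmboost(Surv(time, status) ~ ., data = train, family = CoxPH())
```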
4. COMBINING FORCES
Creating a survival analysis superlearner
• Superlearning builds on the concept of ensemble models.
− Ensembles combine multiple models and average/blend them into a final model (similar to stacking).
− This results in lower model error.
− Instead of combining trees into a boosted model, a superlearner combines multiple machine learning/parametric models into a final boosted model.
− Theoretical guarantees that the error is at least as low as the lowest error among the individual models (prediction at least as good as the best model in the ensemble).
• This means we gain the added value of a tree ensemble, a linear ensemble, and a single-tree model (see the stacking sketch below).
− The components can be examined to understand the main and interaction effects in the model, as well as the role of individual variation.
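A conceptual sketch of the stacking step, assuming the component fits rf_fit, tree_fit, and boost_fit from the previous sketch and a hypothetical held-out set test. A plain linear meta-learner stands in for the boosted blend described on the poster, and regressing on observed times ignores censoring; both are simplifications.

```r
# Component predictions on the held-out set (scales differ; the
# meta-learner's coefficients absorb the rescaling).
meta_x <- data.frame(
  rf    = predict(rf_fit, newdata = test)$predicted,
  tree  = predict(tree_fit, newdata = test),
  boost = as.numeric(predict(boost_fit, newdata = test))
)

# Blend the components; the coefficients act as superlearner weights,
# and each component's significance can be read from the summary.
meta_fit <- lm(test$time ~ ., data = meta_x)
summary(meta_fit)
sl_pred  <- predict(meta_fit)
```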
5. ASSESSING THE SURVIVAL SUPERLEARNER
Simulation and testing
• Simulate 100 sets of 1000 students with a survival outcome using the genSurv R package.
− 1 true predictor variable (continuous)
− 4 binomially distributed noise predictors
− ~7% of observations censored based on a uniform distribution
• Each dataset is split evenly into training and test sets.
• Random survival forest, tree, and boosted regression models are fit to the training data and used to predict the test data.
• The component predictions are combined into a final boosted superlearner.
• Assessment via the MSE of each component model and of the combined superlearner.
• Additional information is explored within each component of the superlearner.
[Flow diagram: Simulate Data → Train (Random Forest, Tree Model, Boosted Regression) → Test → Superlearner]
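The poster generates its data with the genSurv package; the following base-R sketch produces one equivalent replicate under the stated design (all names are illustrative).

```r
set.seed(1)
n <- 1000
covariate <- runif(n)                     # 1 true continuous predictor
noise <- replicate(4, rbinom(n, 1, 0.5))  # 4 binomial noise predictors
colnames(noise) <- c("a", "b", "d", "e")

event <- rexp(n, rate = exp(2 * covariate))  # hazard depends only on the true predictor
cens  <- runif(n, 0, 6.2)                    # uniform censoring; bound chosen so ~7% are censored
time   <- pmin(event, cens)
status <- as.integer(event <= cens)

dat   <- data.frame(time, status, covariate, noise)
idx   <- sample(n, n / 2)                 # even train/test split
train <- dat[idx, ]
test  <- dat[-idx, ]
```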
6. RESULTS: FIDELITY TO SIMULATED MODEL
Superlearner yields best fit; components reveal key predictors
• The superlearner gives a better model fit than its best component, boosted regression (t = 11.90, p < 0.001).
− The random forest achieves an error rate of 15% with the default of 1000 trees.
− The tree model provides useful output about splits (mainly on the true predictor), as well as median survival times for each leaf.
− The boosted regression model provides parameters and model fit statistics (AIC = 1.21, df = 2.67).
• The superlearner model improves prediction accuracy and allows the final model to be examined through each of its components.
− Provides a good prediction model (results likely accurate).
− Provides extensive information about main effects and interaction terms in the model.
− Offers methods with summary graphics to examine the model visually.
Component            Average MSE
Random Forest        7204
Tree                 58963
Boosted Regression   <0.001
Superlearner         <0.001
Factor      Coefficient
Intercept   1.32
Status      -0.54
Covariate   -0.52
a           -0.15
b           0.30
d           0.02
e           0.05
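The headline comparison could be computed as below, assuming a hypothetical data frame mse holding the per-replicate test MSEs of each component across the 100 simulations.

```r
colMeans(mse)                             # average MSE per component, as tabled above
t.test(mse$boost, mse$sl, paired = TRUE)  # superlearner vs. its best component
```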
7. RESULTS: IN-DEPTH LOOK AT SUPERLEARNER FIT
Comparison of components
• The superlearner has lower error than any of its components and is a linear combination of those components, typically selecting all 3 models.
− Error rates correspond best between the boosted regression model and the superlearner.
− Error rates show more spread with the tree model, and even more spread with the random forest model.
□ Suggests that the random forest model may be hit-or-miss with respect to accuracy.
□ Understanding this allows for better interpretation of results compared with other models.
□ The tree model has worse error than the random forest but corresponds better to the final superlearner.
• Each component is significant in the final model, suggesting that each piece captures something important in the data.
− Utilizing multiple models helps the algorithm learn and reveals important information in the data that could not be obtained from a single model.
Component            MSE Correlation
Random Forest        -0.31
Tree                 0.73
Boosted Regression   0.93
Superlearner         1.00
8. ALTERNATIVE ALGORITHM: BAYESIAN SURVIVAL MODELS
Probabilistic model averaging based on data priors
• The model suggests a high probability (p = 0.72) for a model with only 1 predictor (the covariate) and good fit for that 1-predictor model (BIC = -443.84).
− The 5 best models range from the covariate-only model to the covariate plus 1 noise predictor (p ranges from 0.05 to 0.09 for the noise-inclusive models).
− These models, the variables they include, and their probability of selection can be visualized nicely (left plot).
− Posterior distributions of the model coefficients can also be visualized nicely, with a line designating the probability that the predictor's coefficient is zero (right plots).
• This improves the superlearner fit and can replace a fickle tree algorithm that does not always give a prediction for an observation.
Component   Times Chosen   Average Coefficient
Covariate   5              2.04
A           1              -0.01
B           1              -0.02
D           1              -0.01
E           1              -0.02
Component            Average MSE
Random Forest        7523
Boosted Regression   <0.001
Bayesian Model       <0.001
Superlearner         <0.001
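A hedged sketch of Bayesian model averaging for the survival outcome, assuming the BMA package's bic.surv() and the simulated train set from the earlier sketch (column names are illustrative). The poster's left and right plots correspond to the package's model-selection and posterior-coefficient visualizations.

```r
library(BMA)

X <- train[, c("covariate", "a", "b", "d", "e")]
bma_fit <- bic.surv(X, surv.t = train$time, cens = train$status)

summary(bma_fit)        # posterior model probabilities and BIC per model
imageplot.bma(bma_fit)  # variables included in the top models (left plot)
```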
9. ALTERNATIVE ALGORITHM: MORSE-SMALE REGRESSION
Topologically based piecewise regression
• A piecewise regression method based on level sets (similar to the color-coding of mountain heights on maps) and basins of attraction (points sharing the same peaks and valleys).
− Essentially, this method partitions the data based on shared basins of attraction (nearest gradient peak and nearest gradient trough) at different heights of a function (here, survival times); a conceptual sketch follows at the end of this slide.
□ See the sine wave example in the figure at left: each color shows a basin-of-attraction partition along the sine wave to which the points on that colored line belong.
− An elastic net model is then fit to each partition.
Component               Average MSE
Random Forest           7523
Boosted Regression      <0.001
Bayesian Model          <0.001
Morse-Smale Regression  <0.001
Superlearner            <0.001
[Figure: sine wave example of basin-of-attraction partitions]
• Good performance (best learner in the model) and separation on the true predictor variable suggest the method's efficacy.
− Provides nice summary plots of the partitions, as well as gradient plots for each variable along the level sets of the survival function.
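A conceptual sketch of the partition-then-fit idea, not the actual msr package used for Morse-Smale regression in R: points are assigned to basins of attraction by nearest-neighbor gradient ascent/descent on the response, and an elastic net (via glmnet) is fit within each partition. The function names and the choice of k are illustrative.

```r
library(FNN)     # k-nearest neighbors
library(glmnet)  # elastic net

# Follow each point uphill (or downhill) through its k nearest neighbors
# until a local extremum of y is reached; the endpoint labels the basin.
knn_extremum <- function(x, y, k = 10, ascend = TRUE) {
  nn <- FNN::get.knn(x, k = k)$nn.index
  follow <- function(i) {
    repeat {
      nbrs <- c(i, nn[i, ])
      nxt <- nbrs[if (ascend) which.max(y[nbrs]) else which.min(y[nbrs])]
      if (nxt == i) return(i)  # local peak/trough reached
      i <- nxt
    }
  }
  vapply(seq_len(nrow(x)), follow, integer(1))
}

# Partition by shared (peak, trough) pair, then fit an elastic net per
# partition; very small partitions would need merging (omitted here).
morse_smale_fit <- function(x, y, k = 10, alpha = 0.5) {
  part <- interaction(knn_extremum(x, y, k, TRUE),
                      knn_extremum(x, y, k, FALSE), drop = TRUE)
  lapply(split(seq_along(y), part), function(idx)
    glmnet(x[idx, , drop = FALSE], y[idx], alpha = alpha))
}
```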
10. CONCLUSIONS
The superlearner framework adds a useful tool for survival analysis
• The superlearner yields better predictive performance than the individual models.
• The superlearner keeps information from each individual model, allowing examination of predictor-outcome relationships and estimation of model accuracy based on those relationships.
• The superlearner allows for piecewise visualization of these relationships, as well as insight for Tableau graphics.
• The superlearner framework is readily extended to survival analysis and can be a useful tool.