Aug. 17, 2017

Extending the superlearner framework to survival analysis, including boosted regression, random forests, decision trees, Bayesian model averaging, and Morse-Smale regression.

Colleen Farrelly


- 1. Survival Analysis Superlearner: Methods for tracking time-to-event data
- 2. SCOPE OF PROBLEM: Longitudinal data in educational and medical research
  • Many outcomes of interest occur over time in educational and medical settings.
    − A student dropping out of a program
    − Death after treatment for a disease
  • It is of interest to model these outcomes and to understand how various factors and interventions affect them over time.
  • Several parametric and machine learning models exist for these tasks.
    − Survival analysis for persistence (Kaplan-Meier, Cox regression, random forests)
    − Generalized linear mixed models and generalized estimating equations for non-survival longitudinal modeling (various tree methods, as well as parametric models)
- 3. EXTENDING SURVIVAL ANALYSIS: Cox regression with machine learning
  • Random forests
    − Based on bootstrapped trees averaged over an ensemble
    − Allow for survival/censored outcomes (Cox regression)
    − Nonlinear model fitting that can capture complex relationships
    − Provide variable selection and importance scores
  • Regression trees
    − Single-tree method
    − Allows for survival/censored outcomes (Cox regression)
    − Some nonlinear model fitting capability
    − Provides the actual model (splits) as output
  • Boosted regression
    − Linear base learners grown on the previous model's error terms
    − Allows for survival/censored outcomes (Cox regression)
    − Provides model fit parameters
  • Others
    − Bayesian model averaging gives good accuracy, but its interpretability can suffer.
    − Parametric models are useful but cannot forecast.
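For background, the Kaplan-Meier estimator mentioned above can be written out directly. This is a minimal sketch, not code from the presentation:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve: at each event time t_i, multiply the
    running survival estimate by (1 - d_i / n_i), where d_i is the number
    of events at t_i and n_i is the number still at risk."""
    order = np.argsort(times)
    times = np.asarray(times)[order]
    events = np.asarray(events)[order]
    surv = 1.0
    curve = []  # (event time, S(t)) pairs
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)                    # n_i: still under observation
        deaths = np.sum((times == t) & (events == 1))   # d_i: events at this time
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Example: 6 subjects; 1 = event observed, 0 = censored
print(kaplan_meier([2, 3, 3, 5, 8, 8], [1, 1, 0, 1, 0, 1]))
```

Censored subjects drop out of the risk set after their censoring time but never trigger a step in the curve, which is exactly how censoring enters the slides' survival models.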
- 4. COMBINING FORCES: Creating a survival analysis superlearner
  • Superlearning builds on the concept of ensemble models.
    − Ensembles combine multiple models and average/blend them into a final model (similar to stacking).
    − This results in lower model error.
    − Instead of combining trees into a boosted model, superlearners combine multiple machine learning/parametric models into a final boosted model.
    − Theoretical guarantees that error is at least as low as the lowest error among the individual models (prediction at least as good as the best model in the ensemble).
  • This means we get the added value of a tree ensemble, a linear ensemble, and a single-tree model.
    − The component models can be examined to understand the main and interaction effects present in the model, as well as the role of individual variation.
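The stacking idea above can be sketched in Python. This is an illustrative toy on simulated regression data, not the presentation's survival code; the model choices and simulated data below are assumptions:

```python
# Superlearner-style stacking sketch: out-of-fold predictions from the
# component models become features for a meta-learner, so the blend fits
# at least as well as any single component on the training criterion.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)  # 1 true predictor, 4 noise

components = [
    RandomForestRegressor(n_estimators=100, random_state=0),
    DecisionTreeRegressor(max_depth=3, random_state=0),
    GradientBoostingRegressor(random_state=0),
]
# Out-of-fold predictions avoid leaking each model's own training fit
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in components])
meta = LinearRegression().fit(Z, y)  # linear blend of the three components
print("blend weights:", meta.coef_)
```

Because the meta-learner searches over all linear blends, using any single component unchanged is one of its options, which is the intuition behind the "at least as good as the best component" guarantee.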
- 5. ASSESSING THE SURVIVAL SUPERLEARNER: Simulation and testing
  • Simulate 100 sets of 1,000 students with a survival outcome using the genSurv R package.
    − 1 true predictor variable (continuous)
    − 4 noise predictors, binomially distributed
    − Censoring of ~7% of observations based on a uniform distribution
  • Each dataset is split evenly into training and test sets.
  • Random survival forest, tree, and boosted regression models are fit to the training data and used to predict the test data.
  • The component predictions are combined into a final boosted superlearner.
  • Assessment via the MSE of each model and of the combined superlearner.
  • Additional information is explored within each component of the superlearner.

  [Flowchart: Simulate Data → Train (Random Forest, Tree Model, Boosted Regression) → Test → Superlearner]
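A rough Python stand-in for one simulated dataset follows. The slides use the genSurv R package; the exponential hazard and the uniform censoring bound below are assumptions chosen to land near the stated ~7% censoring rate:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
covariate = rng.normal(size=n)               # 1 true continuous predictor
noise = rng.binomial(1, 0.5, size=(n, 4))    # 4 binomially distributed noise predictors
X = np.column_stack([covariate, noise])      # design matrix

# Event times depend only on the true predictor (exponential hazard here;
# the generating model inside genSurv may differ)
event_time = rng.exponential(scale=np.exp(-0.5 * covariate))
# Uniform censoring; the bound 14 * mean is tuned to give roughly 7% censoring
censor_time = rng.uniform(0, 14 * event_time.mean(), size=n)
observed = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(int)  # 1 = event, 0 = censored
print(f"censoring rate: {1 - event.mean():.1%}")

# Even train/test split
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]
```

In the full experiment this would be repeated over 100 datasets, with the component models and the superlearner scored by test-set MSE each time.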
- 6. RESULTS: FIDELITY TO SIMULATED MODEL (superlearner yields best fit; components reveal key predictors)
  • The superlearner gives a better model fit than its best component, boosted regression (t=11.90, p<0.001).
    − The random forest achieves an error rate of 15% with a default of 1,000 trees.
    − The tree model provides useful output about splits (mainly on the true predictor), as well as median survival times for each leaf.
    − The boosted regression model provides parameters and model fit statistics (AIC=1.21, df=2.67).
  • The superlearner model improves prediction accuracy and allows the final model to be examined through each of its components.
    − Provides a good prediction model (results likely accurate).
    − Provides extensive information about main effects and interaction terms in the model.
    − Offers methods with summary graphics for examining the model visually.

  | Component | Average MSE |
  |---|---|
  | Random Forest | 7204 |
  | Tree | 58963 |
  | Boosted Regression | <0.001 |
  | Superlearner | <0.001 |

  | Factor | Coefficient |
  |---|---|
  | Intercept | 1.32 |
  | Status | -0.54 |
  | Covariate | -0.52 |
  | a | -0.15 |
  | b | 0.3 |
  | d | 0.02 |
  | e | 0.05 |
- 7. RESULTS: IN-DEPTH LOOK AT SUPERLEARNER FIT (comparison of components)
  • The superlearner has lower error than any of its components and is a linear combination of them, typically selecting all 3 models.
    − Error rates correspond best between the boosted regression model and the superlearner.
    − Error rates show more spread with the tree model, and even more spread with the random forest model.
      □ Suggests that the random forest model may be hit-or-miss in accuracy.
      □ Understanding this allows for better interpretation of results compared to other models.
      □ The tree model has worse error than the random forest but corresponds better to the final superlearner.
  • Each component is significant in the final model, suggesting that each piece captures something important in the data.
    − Utilizing multiple models helps the algorithm learn and reveals important information in the data that could not be obtained from a single model.

  | Component | MSE Correlation |
  |---|---|
  | Random Forest | -0.31 |
  | Tree | 0.73 |
  | Boosted Regression | 0.93 |
  | Superlearner | 1.00 |
- 8. ALTERNATIVE ALGORITHM: BAYESIAN SURVIVAL MODELS (probabilistic model averaging based on data priors)
  • The model suggests a high probability (p=0.72) of a model with only 1 predictor (the covariate) and good model fit for the 1-predictor model (BIC=-443.84).
    − The 5 best models range from the covariate-only model to the covariate plus 1 noise predictor (p ranges from 0.05 to 0.09 for the noise-inclusive models).
    − These models, the variables they include, and their probability of selection can be visualized nicely (left plot).
    − Posterior distributions of the model coefficients can also be visualized nicely, with a line designating the probability that the predictor's coefficient is zero (right plots).
  • Improves superlearner fit and can replace a fickle tree algorithm that does not always give a prediction for an observation.

  | Component | Times Chosen | Average Coefficient |
  |---|---|---|
  | Covariate | 5 | 2.04 |
  | A | 1 | -0.01 |
  | B | 1 | -0.02 |
  | D | 1 | -0.01 |
  | E | 1 | -0.02 |

  | Component | Average MSE |
  |---|---|
  | Random Forest | 7523 |
  | Boosted Regression | <0.001 |
  | Bayesian Model | <0.001 |
  | Superlearner | <0.001 |
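The model-averaging step can be illustrated with the standard BIC approximation to posterior model probabilities. This is a simplified linear-model sketch, not the Bayesian survival machinery the slides use, so treat it as an analogy:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 3))            # column 0 is the true predictor
y = 2.0 * X[:, 0] + rng.normal(size=n)

def bic(cols):
    """BIC of an OLS fit of y on the given columns (plus intercept)."""
    Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

# Enumerate all predictor subsets; convert BICs to approximate posterior
# model probabilities via p(M | data) ∝ exp(-BIC(M) / 2)
models = [c for r in range(4) for c in itertools.combinations(range(3), r)]
bics = np.array([bic(m) for m in models])
probs = np.exp(-(bics - bics.min()) / 2)
probs /= probs.sum()
best = models[int(np.argmax(probs))]
print("most probable model:", best, f"p={probs.max():.2f}")
```

As on the slide, most of the posterior mass should concentrate on the model containing only the true predictor, with small probabilities on models that add one noise variable.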
- 9. ALTERNATIVE ALGORITHM: MORSE-SMALE REGRESSION (topologically-based piecewise regression)
  • A piecewise regression method based on level sets (similar to the color-coding of mountain heights on maps) and basins of attraction (points sharing the same peaks and valleys).
    − Essentially, this method partitions the data by shared basin of attraction (nearest gradient peak and nearest gradient trough) at different heights of a function (here, survival times).
      □ In the sine wave example, each color shows a basin-of-attraction partition along the wave to which the points on that colored segment belong.
    − An elastic net model is fit to each partition.
  • Good performance (best learner in the model) and separation on the true predictor variable suggest the method's efficacy.
    − Provides nice summary plots of the partitions, as well as gradient plots for each variable along the level sets of the survival function.

  [Figure: sine wave example of basin-of-attraction partitions]

  | Component | Average MSE |
  |---|---|
  | Random Forest | 7523 |
  | Boosted Regression | <0.001 |
  | Bayesian Model | <0.001 |
  | Morse-Smale Regression | <0.001 |
  | Superlearner | <0.001 |
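The partitioning idea can be mimicked on a sampled sine wave in a few lines. This is a rough 1-D approximation (nearest extremum in x rather than true gradient flow), not the actual Morse-Smale algorithm, and the per-partition elastic net is replaced by plain least squares to stay dependency-free:

```python
import numpy as np

x = np.linspace(0, 4 * np.pi, 400)
y = np.sin(x)

# Local extrema of the sampled function (interior points only)
is_max = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
is_min = (y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])
max_idx = np.where(is_max)[0] + 1
min_idx = np.where(is_min)[0] + 1

def nearest(idx_set, i):
    """Index of the extremum nearest to sample i (gradient-flow stand-in)."""
    return idx_set[np.argmin(np.abs(idx_set - i))]

# Each point's basin is labeled by its (nearest peak, nearest trough) pair;
# each unique pair is one Morse-Smale "crystal"
labels = np.array([(nearest(max_idx, i), nearest(min_idx, i)) for i in range(len(x))])
crystals = {tuple(p) for p in labels}
print(f"{len(crystals)} partitions found")

# Piecewise fit: one simple linear model per partition (elastic net in the slides)
for peak, trough in sorted(crystals):
    mask = (labels[:, 0] == peak) & (labels[:, 1] == trough)
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
```

Each partition covers a stretch of the wave flowing toward the same peak and trough, matching the colored segments described in the slide's sine wave figure.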
- 10. CONCLUSIONS: The superlearner framework adds a useful tool for survival analysis
  • The superlearner yields better predictive performance than the individual models.
  • The superlearner retains information from each individual model, allowing examination of predictor-outcome relationships and an estimate of model accuracy based on those relationships.
  • The superlearner allows piecewise visualization of these relationships, as well as insight for Tableau graphics.
  • The superlearner framework is readily extended to survival analysis and can be a useful tool.