Stochastic gradient boosting is the algorithm that underlies the TreeNet application. Discussing it at a Salford conference is like bringing coal to Newcastle, so I won't embarrass myself by explaining it in detail. The algorithm exhibits several characteristics that are attractive for EMR (and most any) datasets.
The vision: a streamlined sequence of processes
- EMR – data captured as part of routine clinical workflow
- Automated E-T-L tools
- Predictive model – machine-learning algorithms for target-class prediction
- Vendor-'neutral' scoring – intranet based, JSON serialization
- Decision support at the point of care – creation within the clinical workflow
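The vendor-neutral scoring step of such a pipeline can be sketched in a few lines. The field names and model-version string below are illustrative assumptions, not the actual production schema:

```python
import json

def serialize_score(patient_id, probability, model_version="hf-readmit-demo"):
    """Package one readmission risk score as vendor-neutral JSON.
    Field names are hypothetical placeholders for illustration."""
    payload = {
        "patient_id": patient_id,
        "target": "readmission_30d",
        "risk_score": round(probability, 4),
        "model_version": model_version,
    }
    return json.dumps(payload)

# A downstream decision-support tool would parse this message.
message = serialize_score("HF-0001", 0.2371)
```

Because the payload is plain JSON, any EMR vendor's intranet tooling can consume it without linking against the modeling software.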
Agenda / Table of contents
1. Readmission after Heart Failure
2. Data Structure of an Electronic Medical Record
3. TreeNet™ Modeling with our Dataset
4. Lessons Learned and Next Step(s)
Model Accuracy
- Completeness of set
- Feature selection
- Feature fit
- Target class
Kattan MW. Eur Urol. 2011;59:566-567.
Model Error: The Bias-Variance Decomposition
Prediction Error = Irreducible Error + Bias² + Variance
Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer; 2009.
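A small simulation makes the decomposition concrete. The sine target, noise level, and deliberately underfit linear model below are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3            # noise SD, so irreducible error = sigma**2
x0 = 1.0               # point at which we decompose the error

# Fit many straight lines to noisy sin(x) samples and record each
# fit's prediction at x0; a straight line underfits, creating bias.
preds = []
for _ in range(2000):
    x = rng.uniform(0.0, 3.0, 30)
    y = np.sin(x) + rng.normal(0.0, sigma, x.size)
    slope, intercept = np.polyfit(x, y, 1)
    preds.append(slope * x0 + intercept)
preds = np.array(preds)

bias_sq = (preds.mean() - np.sin(x0)) ** 2   # Bias²
variance = preds.var()                       # Variance
pred_error = sigma**2 + bias_sq + variance   # total expected squared error
```

A more flexible model would shrink `bias_sq` at the cost of raising `variance`; the `sigma**2` term cannot be reduced by any model.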
Model caveats
- Association does not prove causality.
- Models are retrospective (observational) and therefore hypothesis generating (i.e., not hypothesis proving).
Congestive Heart Failure
- Common cause for admission; readmission in excess of 23%. (Bueno H, et al. JAMA 2010;303:2141-2147)
- Risk factors for readmission extensively studied; published reviews cite over 120 studies.
  - Methods: logistic regression; Cox proportional hazards
  - C-statistic in the 0.6-0.7 range
- Reduction of readmission has been declared a national goal.
- Improved risk models have the potential to deploy targeted disease management more effectively.
EMR data structure
- Data collected for clinical workflow
- Large volume
  - Multiple observations; repeated measures
  - Many interactions and interdependencies
- Complex dataset
  - Continuous, ordinal, nominal (low and high order), binary
  - 'High-order, variable-dimension nominal variables'
- Missing data: may represent error or practice patterns
- Unbalanced classes
- Outliers and entry errors
Preliminary Dataset
- 1612 consecutive heart failure discharges abstracted
- 1280 candidate predictors screened
- Target class: readmission at 30 days (binary)

Administrative candidate predictors:
- Admission source, status, service
- Age, gender, race
- Primary/secondary payers
- Primary/secondary diagnoses (names and condition categories)
- Total length of stay, ICU length of stay
- Hospital costs and charges
- Discharge status and disposition
- All-cause same-center admission in preceding year

Clinical candidate predictors:
- Specialty medical services consulted
- Specialty ancillary services consulted
- Blood laboratory values
- Medication names / therapeutic classes
- Dosages of medications
- Patient weights during hospitalization
- Transfusions during hospitalization
- Nursing assessments
- Education topics
- Diagnostic tests ordered
- Ordersets utilized

Preliminary Unpublished Data
Benefits of Stochastic Gradient Boosting
Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis 2002;38(4):367-378.

Input and processing:
- Does not require data transformation
- Handles large numbers of categorical and continuous variables
- Has mechanisms for:
  - Feature selection
  - Managing missing values
  - Assessing the relationship of predictors to target
- Robust to data entry errors, outliers, and target misclassification

Output:
- High model accuracy
- Classification and regression
- Non-parametric application of logistic, L1, L2, or Huber-M loss function
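TreeNet itself is commercial software, but Friedman's algorithm is also available open source. A minimal sketch using scikit-learn's GradientBoostingClassifier on synthetic stand-in data (the real EMR extract cannot be shared; all parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with ~23% positives, echoing the readmission rate.
X, y = make_classification(n_samples=2000, n_features=25, n_informative=8,
                           weights=[0.77, 0.23], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(
    subsample=0.5,      # row sampling: the "stochastic" in SGB
    max_leaf_nodes=6,   # small trees, like the 2- and 6-node runs later
    n_estimators=300,
    learning_rate=0.05,
    random_state=0,
).fit(X_tr, y_tr)       # default loss is the logistic (log) loss

auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Unbalanced classes and mixed variable types need no special preprocessing here, which is part of the algorithm's appeal for EMR data.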
TreeNet™ Modeling with our Dataset
1. Parameters of 'feature fit'
2. Parameters of 'feature selection'
3. Elements of insight
4. Putting it all together
TreeNet – parameters of ‘feature fit’ Do not forget the manual …..
Feature selection – variable importance
Variable importance calculation: a variable's squared relative importance is the sum of the squared improvements (in squared-error risk) over all internal nodes for which it was chosen as the splitting variable.
Measurements are relative; it is customary to set the largest to 100 and scale the other variables accordingly.
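The same split-improvement importance is exposed by scikit-learn's gradient boosting implementation; a sketch on synthetic data, with the customary rescaling to 100:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ accumulates each variable's (squared-error risk)
# improvement over every internal node where it was the splitting
# variable, then normalizes the values to sum to 1.
raw = gbm.feature_importances_

# Customary presentation: largest importance set to 100, rest scaled.
scaled = 100.0 * raw / raw.max()
```

Because the scores are relative, only their ranking and rough ratios are interpretable, not the absolute magnitudes.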
Insight into the model
Illuminating the 'black box' with partial dependence
Preliminary Unpublished Data
Approach to feature selection
Domain 'neutral' vs. domain 'centric'

Domain neutral:
- Use all potential predictors
- Start with a subset based on univariate significance (i.e., P-value below a given level) or variance above a given threshold
- Forward and backward stepwise progression

Both:
- Univariate statistics
- Application of variable importance
- Screening with batteries of predictors

Domain centric:
- Know your data
- Use knowledge of target and predictors to make decisions on inclusion (or rejection)
Model Variability
Establishing AUC precision and accuracy

The model is fit via a sampling (i.e., stochastic) process, so repeated fits vary.
S.E.M. = S.D. / sqrt(N)
Precision (95%) ≈ 4 × S.E.M.

With S.D. = 0.03:
N trials   S.E.M.   Precision (95%)
10         .0095    .038
30         .0055    .022
300        .0017    .007
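The table's entries follow directly from the two formulas; a quick arithmetic check:

```python
from math import sqrt

sd = 0.03  # observed SD of AUC across repeated stochastic fits

def sem(n_trials):
    """Standard error of the mean AUC over n_trials repeated fits."""
    return sd / sqrt(n_trials)

def precision95(n_trials):
    # full width of an approximate 95% interval: +/- 2 SEM = 4 SEM
    return 4 * sem(n_trials)

rows = {n: (round(sem(n), 4), round(precision95(n), 3))
        for n in (10, 30, 300)}
```

The practical point: averaging over 30 repeated fits narrows the 95% interval on AUC from about .04 to about .02, and 300 fits to well under .01.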
Precision and predictor selection

[Figure: test and average ROC across stepwise steps, STEP_1 (197) (0.531) through STEP_66 (737) (0.703); min = 0.5057, median = 0.6738, mean = 0.6500, max = 0.7034]

AUC estimated using CV-10 (= 10 trials): SEM .0095 and precision (95%) of .038.
Repeating CV-10 (using the CVR battery) 30 times: SEM .0017 and precision (95%) of .007.
This has a profound implication for the dimensionality of model achievable without domain-knowledge input.
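The repeated-CV idea can be sketched as follows (synthetic data, and fewer repetitions than the 30 used above, to keep the run short):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)
gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Each CV-10 run yields 10 fold-level AUCs; repeating with new fold
# splits multiplies N and shrinks the SEM of the mean AUC.
aucs = []
for rep in range(3):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
    aucs.extend(cross_val_score(gbm, X, y, cv=cv, scoring="roc_auc"))
aucs = np.array(aucs)

sem = aucs.std(ddof=1) / np.sqrt(len(aucs))
```

With a tight SEM, small AUC differences between candidate predictor sets become distinguishable from sampling noise.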
How much of a change in AUC is clinically relevant?
The gain curve complements the ROC curve.
Preliminary Unpublished Data
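A gain curve is computed by ranking patients from highest to lowest predicted risk and asking what share of readmissions is captured in each top fraction. A small sketch with toy numbers:

```python
import numpy as np

def gain_curve(y_true, scores):
    """Cumulative share of events (readmissions) captured when patients
    are ranked from highest to lowest predicted risk."""
    order = np.argsort(scores)[::-1]
    captured = np.cumsum(np.asarray(y_true)[order])
    frac_patients = np.arange(1, len(y_true) + 1) / len(y_true)
    frac_events = captured / captured[-1]
    return frac_patients, frac_events

# Toy example: a well-ranked model captures events early.
y = [1, 1, 0, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
pts, evts = gain_curve(y, s)
# Here the top half of patients by risk captures all three events.
```

Unlike AUC, the gain curve answers the operational question directly: if a disease-management program can enroll only the top X% of patients, what fraction of readmissions does it reach?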
Useful batteries for feature selection
Methods of forward and backward selection
- STEPWISE (forward) and SHAVING (backward)
- Testing set to CV-10
- Select predictors 1-2 at a time
- Confirm with CVR battery
BUILDING A MODEL
This is a multi-step process.

Step 1: Run the model with all candidate predictors; select the N highest-importance predictors (N = 2-3 × final size).
Step 2: Run batteries to assess parameters of feature 'fit'; assess model (AUC) variability; repeat as needed throughout the process.
Step 3: Use backward and forward selection to reduce the preliminary model to a core of 5-15 predictors.**
Step 4: Review predictors and use domain knowledge to eliminate redundant (dependent) predictors and to consider predictors of known value.**
Step 5: Re-examine discarded predictors in smaller groups, using backward and forward selection.**

** Each change confirmed with CVR (30 reps); review partial dependence plot.
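The forward-selection portion of this process might be sketched like this (synthetic data; the stopping rule, fold count, and one-at-a-time additions are illustrative choices, not TreeNet's STEPWISE battery):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=250, n_features=8, n_informative=3,
                           random_state=0)
gbm = GradientBoostingClassifier(n_estimators=30, random_state=0)

def cv_auc(cols):
    """Mean cross-validated AUC using only the given predictor columns."""
    return cross_val_score(gbm, X[:, cols], y, cv=5,
                           scoring="roc_auc").mean()

# Forward stepwise: repeatedly add the single predictor that most
# improves CV AUC; stop when no remaining candidate helps.
selected, remaining, best = [], list(range(X.shape[1])), 0.5
while remaining:
    scores = {j: cv_auc(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best:
        break
    best, selected = scores[j_best], selected + [j_best]
    remaining.remove(j_best)
```

In practice each accepted change would also be confirmed against repeated-CV variability, since a greedy step can easily chase noise when the SEM of the AUC estimate is large.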
Initial runs
Information content and irreducible error

[Figure: train and test cross-entropy vs. number of trees (0-1500) for a 6-node model (#287, test cross-entropy plateau ≈ 0.576) and a 2-node model (#880, test cross-entropy plateau ≈ 0.577)]

Preliminary Unpublished Data
Sample Model

[Figure: gain curve and ROC curve for the feature-selection set and the model-training set]

Preliminary Unpublished Data
Sample partial dependence plots
The value of non-parametric regression

[Figure: partial dependence plots for admissions within prior year, ICU days, anion gap, initial systolic BP, final BNP, and BUN-creatinine ratio]

Preliminary Unpublished Data
Prospective application
Additional heart failure discharges can be scored against the model.

[Figure: gain curve and ROC curve]

Possible causes for performance shift:
- Overfitting in the original model
- Concomitant intervention programs are altering patients' risk of readmission

Preliminary Unpublished Data
Non-influential candidate predictors
Models favor continuous variables over binary 'dummy' variables.
- Diagnoses and QualNet condition categories
- Medications and therapeutic categories
- Diagnostic tests
- Ordersets submitted

Preliminary Unpublished Data
Lessons learned
- TreeNet (stochastic gradient boosting) is extremely well suited to the data structure of EMR data.
- Insight into the dataset is a rich feature (over and above prediction performance).
- Model performance variance is important in feature selection.
  - A consequence of the limited information content in our dataset.
- Batteries are useful:
  - PARTITION – variability assessment
  - CVR – model assessment
  - STEPWISE – forward selection
  - SHAVING – backward selection
- There is great value in learning on a non-trivial dataset within a familiar domain.
Next steps……
- Explore options to manage model variability and increase the dimensionality of the predictor set.
- Extend the analysis of predictor interactions.
- Develop a mechanism for 'point-of-care' patient scoring.
- Apply these techniques to new problems and datasets.