Predicting Hospital Readmission Using TreeNet

  • Speaker note: Stochastic Gradient Boosting is the algorithm that underlies the TreeNet application. Discussing it at a Salford conference is like bringing coal to Newcastle, so I won't embarrass myself. It exhibits several characteristics that are attractive for EMR (and most any) datasets.

Presentation Transcript

  • Predicting Hospital Readmission using TreeNet™ – Robert Aronoff, MD
  • The vision ….. A streamlined sequence of processes:
    EMR – capture data entered as part of routine clinical workflow
    Predictive Model Creation – automated E-T-L processes; machine learning algorithms for target class prediction
    Decision Support at Point of Care – vendor ‘neutral’ scoring tools; intranet based; JSON serialization
    Clinical Workflow
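The JSON-serialization step of such a vendor-neutral scoring service could look roughly like the sketch below. The field names, model identifier, and risk threshold are illustrative assumptions, not details from the presentation:

```python
import json

def serialize_score(patient_id, probability, threshold=0.5):
    """Package one patient's model score as JSON for an intranet
    decision-support client (hypothetical payload shape)."""
    payload = {
        "patient_id": patient_id,
        "model": "readmission-30d",          # assumed model identifier
        "probability": round(probability, 4),
        "high_risk": probability >= threshold,
    }
    return json.dumps(payload)

print(serialize_score("A1001", 0.3172))
```

Keeping the payload flat and self-describing like this is what makes the scoring step "vendor neutral": any EMR front end that can parse JSON can consume it.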
  • Agenda / Table of contents
    1 Readmission after Heart Failure
    2 Data Structure of an Electronic Medical Record
    3 TreeNet™ Modeling with our Dataset
    4 Lessons Learned and Next Step(s)
  • Data Modeling Paradigm (© Salford Systems, 2011)
  • Model Accuracy: Completeness of Set, Feature Selection, Feature Fit, Target Class. Kattan MW. Eur Urol. 2011;(59):566-567.
  • Model Error: The Bias-Variance Decomposition. Prediction Error = Irreducible Error + Bias² + Variance. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer; 2009.
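The decomposition can be checked numerically. The toy setup below is my own illustration (not from the slides): a deliberately biased constant predictor is refit on many resampled training sets, and Bias² + Variance + Irreducible Error is compared with a direct estimate of expected squared prediction error at one test point:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # true signal
    return np.sin(x)

sigma = 0.5                    # noise SD, so irreducible error = sigma**2
x0, n_train, n_reps = 1.0, 30, 2000

# Refit the model on many independent training draws; the model predicts
# the global mean of y, which is biased at x0 but has low variance.
preds = []
for _ in range(n_reps):
    x = rng.uniform(0, 2, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    preds.append(y.mean())
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
irreducible = sigma ** 2

# Direct Monte-Carlo estimate of E[(y0 - prediction)^2] at x0
y0 = f(x0) + rng.normal(0, sigma, n_reps)
direct = ((y0 - preds) ** 2).mean()

print(bias_sq + variance + irreducible, direct)  # the two should nearly agree
```

The three components are exactly the slide's terms: shrinking any one of them (e.g. variance via TreeNet's averaging over many small trees) lowers prediction error, while the sigma² term is a floor no model can beat.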
  • Model caveats: Association does not prove causality. Models are retrospective (observational) and therefore hypothesis generating (i.e. not hypothesis proving).
  • Congestive Heart Failure
    Common cause for admission; readmission in excess of 23%. Bueno H, et al. JAMA 2010;303:2141-2147.
    Risk factors for readmission extensively studied; published reviews cite over 120 studies.
    - Methods: logistic regression; Cox proportional hazards
    - C-statistic in the 0.6-0.7 range
    Reduction of readmission has been declared a national goal. Improved risk models have the potential to deploy targeted disease management more effectively.
  • EMR data structure
    Data collected for clinical workflow.
    Large volume
    - Multiple observations; repeated measures
    - Many interactions and interdependencies
    Complex dataset
    - Continuous, ordinal, nominal (low and high order), binary
    - ‘High-order variable-dimension nominal variables’
    Missing data
    - May represent error or practice patterns
    Unbalanced classes
    Outliers and entry errors
  • Preliminary Dataset
    - 1612 consecutive heart failure discharges abstracted
    - 1280 candidate predictors screened
    - Target class: readmission at 30 days (binary)
    Administrative candidate predictors:
    • Admission source, status, service
    • Age, gender, race
    • Primary/secondary payers
    • Primary/secondary diagnoses (names and condition categories)
    • Total length of stay, ICU length of stay
    • Hospital costs and charges
    • Discharge status and disposition
    • All-cause same-center admission in preceding year
    Clinical candidate predictors:
    • Specialty medical services consulted
    • Specialty ancillary services consulted
    • Blood laboratory values
    • Medication names / therapeutic classes
    • Dosages of medications
    • Patient weights during hospitalization
    • Transfusions during hospitalization
    • Nursing assessments
    • Education topics
    • Diagnostic tests ordered
    • Ordersets utilized
    Preliminary Unpublished Data
  • Benefits of Stochastic Gradient Boosting
    Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis 2002;38(4):367-378.
    Input and processing:
    - Does not require data transformation
    - Handles large numbers of categorical and continuous variables
    - Has mechanisms for feature selection, managing missing values, and assessing the relationship of predictors to target
    - Robust to data entry errors, outliers, and target misclassification
    Output:
    - High model accuracy
    - Classification and regression
    - Non-parametric application of logistic, L1, L2, or Huber-M loss functions
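For intuition, here is a bare-bones sketch of Friedman's stochastic gradient boosting: squared-error loss, one-variable decision stumps as the base learner, and a random row subsample per iteration (the "stochastic" part). This is an illustration of the algorithm, not TreeNet's implementation:

```python
import random

def fit_stump(x, residual):
    """Best single split on x minimizing squared error of the residuals."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= split]
        right = [r for xi, r in zip(x, residual) if xi > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda xi: lm if xi <= split else rm

def boost(x, y, n_trees=50, lr=0.1, subsample=0.7, seed=0):
    rng = random.Random(seed)
    f0 = sum(y) / len(y)                 # initial model: the target mean
    stumps, pred = [], [f0] * len(y)
    for _ in range(n_trees):
        # stochastic step: fit each stump on a random subsample of rows
        idx = rng.sample(range(len(y)), int(subsample * len(y)))
        residual = [y[i] - pred[i] for i in idx]
        stump = fit_stump([x[i] for i in idx], residual)
        stumps.append(stump)
        # shrunken update of the additive model
        pred = [p + lr * stump(xi) for p, xi in zip(pred, x)]
    return lambda xi: f0 + lr * sum(s(xi) for s in stumps)

x = [i / 10 for i in range(40)]
y = [0.0 if xi < 2.0 else 1.0 for xi in x]   # a step function to learn
model = boost(x, y)
```

Because each stump only ever asks "is x ≤ split?", the method is invariant to monotone transformations of the inputs and indifferent to outliers in x, which is the root of the robustness properties listed above.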
  • TreeNet™ Modeling with our Dataset
    1 Parameters of ‘feature fit’
    2 Parameters of ‘feature selection’
    3 Elements of insight
    4 Putting it all together
  • TreeNet – parameters of ‘feature fit’ Do not forget the manual …..
  • Feature selection – variable importance
    Variable Importance Calculation
    - Squared relative improvement is the sum of squared improvements (in squared-error risk) over all internal nodes for which the variable was chosen as the splitting variable.
    - Measurements are relative; it is customary to set the largest to 100 and scale the other variables accordingly.
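The scaling convention takes only a few lines. The raw squared-improvement values below are invented for illustration:

```python
# Hypothetical raw importances (sums of squared split improvements per variable)
raw = {"BUN_CR_RATIO": 0.042, "PRIOR_ADMITS": 0.031, "ICU_DAYS": 0.012}

# Rescale so the most important variable scores 100
top = max(raw.values())
scaled = {name: round(100 * v / top, 1) for name, v in raw.items()}
print(scaled)  # {'BUN_CR_RATIO': 100.0, 'PRIOR_ADMITS': 73.8, 'ICU_DAYS': 28.6}
```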
  • Insight into the model: illuminating the ‘black box’ with partial dependence. Preliminary Unpublished Data
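One-variable partial dependence is simple to compute by hand: clamp the feature of interest at each grid value and average the model's predictions over the rest of the dataset. The model and rows below are stand-ins, not the presentation's actual model:

```python
def partial_dependence(model, rows, j, grid):
    """Average prediction over the dataset with feature j fixed at each grid value."""
    curve = []
    for v in grid:
        preds = []
        for row in rows:
            row = list(row)
            row[j] = v                  # clamp feature j
            preds.append(model(row))
        curve.append(sum(preds) / len(preds))
    return curve

# Toy model: risk rises with feature 0 (say, prior admissions) plus a nuisance term
model = lambda r: 0.1 * r[0] + 0.01 * r[1]
rows = [[0, 5], [1, 2], [2, 9], [3, 4]]
pd_curve = partial_dependence(model, rows, 0, grid=[0, 1, 2, 3])
print(pd_curve)  # rises linearly in feature 0; the nuisance term averages out
```

This is exactly why the plots "illuminate the black box": the curve shows the marginal shape the model learned for one predictor, with everything else integrated over the data.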
  • Approach to feature selection: Domain ‘Neutral’ vs. Domain ‘Centric’
    Domain Neutral:
    - Use all potential predictors
    - Application of Variable Importance
    - Screening with batteries of predictors
    - Forward and backward stepwise progression
    Both:
    - Start with a subset based on univariate significance (i.e. P-value below a given level) or variance above a given threshold
    Domain Centric:
    - Know your data
    - Univariate stats
    - Use knowledge of target and predictors to make decisions on inclusion (or rejection)
  • Model Variability: establishing AUC precision and accuracy
    Variation: the model is fit via a sampling (i.e. stochastic) process. S.D. = 0.03
    Accuracy / Precision: S.E.M. = S.D. / sqrt(N); Precision (95%) ≈ 4 × S.E.M.
    N trials:        10     30     300
    S.E.M.:          .0095  .0055  .0017
    Precision (95%): .038   .022   .007
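The slide's table follows directly from S.E.M. = S.D. / sqrt(N), with the 95% precision taken as the full ±2 S.E.M. width (4 × S.E.M.); a few lines reproduce it:

```python
from math import sqrt

sd = 0.03                       # observed SD of AUC across repeated model fits
for n in (10, 30, 300):
    sem = sd / sqrt(n)          # standard error of the mean AUC over n trials
    print(n, round(sem, 4), round(4 * sem, 3))
# 10  0.0095 0.038
# 30  0.0055 0.022
# 300 0.0017 0.007
```

The practical point is the 1/sqrt(N) shrinkage: averaging 30 repeated CV runs instead of one cuts the uncertainty by a factor of about 5.5.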
  • Precision and predictor selection
    [Plot: Avg. ROC and Test ROC by selection step, STEP_1 (197) (0.531) through STEP_66 (737) (0.703); Min = 0.5057, Median = 0.6738, Mean = 0.6500, Max = 0.7034]
    AUC estimated using CV-10 (= 10 trees) → S.E.M. .0095 and precision (95%) of .038
    Repeating CV-10 (using CVR battery) 30 times → S.E.M. .0017 and precision (95%) of .007
    Profound implication for the dimensionality of model achievable without domain-knowledge input.
  • How much of a change in AUC is clinically relevant? The gain curve complements the ROC curve. Preliminary Unpublished Data
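A gain (cumulative lift) curve can be computed by ranking patients by predicted risk and accumulating the share of actual readmissions captured as more of the ranked list is included. The scores and labels below are made up for illustration:

```python
def gain_curve(scores, labels):
    """curve[k-1] = fraction of all positives found among the top-k ranked patients."""
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda t: -t[0])]
    total = sum(ranked)
    captured, curve = 0, []
    for l in ranked:
        captured += l
        curve.append(captured / total)
    return curve

scores = [0.9, 0.8, 0.4, 0.3, 0.2]   # hypothetical predicted risks
labels = [1, 1, 0, 1, 0]             # actual 30-day readmissions
print(gain_curve(scores, labels))    # captures 1/3, 2/3, 2/3, 3/3, 3/3 of readmissions
```

Unlike the ROC curve, this answers the operational question directly: if a disease-management program can only enroll the top 20% of patients, the gain curve reads off what fraction of readmissions that slice contains.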
  • Useful batteries for feature selection: methods of forward and backward selection
    STEPWISE
    - Testing set to CV-10
    - Select predictors 1-2 at a time
    - Confirm with CVR battery
    SHAVING
  • BUILDING A MODEL – this is a multi-step process
    Step 1: Run model with all candidate predictors. Select the N highest-importance predictors, N = 2-3 × final size.
    Step 2: Run batteries to assess parameters of feature ‘fit’. Assess model (AUC) variability. Repeat as needed through the process.
    Step 3: Use backward and forward selection to reduce the preliminary model to a core 5-15 predictors.**
    Step 4: Review predictors and use domain knowledge to eliminate redundant (dependent) predictors and consider predictors of known value.**
    Step 5: Re-examine discarded predictors in smaller groups. Use backward and forward selection.**
    ** Each change confirmed with CVR (30 reps). Review partial dependence plot.
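The backward "shaving" portion of this loop can be schematized as follows. `cv_auc` is a stand-in scoring function (not TreeNet's CVR battery), and the predictor names and AUC contributions are invented: repeatedly drop the feature whose removal hurts the score least, stopping once every remaining feature earns its keep:

```python
def cv_auc(features):
    """Stand-in for a cross-validated AUC: baseline plus invented per-feature signal."""
    signal = {"BUN_CR_RATIO": 0.08, "PRIOR_ADMITS": 0.06, "ICU_DAYS": 0.03}
    return 0.55 + sum(signal.get(f, 0.0) for f in features)

def shave(features, tol=0.005):
    features = list(features)
    while len(features) > 1:
        base = cv_auc(features)
        # try removing each feature; keep the removal that costs the least AUC
        trials = [(base - cv_auc([f for f in features if f != g]), g)
                  for g in features]
        loss, worst = min(trials)
        if loss > tol:
            break               # every remaining feature earns its keep
        features.remove(worst)
    return features

print(shave(["BUN_CR_RATIO", "PRIOR_ADMITS", "ICU_DAYS", "AGE", "GENDER"]))
```

In practice the tolerance would be set from the AUC precision established above, so that a "harmless" removal means a drop smaller than the measurement noise.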
  • Initial runs: information content and irreducible error
    [Plots: Cross Entropy (Train and Test) vs. Number of Trees (0-1500) for a 6-node model (#287) and a 2-node model (#880); test cross entropy plateaus near 0.576-0.577 in both]
    Preliminary Unpublished Data
  • Sample partial dependence plots: the value of non-parametric regression
    [Panels: Admissions within prior year; ICU Days; Anion Gap; Initial Systolic BP; Final BNP; BUN-Creatinine Ratio]
    Preliminary Unpublished Data
  • Prospective application
    Additional heart failure discharges can be scored against the model. [GAIN curve; ROC curve] Preliminary Unpublished Data
    Causes for performance shift:
    - Overfitting in the original model
    - Concomitant intervention programs are altering patient risk of readmission
  • Non-influential candidate predictors
    Models favor continuous over binary ‘dummy’ variables:
    - Diagnoses and QualNet Condition Categories
    - Medications and Therapeutic Categories
    - Diagnostic Tests
    - Ordersets Submitted
    Preliminary Unpublished Data
  • Lessons learned
    TreeNet (stochastic gradient boosting) is extremely well suited to the structure of EMR data.
    Insight into the dataset is a rich feature (over and above prediction performance).
    Model performance variance is important in feature selection.
    - A consequence of the limited information content in our dataset.
    Batteries are useful:
    - PARTITION – variability assessment
    - CVR – model assessment
    - STEPWISE – forward selection
    - SHAVING – backward selection
    There is great value in learning on a non-trivial dataset within a familiar domain.
  • Next steps ……
    Explore options to manage model variability and increase the dimensionality of the predictor set.
    Extend analysis of predictor interactions.
    Develop a mechanism for ‘point-of-care’ patient scoring.
    Apply techniques to new problems and datasets.
  • Any Questions?