
Healthcare Data Analytics with Extreme Tree Models


Healthcare data is messy. Tree-based models provide robust first-cut solutions to such data. I introduce various kinds of trees and how they are different from each other. After understanding these trees, you can build better custom models of your own.



  1. Introduction to Healthcare Data Analytics with Extreme Tree Models • Yubin Park, PhD, Chief Technology Officer
  2. Who am I • Co-founder and Chief Technology Officer of Accordion Health, Inc. • PhD from the University of Texas at Austin • Advisor: Professor Joydeep Ghosh • Studied Machine Learning and Data Mining, with a special focus on healthcare data • Involved in various industry data mining projects • USAA: Lifetime modeling of customers • SK Telecom: Smartphone purchase prediction, usage pattern analysis • LinkedIn Corp.: Related search keywords recommendation • Whole Foods Market: Price elasticity modeling • …
  3. Accordion Health • Healthcare Data Analytics Company • Founded in 2014 by • Sriram Vishwanath, PhD • Yubin Park, PhD • Joyce Ho, PhD • A team of data scientists and medical professionals • Help healthcare organizations lower costs and improve quality (Photo: Health Datapalooza 2014)
  4. Types of Problems We Solve • Which patient is likely to be readmitted? • Which patient is likely to develop type 2 diabetes? • Which patient is likely to adhere to their medication? • How much will this patient cost this year? • How many inpatient admissions will this patient have this year? • Which physician is likely to follow our care guideline? • What star rating will our organization receive this year? • …
  5. Healthcare Data is Messy • Data structure • Unstructured data such as EHR • Structured data such as claims • Location • Doctors’ offices, insurance companies, governments, etc. • Data definition • Different definitions for different communities • Data format • Various industry formats • Data complexity • Patients going in and out of systems • Incomplete data • Regulations & requirements (Source: Health Catalyst)
  6. My Usual Workflow • Summary Statistics → Visual Inspection → Data Cleansing & Feature Engineering (1) → Baseline Models → Extreme Tree Models → Data Cleansing & Feature Engineering (2) → Custom Extreme Tree Models → Data Cleansing & Feature Engineering (3) → Fully Customized Models • I start a data project by checking summary statistics, distributions, and data errors, and by applying simple models. Extreme Tree Models* serve as a checkpoint before further developing customized models. *Extreme Tree Models refer to a class of models that use a tree as a base classifier.
  7. Why Tree-based Models • “Of all the well-known methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining.” • T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning
  8. How to Grow a Tree • 1. Start with a dataset • 2. Pick a splitting feature • 3. Pick a splitting cut-point • 4. Split the dataset into two sets based on the splitting feature and cut-point • 5. Repeat from Step 2 with the partitioned datasets (see the sketch below)
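
     A toy sketch of this recursion in Python. Using squared error as the split criterion is my assumption for concreteness; real implementations are considerably more careful:

        import numpy as np

        def grow_tree(X, y, depth=0, max_depth=3, min_samples=5):
            """Toy regression tree: pick the (feature, cut-point) that minimizes
            squared error, split, and recurse on both partitions."""
            if depth >= max_depth or len(y) < min_samples:
                return {"leaf": float(np.mean(y))}        # stop: predict the mean
            best = None
            for j in range(X.shape[1]):                   # Step 2: candidate features
                for t in np.unique(X[:, j]):              # Step 3: candidate cut-points
                    left, right = y[X[:, j] <= t], y[X[:, j] > t]
                    if len(left) == 0 or len(right) == 0:
                        continue
                    sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
                    if best is None or sse < best[0]:
                        best = (sse, j, t)
            if best is None:
                return {"leaf": float(np.mean(y))}
            _, j, t = best
            mask = X[:, j] <= t                           # Step 4: split the dataset
            return {"feature": j, "cut": float(t),        # Step 5: recurse
                    "left":  grow_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
                    "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}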
  9. Various Kinds of Trees – C4.5, CART • The same five-step growing procedure; what differs is the split criterion used in Steps 2–3: Information Gain → C4.5; Gini Impurity, Variance Reduction → CART. - Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. - Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
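
     For concreteness, the two criteria can be written out directly; these follow the textbook definitions, not any particular library's internals:

        import numpy as np

        def gini_impurity(y):
            """Gini impurity used by CART: 1 - sum_k p_k^2 over class proportions."""
            _, counts = np.unique(y, return_counts=True)
            p = counts / counts.sum()
            return 1.0 - np.sum(p**2)

        def entropy(y):
            """Shannon entropy underlying C4.5's information gain."""
            _, counts = np.unique(y, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))

        def information_gain(y, y_left, y_right):
            """Parent entropy minus the size-weighted entropy of the two children."""
            w = len(y_left) / len(y)
            return entropy(y) - (w * entropy(y_left) + (1 - w) * entropy(y_right))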
  10. Tree → Forest • Randomization Methods • Random data sampling • Random feature sampling • Random cut-point sampling
  11. Various Kinds of Forests – Bagged Trees • The same growing procedure, but sample the dataset with replacement and grow many trees → Bagged Trees. - Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
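
     A minimal sketch of bagging with scikit-learn trees; resample is the same helper imported later in this deck, and the hyperparameters are placeholders:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.utils import resample

        def fit_bagged_trees(X, y, n_trees=100, seed=0):
            """Fit each tree on a bootstrap sample (drawn with replacement)."""
            trees = []
            for b in range(n_trees):
                Xb, yb = resample(X, y, replace=True, random_state=seed + b)
                trees.append(DecisionTreeRegressor().fit(Xb, yb))
            return trees

        def predict_bagged(trees, X):
            """Average the predictions of all trees."""
            return np.mean([t.predict(X) for t in trees], axis=0)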
  12. Various Kinds of Forests – Random Subspace • The same growing procedure, but select a random subset of features, then find the best feature/cut-point among them. - Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
  13. Various Kinds of Forests – Random Forests • Sample the dataset with replacement, select a random subset of features at each split, then find the best feature/cut-point. - Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
  14. Various Kinds of Trees – ExtraTrees • Select a random subset of (feature, cut-point) pairs, then find the best (feature, cut-point) pair. - Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
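
     For reference, scikit-learn exposes these randomization schemes through a few constructor arguments. A rough mapping only: note that Ho's random subspace method samples features once per tree, whereas scikit-learn's max_features resamples at every split, so the first line is an approximation:

        from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

        # Random-subspace flavor: random feature subsets, no bootstrap sampling.
        rs = RandomForestRegressor(n_estimators=100, max_features="sqrt", bootstrap=False)

        # Random Forest: bootstrap sampling + random feature subsets at each split.
        rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", bootstrap=True)

        # ExtraTrees: random cut-points as well; no bootstrap sampling by default.
        et = ExtraTreesRegressor(n_estimators=100, max_features="sqrt")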
  15. Again, Bias vs Variance • Bias: error from the model • Variance: error from the data • Recursive partitioning → fewer samples as the tree grows • Split features/cut-points are susceptible to the training samples • Randomization decreases variance • (Image Source: Scott Fortmann-Roe)
  16. Evolution of Bias vs. Variance - Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
  17. Bias Variance Trade-off • Randomization methods reduce variance • However, for some problems, reducing the bias of a model may be more critical for improving its accuracy • e.g., a very complex dataset with many variables and samples • (Image Source: Scott Fortmann-Roe)
  18. Are Tree Models High-Variance Models? • It depends… • Number of data samples • Number of features • Data complexity • Randomization methods decrease variance, but increase bias • There is another way of decreasing the expected error: one that decreases bias, but may increase variance
  19. Boosting: Learn from Errors • Fit f0 to the target: Y ≈ f0(X), with error E1 = Y − f0(X) • Fit f1 to that error: E1 ≈ f1(X), with error E2 = E1 − f1(X) • Fit f2 to that error: E2 ≈ f2(X), with error E3 = E2 − f2(X) • and so on… • The final model is the sum f0(X) + f1(X) + f2(X) + …
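
     A minimal sketch of this learn-from-errors loop with shallow regression trees; the shrinkage factor lr is a common refinement I am adding, not something on the slide:

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def fit_boosted(X, y, n_rounds=100, lr=0.1):
            """Repeatedly fit a small tree to the current residual E_m."""
            models, resid = [], y.astype(float).copy()
            for _ in range(n_rounds):
                f = DecisionTreeRegressor(max_depth=2).fit(X, resid)
                resid = resid - lr * f.predict(X)   # what remains unexplained
                models.append(f)
            return models

        def predict_boosted(models, X, lr=0.1):
            """The final model is the (shrunken) sum of all stage models."""
            return lr * np.sum([f.predict(X) for f in models], axis=0)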
  20. Additive Model Framework • The Additive Model Framework generalizes boosting, stacking, and other variants • Source: T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning (ESL)
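
     In ESL's notation, the framework fits an additive expansion of basis functions in a forward-stagewise fashion; a compact statement of the two defining equations (symbols as in ESL, Ch. 10):

        f(x) = \sum_{m=1}^{M} \beta_m \, b(x;\gamma_m),
        \qquad
        (\beta_m,\gamma_m) = \arg\min_{\beta,\gamma}
        \sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\bigr)

     With b a decision tree and L squared error, each stage reduces to fitting a tree to the current residuals, which is exactly the loop on the previous slide.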
  21. Gradient Boosting Machine • Additive Models can be numerically optimized via Gradient Descent • Source: Wikipedia and ESL - Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
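
     Schematically, the gradient-descent view from Friedman (2001): fit the m-th tree h_m to the negative gradient of the loss (the pseudo-residuals), then take a step of size ν (the learning rate):

        r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}},
        \qquad
        f_m(x) = f_{m-1}(x) + \nu \, h_m(x)

     For squared-error loss the pseudo-residuals are exactly Y − f_{m−1}(X), recovering the learn-from-errors loop above; ν corresponds to the learning_rate argument in the code later in this deck.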
  22. Extreme Gradient Boosting (XGBoost) • Various data mining competitions on Kaggle • One thing they have in common: they all used XGBoost
  23. What’s so Special about XGBoost • XGBoost implements the basic idea of GBM with some tweaks, such as: • Regularization of base trees • Approximate split finding • Weighted quantile sketch • Sparsity-aware split finding • Cache-aware block structure for out-of-core computation • “XGBoost scales beyond billions of examples using far fewer resources than existing systems.” – T. Chen and C. Guestrin
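
     A minimal usage sketch with the xgboost Python package (assuming it is installed; the parameter values are illustrative, not tuned). reg_lambda is the L2 regularization on leaf weights, and tree_method="approx" selects the approximate split-finding algorithm mentioned above:

        import xgboost as xgb

        model = xgb.XGBRegressor(
            n_estimators=500,
            learning_rate=0.1,
            max_depth=6,
            reg_lambda=1.0,          # regularization of base trees
            tree_method="approx",    # approximate split finding
        )
        model.fit(Xtrain, Ytrain)    # Xtrain/Ytrain as constructed on slide 26
        pred = model.predict(Xtest)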
  24. Going Further Extreme • XGBoost of XGBoost • Bagging of XGBoost • Bagging of XGBoost of XGBoost of … • Stacking, Bagging, Sampling, etc. • Source: Kaggle
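
     One concrete reading of “Bagging of XGBoost”: train several boosted models on bootstrap samples and average their predictions. A sketch under that assumption (settings are placeholders):

        import numpy as np
        import xgboost as xgb
        from sklearn.utils import resample

        def bagged_xgboost(X, y, n_bags=10, seed=0):
            """Average several XGBoost models, each trained on a bootstrap sample."""
            models = []
            for b in range(n_bags):
                Xb, yb = resample(X, y, replace=True, random_state=seed + b)
                m = xgb.XGBRegressor(n_estimators=300, learning_rate=0.1,
                                     max_depth=6, random_state=seed + b)
                models.append(m.fit(Xb, yb))
            return models

        def predict_bagged_xgboost(models, X):
            return np.mean([m.predict(X) for m in models], axis=0)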
  25. Real-world Example: Predict MedAdh Scores • Centers for Medicare and Medicaid Services (CMS) measures the performance of Medicare Advantage (MA) Plans via the Star Rating System • Medication Adherence (MedAdh) is one of the most important quality measures in the Star Rating System • MA Plans want to know how much their MedAdh scores will change in the next two years
  26. Predict MedAdh Scores • Where can I find the data? • Download from the CMS Part C and D Performance Data webpage • Constructing datasets • MedAdh data from 2012, 2013 → Training Features, Xtrain • MedAdh data from 2015 → Training Label, Ytrain • MedAdh data from 2013, 2014 → Test Features, Xtest • MedAdh data from 2016 → Test Label, Ytest
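
     A rough sketch of this dataset construction with pandas. The file names, the contract_id join key, and the medadh_score column are all hypothetical stand-ins for the actual CMS Part C and D files:

        import pandas as pd

        def load_year(year):
            # Hypothetical file name and key; the real CMS files differ.
            return pd.read_csv("medadh_%d.csv" % year, index_col="contract_id")

        # Features: two consecutive years of scores, joined on the plan identifier.
        Xtrain = load_year(2012).join(load_year(2013), lsuffix="_y1", rsuffix="_y2")
        Xtest = load_year(2013).join(load_year(2014), lsuffix="_y1", rsuffix="_y2")
        # Labels: the score two years later (the deck's target may be the score
        # change; the column name here is a placeholder).
        Ytrain = load_year(2015)["medadh_score"].reindex(Xtrain.index)
        Ytest = load_year(2016)["medadh_score"].reindex(Xtest.index)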
  27. Lots of Missing Data • Not all MA plans are measured for a given year → Mean Imputation
        X1,X2,X3,X4,X5,X6,X7,X8,X9,Y
        ...
        71.2,72.7,69.9,75.2,75.9,71.0,1.8
        -999,-999,-999,75.8,72.5,68.8,-4.8
        61.8,59.4,57.7,57.3,59.3,58.3,16.7
        ...
        -999,-999,-999,82.8,80.0,69.8,-11.8
        73.8,73.2,71.8,74.5,76.1,72.9,4.5
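
     A minimal mean-imputation sketch in numpy, treating -999 as the missing-value sentinel shown in the excerpt:

        import numpy as np

        def mean_impute(X, missing=-999):
            """Replace the sentinel with each column's mean over observed values."""
            X = X.astype(float).copy()
            X[X == missing] = np.nan
            col_means = np.nanmean(X, axis=0)
            rows, cols = np.where(np.isnan(X))
            X[rows, cols] = np.take(col_means, cols)
            return X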
  28. Try Various Models • From simple models like Linear Regression and Decision Tree to extreme-tree models such as ExtraTrees and Gradient Boosting
        from sklearn import linear_model
        from sklearn import tree
        from sklearn.utils import resample
        from sklearn.metrics import mean_squared_error
        from sklearn.ensemble import ExtraTreesRegressor
        from sklearn.ensemble import GradientBoostingRegressor
  29. Try Various Models – code snippet
        lm = linear_model.LinearRegression()
        dt = tree.DecisionTreeRegressor()
        etr = ExtraTreesRegressor(n_estimators=100, max_depth=10)
        gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.25, max_depth=8)
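
     The fit-and-score loop connecting these models to the results on the next slide is not shown in the deck; a plausible version, reusing the imports from slide 28:

        import numpy as np

        for name, model in [("lm", lm), ("dt", dt), ("etr", etr), ("gbr", gbr)]:
            model.fit(Xtrain, Ytrain)
            rmse = np.sqrt(mean_squared_error(Ytest, model.predict(Xtest)))
            print("%s: %s" % (name, rmse))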
  30. Try Various Models – results
        $ python test.py
        ...
        RMSE Results
        lm: 2.7125536923
        dt: 3.10460672029
        etr: 2.18597303421
        gbr: 2.02698129388
  31. Try Various Models – results • Extreme Tree Models exhibit significant improvements in accuracy compared to simple models. • One can build more sophisticated models based on the error characteristics of these models.
  32. Contact • yubin [at] accordionhealth [dot] com
